scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

API compatibility of co-occurrence encoder #324

Closed wcbeard closed 3 years ago

wcbeard commented 3 years ago

I recently wrote an encoder that takes pairs of columns, generates a co-occurrence matrix, and runs SVD on it to reduce the dimensionality.

I have been working on a PR, but now I am realizing that it may not fit the API here. Instead of taking individual columns and mapping each one to a new column, it takes a pair of columns and maps them to multiple new columns. For example, if you use SVD to reduce the co-occurrence matrix to five dimensions, each column in the pair gets mapped to five new columns.

import pandas as pd

# CoocurrenceEncoder is the proposed encoder from the work-in-progress PR.
xsamp = [
    ("a", "b", 1),
    ("a", "b", 1),
    ("a", "d", 1),
    ("z", "d", 1),
    ("c", "b", 1),
]
xsamp_df = pd.DataFrame(xsamp, columns=["aa", "bb", "cc"])

# Encode the ("aa", "bb") pair into 5 components per column of the pair.
coe = CoocurrenceEncoder(col_pairs=[("aa", "bb")], n_components=5).fit(xsamp_df)
X2 = coe.transform(xsamp_df)
X2
=>
   cc   aa__bb1   aa__bb2   aa__bb3   aa__bb4  aa__bb5   bb__aa1   bb__aa2       bb__aa3   bb__aa4   bb__aa5
0   1  0.486512  0.350539  0.246833  0.439192  0.00000  0.213629  0.000000  2.277219e-16  0.849826  0.000000
1   1  0.486512  0.350539  0.246833  0.439192  0.00000  0.213629  0.000000  2.277219e-16  0.849826  0.000000
2   1  0.486512  0.350539  0.246833  0.439192  0.00000  0.316185  0.419931  0.000000e+00  0.000000  0.047791
3   1  0.000000  0.433212  0.003929  0.000000  2.49231  0.316185  0.419931  0.000000e+00  0.000000  0.047791
4   1  0.000000  0.000000  0.000000  0.354154  0.00000  0.213629  0.000000  2.277219e-16  0.849826  0.000000
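
For context, the co-occurrence matrix behind this example is just a contingency table of the two columns. A quick way to inspect it (an illustration only, not the PR's implementation) would be:

import pandas as pd

xsamp_df = pd.DataFrame(
    [("a", "b", 1), ("a", "b", 1), ("a", "d", 1), ("z", "d", 1), ("c", "b", 1)],
    columns=["aa", "bb", "cc"],
)

# Count how often each label of "aa" appears together with each label of "bb".
print(pd.crosstab(xsamp_df["aa"], xsamp_df["bb"]))
# bb  b  d
# aa
# a   2  1
# c   1  0
# z   0  1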

Do you see any way this could be made to fit in this repo, or is the format too different?

PaulWestenthanner commented 3 years ago

Hi @wcbeard

Could you provide more detail on how your algorithm works? Is there a paper describing it?
Some questions I have when reading your example:

  1. The co-occurrence matrix is 3x2; how can you "reduce" the dimensionality to 5x5?
  2. The co-occurrence matrix is symmetrical, so why do we add two sets of columns for aa-bb and bb-aa, and why do they differ?
  3. How exactly do I get the mapping for the labels of a and b? Is it the rows/columns of the left/right orthogonal matrix?
  4. Why is this a good encoder? Why are correlation scores with other categorical variables a good predictor? I've never seen a similar approach (e.g. just using the Cramér's V score). Why does this preserve the predictive power of a feature better than mapping each label to one (or n_components) random numbers? Off the top of my head I can't think of a convincing argument.
wcbeard commented 3 years ago
  1. Good point; a better example would use higher-cardinality categorical variables. The vectors here come from what you'd get by running the factorization on a matrix of smaller dimensions:
import numpy as np
import sklearn.decomposition as dc

# Even on a 2x3 matrix, NMF will happily return 10 components per row.
dc.NMF(n_components=10).fit_transform(np.array([[2, 1, 0], [1, 0, 1]]))

Out[7]:
array([[1.96833465e+00, 2.45553355e+00, 0.00000000e+00, 3.59352116e+00,
        0.00000000e+00, 0.00000000e+00, 8.81498780e-04, 7.40528238e-01,
        1.26965131e-03, 0.00000000e+00],
       [1.03898403e+00, 0.00000000e+00, 7.74788494e-01, 1.53559444e-02,
        5.91203856e-01, 6.52982072e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.80639046e-02]])

In regular usage it would only be useful to set n_components to something smaller than the cardinality of the column.

  2. Setting n_components to 5 for the co-occurrence matrix of variables a and b gives a 5-dimensional vector for each variable. The convention I'm using here is that column a__b1 holds the first dimension of the vector for a, derived from its co-occurrence matrix with variable b. You also get the reverse from the same co-occurrence matrix: a vector for variable b based on its co-occurrences with variable a.

I mainly did this because for my own purposes I wanted both versions, and it was a cheaper way than specifying both CoocurrenceEncoder(col_pairs=[("aa", "bb")]) and CoocurrenceEncoder(col_pairs=[("bb", "aa")]) since I can reuse the same co-occurrence matrix. But it might be a good idea to only keep one of those for a less confusing API.

  3. Yes, I believe so. I get the a => a__b mapping from nmf.components_, and the b => b__a mapping from running nmf.transform on the co-occurrence matrix (see the sketch at the end of this comment).

  4. Hmmm... good question. If I had to guess, I'd make an analogy to recommender systems. Assuming the process that jointly generates both variables can be explained by a smaller number of latent variables (like topics or genres), I'd expect this method to recover them, and the recovered factors to be more useful than the one-hot encoded versions (or random IDs).

Sorry, I don't know of a paper. I can say empirically it significantly improved scores on the Amazon employee dataset above what mean encoding alone was able to get me.
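
To make answers 2 and 3 concrete, here is a rough sketch of how both directions can be read off a single factorization. This is only an illustration using plain pandas and scikit-learn, not the code in the PR, and the orientation (which variable ends up in the rows and therefore comes from transform rather than components_) is an assumption:

import pandas as pd
from sklearn.decomposition import NMF

# Co-occurrence counts from the sample above: rows = labels of "aa", columns = labels of "bb".
cooc = pd.DataFrame([[2, 1], [1, 0], [0, 1]], index=["a", "c", "z"], columns=["b", "d"])

nmf = NMF(n_components=2, init="random", random_state=0)
W = nmf.fit_transform(cooc)   # shape (3, 2): one vector per "aa" label
H = nmf.components_           # shape (2, 2): one vector per "bb" label (as columns)

aa_vectors = pd.DataFrame(W, index=cooc.index).add_prefix("aa__bb")      # roughly the aa__bb* columns
bb_vectors = pd.DataFrame(H.T, index=cooc.columns).add_prefix("bb__aa")  # roughly the bb__aa* columns

# Each original row then picks up the vector of its "aa" label and the vector of its "bb" label.
df = pd.DataFrame([("a", "b"), ("a", "b"), ("a", "d"), ("z", "d"), ("c", "b")], columns=["aa", "bb"])
encoded = df.join(aa_vectors, on="aa").join(bb_vectors, on="bb")
print(encoded)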

PaulWestenthanner commented 3 years ago

Thanks for your detailed explanations - I understand now how the encoder works.
However, I'm still reluctant to add it to the library given the lack of theory. My best guess, based on your explanation, is that the dimensionality reduction does two things:

  1. It de-correlates the two input variables
  2. It groups labels of high-cardinality categorical variables in a somewhat smarter way than just introducing an "other" label

While I think both are definitely useful, I don't think this is strictly an encoder but rather a general feature engineering step. With this in mind, I'd like to see a benchmark of this strategy vs. one-hot encoding + dimensionality reduction (which also achieves 1 and 2), or grouping rare labels into an "other" label and then applying one of the existing encoders. Ideally this would be done on some of the common academic benchmark datasets, with the results published in a blog post. That is obviously quite a lot of work, but in my opinion a guess as to why it works is too little to justify adding it to the library.
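
For concreteness, the baseline suggested here (one-hot encoding followed by dimensionality reduction) could look roughly like the following. This is only a sketch of the comparison pipeline; TruncatedSVD and the component count are illustrative choices, not anything prescribed by the library:

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [("a", "b"), ("a", "b"), ("a", "d"), ("z", "d"), ("c", "b")],
    columns=["aa", "bb"],
)

# Baseline: one-hot encode both categorical columns, then reduce to a few components.
baseline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    TruncatedSVD(n_components=2),
)
reduced = baseline.fit_transform(df[["aa", "bb"]])
print(reduced.shape)  # (5, 2)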

wcbeard commented 3 years ago

That sounds fair, and you've given me some things to think about too. I probably won't have time for a thorough treatment of the topic via a blog post in the near future, so I'll close this for now.