Hi @wcbeard
could you provide more detail on how your algorithm works? Is there a paper describing it?
Some questions I have when reading your example:

- What is the meaning of the vectors for `a` and `b`? Is it the rows/columns of the left/right orthogonal matrix?
- What is supposed to happen when `n_components` is larger than the cardinality of the columns?
- Why would the resulting vectors be better than (`n_components`) random numbers? From the top of my head I can't think of a convincing argument.

```python
import numpy as np; import sklearn.decomposition as dc
dc.NMF(n_components=10).fit_transform(np.array([[2, 1, 0], [1, 0, 1]]))
Out[7]:
array([[1.96833465e+00, 2.45553355e+00, 0.00000000e+00, 3.59352116e+00,
        0.00000000e+00, 0.00000000e+00, 8.81498780e-04, 7.40528238e-01,
        1.26965131e-03, 0.00000000e+00],
       [1.03898403e+00, 0.00000000e+00, 7.74788494e-01, 1.53559444e-02,
        5.91203856e-01, 6.52982072e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.80639046e-02]])
```
In regular usage it would only be useful to set `n_components` to a value smaller than the cardinality of the column.
Setting 5 components for a pair of columns `a` and `b` will give 5-dimensional vectors for each variable. The convention I'm using here is to create a column `a__b1` for the first dimension of the vector for `a` created from a co-occurrence matrix with variable `b`. So you also get the reverse: a vector value for the variable `b` from its co-occurrences with variable `a`, from the same co-occurrence matrix.

I mainly did this because for my own purposes I wanted both versions, and it was a cheaper way than specifying both `CoocurrenceEncoder(col_pairs=[("aa", "bb")])` and `CoocurrenceEncoder(col_pairs=[("bb", "aa")])`, since I can reuse the same co-occurrence matrix. But it might be a good idea to only keep one of those for a less confusing API.
Yes, I believe so. I get the `a => a__b` mapping from `nmf.components_`, and the `b => b__a` mapping from running `nmf.transform` on the co-occurrence matrix.
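To make that concrete, here is a simplified sketch of where the two sets of vectors come from (toy data, not the actual encoder code):

```python
import pandas as pd
from sklearn.decomposition import NMF

# Toy data with two categorical columns.
df = pd.DataFrame({
    "a": ["x", "x", "y", "y", "z", "z"],
    "b": ["u", "v", "u", "w", "v", "w"],
})

# Co-occurrence matrix with rows indexed by the categories of `b`
# and columns by the categories of `a`.
cooc = pd.crosstab(df["b"], df["a"])

nmf = NMF(n_components=2, init="random", random_state=0)
W = nmf.fit_transform(cooc.values)  # shape: (n_categories_of_b, n_components)
H = nmf.components_                 # shape: (n_components, n_categories_of_a)

# a => a__b*: each column of H holds the vector for one category of `a`.
a_vectors = pd.DataFrame(
    H.T, index=cooc.columns,
    columns=[f"a__b{i + 1}" for i in range(H.shape[0])])

# b => b__a*: each row of W holds the vector for one category of `b`.
b_vectors = pd.DataFrame(
    W, index=cooc.index,
    columns=[f"b__a{i + 1}" for i in range(W.shape[1])])

# One way to produce the final a__b*/b__a* columns: join the vectors
# back onto the rows of the original frame.
encoded = df.join(a_vectors, on="a").join(b_vectors, on="b")
```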
Hmmm...good question. If I had to guess, I'd make an analogy to recommender systems. Assuming the process that jointly generates both variables can be explained by a smaller number of latent variables (like topics or genres), then I'd think this method should be able to recover them, and that they would be more useful than the one-hot encoded versions (or random id's).
Sorry, I don't know of a paper. I can say empirically it significantly improved scores on the Amazon employee dataset above what mean encoding alone was able to get me.
Thanks for your detailed explanations - I understand now how the encoder works.
However, I'm still reluctant to add it to the library for the lack of theory. My best guess for your explanation is that the dimensionality reduction leads to two things:

1. fewer feature columns than a one-hot encoding would produce, and
2. an implicit grouping of rare or similar labels, comparable to mapping them to an `other` label.

While I think both are definitely useful, I don't necessarily think this is strictly an encoder but rather a general feature engineering step. With this in mind I'd like to see a benchmarking of this strategy vs. one-hot encoding + dimensionality reduction (hence doing 1 and 2), or grouping some labels into an `other` label and using some encoder. Ideally this would be done on some of the common academic benchmarking datasets and the results published in some blog post. This is obviously quite a lot of work, but in my opinion a guess as to why it works is too little to add it to the library.
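To make the first comparison concrete, that baseline could be as simple as something like this (just a sketch; the columns and number of components would of course need tuning):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD

# Baseline: one-hot encode the categorical columns, then reduce the
# resulting sparse matrix with truncated SVD.
baseline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    TruncatedSVD(n_components=5, random_state=0),
)

df = pd.DataFrame({"a": list("xxyyzz"), "b": list("uvuwvw")})
reduced = baseline.fit_transform(df[["a", "b"]])  # shape: (6, 5)
```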
That sounds fair, and you've given me some things to think about too. I probably won't have time for a fair treatment of the topic via a blog post in the near future, so I'll close for now.
I recently wrote an encoder that takes pairs of columns, generates a co-occurrence matrix, and runs SVD on it to reduce the dimensionality.
I have been working on a PR, but now I am realizing that it may not fit the API here. Instead of taking individual columns and mapping each one to a new column, it takes a pair of columns, and maps them to multiple new columns. For example, if you choose to use SVD to reduce the co-occurrence matrix to five dimensions, it will result in each column in the pair getting mapped to five new columns.
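To make the shape concrete, here is a simplified sketch of what happens for a single pair (toy data, 2 components instead of 5, and not the actual PR code):

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Toy data with one pair of categorical columns.
df = pd.DataFrame({
    "a": ["x", "x", "y", "y", "z", "z"],
    "b": ["u", "v", "u", "w", "v", "w"],
})

# One co-occurrence matrix for the pair ("a", "b").
cooc = pd.crosstab(df["a"], df["b"])

# Reduce it with SVD.
svd = TruncatedSVD(n_components=2, random_state=0)
a_vecs = svd.fit_transform(cooc.values)  # one vector per category of `a`
b_vecs = svd.components_.T               # one vector per category of `b`

# So the single input pair ("a", "b") produces n_components new columns
# for `a` and n_components new columns for `b`, rather than each input
# column mapping to exactly one output column.
```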
Do you see any way this could be made to fit in this repo, or is the format too different?