lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Is it possible to do dimension reduction on a single vector? #686

Open PosoSAgapo opened 3 years ago

PosoSAgapo commented 3 years ago

Hi, I am doing research on the representations produced by machine learning models and on some features of objects, which requires a correlation calculation. The machine learning model outputs a vector representation, and the features of an object are also a vector. However, one is a vector of length 768 and the other is a vector of length 1124. I would like to reduce the 1124-length vector to 768 dimensions. Is it possible for UMAP to do dimension reduction on a single vector? From what I know, dimension reduction techniques mostly reduce the number of features across a dataset rather than shrinking a single vector.

AlexanderGroeger commented 3 years ago

I wouldn't assume you can reduce dimensionality without having a distribution to work with. Can you modify the machine learning model architecture so that the vector lengths match?

lmcinnes commented 3 years ago

I don't believe this is tractable. UMAP representations are only unique up to translation and rotation. Solving that problem for a single vector is as easy as putting the vector at the zero vector -- and that is what UMAP would do in theory (in practice I'm not sure what it would do -- I would have to check the code to see how it handles such corner cases).

Any neighbor based technique (so also t-SNE, Isomap, etc.) is going to have the same issue. The next option is a matrix factorization technique. Unfortunately those are ill-posed since with n=1 and k=768 you have k >> n and there are no factorizations (you need n >= k at a minimum).
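To make the n=1 problem concrete, here is a minimal sketch that takes PCA as one illustrative factorization (an assumption for this sketch; no specific method is named above). PCA centers the data first, and a lone vector minus its own mean is all zeros, so there is no variance left to decompose. The helper name `center_rows` and the sample values are hypothetical.

```python
def center_rows(rows):
    """Subtract the per-column mean from every row (the first step of PCA)."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


# A "dataset" holding a single vector (hypothetical values, 4 dims for brevity).
single = [[0.5, -1.25, 3.0, 2.0]]

centered = center_rows(single)

# With one sample, each column mean equals the sample itself, so the
# centered data is all zeros and carries no variance to factorize.
total_variance = sum(v * v for row in centered for v in row)
print(centered, total_variance)
```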

Your best bet is to try for some dimension reduction of datasets of the 768 dim vectors and the 1124 length vectors to some common dimensionality while also trying to align the relevant information. The alignment is a challenging problem unless you have some way to pair up (at least some of) the data from 768 to 1124 dimensions (e.g. if there is a corresponding 1124 dimensional vector for each 768 dimensional vector). If you have the latter then at least something may be possible with AlignedUMAP, but we definitely need to be in a pretty specific case.

PosoSAgapo commented 3 years ago

I do have a one-to-one correspondence between the 1124-length and 768-length vectors. Is AlignedUMAP a variant of UMAP, or is it something I would have to adapt to my own dataset?

lmcinnes commented 3 years ago

AlignedUMAP is an extension of UMAP to perform multiple UMAP embeddings that are aligned with each other / in the same space. There is some documentation for it here and here.

The plan for your use case would be to provide a list of the two datasets, and as relations a dictionary mapping indices of points in the first dataset to indices of points in the second dataset. Thus if the zeroth element of the 768 dim dataset corresponds to the 23rd element of the 1124 dim dataset you would have relation_dict[0] = 23 and so on. If you have both datasets sorted (so that the first entry corresponds to the first, the second to the second and so on), you will still need the relation dictionary, but it will be fairly trivial to construct.
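A minimal sketch of that plan in Python, assuming umap-learn is installed. The helper names (`identity_relation`, `relation_from_pairs`, `fit_aligned`) and the argument names `data_768` / `data_1124` are hypothetical, chosen only for illustration. The sketch has not been checked against datasets of differing widths, so treat the fitting step as something to try rather than a confirmed recipe.

```python
def identity_relation(n_points):
    """Relation for two datasets sorted the same way: row i <-> row i."""
    return {i: i for i in range(n_points)}


def relation_from_pairs(pairs):
    """Relation for unsorted data, built from (index_in_first, index_in_second) pairs."""
    return {int(a): int(b) for a, b in pairs}


def fit_aligned(data_768, data_1124, relation, n_components=2):
    """Fit AlignedUMAP on both datasets and return one embedding per dataset.

    data_768 and data_1124 are expected to be 2D arrays (n_samples, n_features).
    """
    # Imported lazily so the helpers above work without umap-learn installed.
    import umap

    # One relation dict per consecutive pair of datasets; with two datasets
    # that is a list containing a single dict.
    mapper = umap.AlignedUMAP(n_components=n_components).fit(
        [data_768, data_1124], relations=[relation]
    )
    return mapper.embeddings_


# Sorted case: the relation is trivial to construct.
sorted_relation = identity_relation(5)

# Unsorted case: e.g. element 0 of the first dataset matches element 23 of the second.
unsorted_relation = relation_from_pairs([(0, 23), (1, 7)])
```

Note that the embeddings come out in a shared low-dimensional space (`n_components`), not as 768-dimensional vectors, so any correlation would be computed between the two aligned embeddings.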

PosoSAgapo commented 3 years ago

Ok, I will see what I can do. Thanks!

lmcinnes commented 3 years ago

If it works out please do let me know -- potentially we can document it, or something like it, as a different use case for AlignedUMAP.