Hellisotherpeople opened this issue 4 months ago
Playing with UMAP currently! I have it working but it's pretty funky, needs small coefficients. Doesn't seem to be a huge improvement over PCA currently, but it's possible the way I'm doing it isn't ideal. Might include it experimentally in the upcoming release!
(Generating a vector with UMAP is also ~30x slower than PCA currently.)
Very interesting!
Given the training-performance issues you describe, there is a cuML GPU implementation of UMAP (and of many other dimensionality reduction algorithms that could be offered) - https://docs.rapids.ai/api/cuml/stable/api/#umap - it's certainly a larger dependency chain, but these days everyone's accepted nvidia's stack as being mandatory, so it might be good to offer as an optional dependency at least.
I think there is some tuning you can do with base UMAP's hyperparameters to improve speed and possibly the quality of the generated control vectors. A UMAP expert would be able to look over that and make sure it's set "correctly" given the data - unfortunately that is not me (and likely fewer than 100 of them exist in the world).
As for why it requires smaller coefficients and why the performance may be hard to quantify as better - I'd love to see some analysis of this from others in the community, or even from the UMAP creator himself (or at least one of the aforementioned 100).
I'm extremely appreciative that you have implemented it yourself and tried it. Very happy to see such rapid response and that it might even be made available to others. Thank you!!!
`umap` is now experimentally supported as an (undocumented) option in #34 — use `ControlVector.train(..., method="umap")`, and ensure the `umap-learn` package is installed.
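For context, here is a minimal numpy sketch of the PCA step that `method="umap"` experimentally replaces: take hidden-state differences between paired positive/negative prompts and extract the top singular direction as the control vector. The data here is synthetic and the variable names are illustrative, not repeng's actual internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-layer hidden-state differences between
# paired positive/negative prompts (real code would read these from a model).
n_pairs, dim = 200, 64
concept = np.zeros(dim)
concept[0] = 1.0                                   # planted "concept" axis
scales = rng.normal(scale=2.0, size=(n_pairs, 1))  # per-pair concept strength
diffs = scales * concept + rng.normal(scale=0.1, size=(n_pairs, dim))

# PCA via SVD on the mean-centered differences: the top right-singular
# vector is the control direction (this is the step UMAP would replace).
centered = diffs - diffs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]

# Sign is arbitrary, so compare by absolute alignment with the planted axis.
alignment = abs(direction @ concept)
print(round(alignment, 3))
```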
Please feel free to use this issue to continue discussing umap and potential improvements! I'm not sure if the current method is the ideal usage of it.
Thanks @vgel for all this.
I don't have a GPU and will have little free time for quite a while still, but I'm very curious whether nonlinear dimensionality reduction works "better".
Here are a few thoughts:
Anyway, I won't have time for about 6-12 months but may do a PR eventually.
If anyone's interested, please share your findings, especially negative results!
Addendum to my thoughts above (I hope nobody will mind!):
There's a whole large body of work on dimensionality reduction that handles nonlinearity better - e.g. UMAP: https://umap-learn.readthedocs.io/en/latest/
Is it simple to just "drop" this in place of PCA and get theoretically better results? If not, why?
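One concrete reason it's not a pure drop-in: PCA yields loadings, i.e. an explicit direction in hidden-state space, while UMAP only returns per-sample embedding coordinates with no `components_` attribute. To steer a model you still need a direction, so one hedged workaround is to fit a linear readout from the hidden states to the 1-D embedding and use its weights as the direction. The sketch below uses a synthetic stand-in for the embedding; real code would use something like `umap.UMAP(n_components=1).fit_transform(X).ravel()`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic hidden states X, plus a stand-in 1-D "embedding" y that, in
# real code, would come from UMAP (which gives coordinates, not a direction).
n, dim = 300, 32
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
X = rng.normal(size=(n, dim))
y = X @ true_dir + rng.normal(scale=0.05, size=n)

# Linear readout X @ w ≈ y: least-squares recovers a steering direction
# in the original space from embedding coordinates alone.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
w /= np.linalg.norm(w)
recovery = abs(w @ true_dir)
print(round(recovery, 3))
```

Of course, if UMAP's embedding is genuinely nonlinear in the hidden states, a linear readout can only capture part of it, which may be related to why the results aren't clearly better than PCA.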
What about other approaches, like NMF? https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
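One wrinkle with NMF: it requires nonnegative input, but hidden states are signed. A common workaround (sketched below with scikit-learn; the splitting trick is an assumption on my part, not anything the library mandates) is to stack the positive and negative parts column-wise, factorize, then fold each factor back into a signed direction.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)

# Stand-in for signed hidden-state differences.
X = rng.normal(size=(100, 16))

# NMF needs nonnegative data: split into positive and negative parts.
X_nn = np.hstack([np.clip(X, 0, None),    # positive part
                  np.clip(-X, 0, None)])  # negative part

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X_nn)  # per-sample activations, shape (100, 2)
H = model.components_          # nonnegative factors, shape (2, 32)

# Fold the first factor back into a signed direction in the original space.
direction = H[0, :16] - H[0, 16:]
print(W.shape, H.shape, direction.shape)
```

Whether such parts-based factors make meaningful control vectors is an open question; this only shows the mechanics are workable.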