lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Partial fit / Incremental UMAP #62

Open parashardhapola opened 6 years ago

parashardhapola commented 6 years ago

Hi @lmcinnes,

There has been quite a lot of excitement about UMAP in the scRNA-seq area recently. However, fitting larger datasets might be an issue depending on the amount of available memory. I would love to see a 'partial_fit' method (akin to the one available in scikit-learn's IncrementalPCA) in UMAP, so that data could be lazily loaded and fitted. I do realize that this might be non-trivial, if not theoretically infeasible, but your recent implementation of the 'transform' method got my hopes up. It would be nice to have your thoughts on this.
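For reference, the scikit-learn pattern being asked for looks roughly like this sketch (chunk sizes are arbitrary, and the import is guarded in case scikit-learn is not installed):

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.rand(1000, 20)

# Feed the estimator one chunk at a time instead of the full array.
chunks = [data[i:i + 200] for i in range(0, 1000, 200)]

try:
    from sklearn.decomposition import IncrementalPCA

    ipca = IncrementalPCA(n_components=2)
    for chunk in chunks:
        ipca.partial_fit(chunk)  # each call sees only one chunk
    embedding = ipca.transform(data)
except ImportError:
    embedding = None  # scikit-learn not available in this environment
```

The request in this issue is for UMAP to support the same kind of incremental interface.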

Thanks.

lmcinnes commented 6 years ago

If I understand you correctly you want essentially an "out-of-core" fit method that doesn't require the whole dataset to be in memory at one time?

I believe that this is possible, but would likely be slow. Essentially what is really needed is a fast, out-of-core, approximate nearest neighbor algorithm. I believe what I have now may actually work with dask arrays, which would provide out-of-core support, but for the fact that check_array converts them to an in-memory array at the start, and modulo how numba handles dask (I suspect it doesn't). In other words, I think this is algorithmically quite tractable, but might require a fair amount of code to actually make it happen.
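For the memory side of the problem, the lazy-loading access pattern being discussed can be illustrated with a plain NumPy memmap. This is only a sketch of the access pattern; UMAP does not currently consume data this way:

```python
import os
import tempfile
import numpy as np

# Write a dataset to disk, then read it back lazily in chunks.
path = os.path.join(tempfile.mkdtemp(), "data.npy")
data = np.random.RandomState(0).rand(1000, 16)
np.save(path, data)

# mmap_mode maps the file without loading it all into RAM.
lazy = np.load(path, mmap_mode="r")

# Example out-of-core computation: the column means, one chunk at a time.
chunk_sums = []
for start in range(0, lazy.shape[0], 256):
    chunk = np.asarray(lazy[start:start + 256])  # only this slice is read
    chunk_sums.append(chunk.sum(axis=0))

overall_mean = np.sum(chunk_sums, axis=0) / lazy.shape[0]
```

An out-of-core fit would need the nearest neighbor search and the optimization loop to work through an interface like this (or a dask array) rather than a single in-memory array.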

If there is significant interest in this (and it seems there may be) I'll see if I can figure out how to actually make this happen.

parashardhapola commented 6 years ago

Yes indeed, I'm looking for an out-of-core fit method and I'm sure that it would be of interest to many.

Pardon me if this is a very naive suggestion, but I was thinking that maybe this functionality could initially be implemented downstream of the distance calculation. In effect, a user would provide a generator which yields a distance vector for each sample (of shape m for an m x m distance matrix), instead of the complete pre-computed matrix itself. This might avoid, for now, the requirement of an out-of-core ANN algorithm.
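A rough sketch of the proposed interface might look like this. To be clear, `distance_row_generator` is hypothetical and not something UMAP accepts; it just illustrates yielding one row of the distance matrix at a time so the full m x m matrix is never materialized:

```python
import numpy as np

def distance_row_generator(X):
    # Hypothetical interface: yield row i of the m x m distance matrix,
    # computed on demand, instead of precomputing the whole matrix.
    for i in range(X.shape[0]):
        yield np.sqrt(((X - X[i]) ** 2).sum(axis=1))

X = np.random.RandomState(42).rand(100, 8)
rows = distance_row_generator(X)
first_row = next(rows)  # distances from sample 0 to all samples
```

Note that this trades memory for time: each row costs an O(m) pass over the data, so a full consumption of the generator is O(m^2).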

stsievert commented 6 years ago

Essentially what is really needed is a fast, out-of-core, approximate nearest neighbor algorithm.

Is this what we were working on during the SciPy sprints on some branch/fork https://github.com/lmcinnes/pynndescent? I'd like to see the code – I'm curious what progress has been made, and would be happy to help out more.

lmcinnes commented 6 years ago

Yes, that's the one. I got basic code hammered out, but hadn't fixed all the bugs. I also chatted with Matt Rocklin and it looks like the proposed solution would be quite slow due to the random(ish) access of a large distributed array required. Still, slow but tractable is better than impossible.

stsievert commented 6 years ago

it looks like the proposed solution would be quite slow due to the random(ish) access of a large distributed array required.

I believe that. It sounds like the proposed solution will be enabled by https://github.com/dask/dask/issues/3409.

Is there anywhere I can review the code you've implemented?

lmcinnes commented 6 years ago

I didn't get it working to a point of pushing it up anywhere. I have a colleague who may be picking up this work, so I might try and loop him in with all of this on Monday.

mrocklin commented 6 years ago

On the dask side I think that https://github.com/dask/dask/issues/3409#issuecomment-405254656 is probably the next step to try (assuming I understand the situation on the UMAP side (which is doubtful)) :)

soupault commented 5 years ago

Really looking forward to it! Computer vision domain would benefit quite a lot from an out-of-core implementation.

ksangeeta2429 commented 5 years ago

Was there something already released for this? I am also looking for the same thing, for acoustic deep learning purposes. I am looking to use UMAP on several TB of data.

lmcinnes commented 5 years ago

Currently this is still, at best, an "in the works" issue. The first real hurdle to be overcome, the nearest neighbor search, is looking good. There is a parallel NN search, and there is reasonable hope that a few more changes will make it dask-enabled for distributed arrays -- this has all been put in the PyNNDescent library, which will become a dependency for umap in the future. The next step is distributed SGD. This is easier, but has had less work done. Speculatively, I would suggest that these are features that may arrive in version 0.5. Note, of course, that 0.4 is not even released yet. I suspect that means you may have to use subsampling for your project for now. The good news is that the structures found by umap are generally fairly robust to subsampling.
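For the subsampling workaround mentioned at the end, the usual pattern is to fit on a random subsample and then project the remaining points with transform(). A minimal sketch (the sample size is arbitrary, and the umap import is guarded in case umap-learn is not installed):

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.rand(2000, 20)

# Fit on a random subsample, then embed the remaining points.
sample_idx = rng.choice(len(data), size=500, replace=False)
rest_idx = np.setdiff1d(np.arange(len(data)), sample_idx)

try:
    import umap  # requires umap-learn

    reducer = umap.UMAP(n_neighbors=15).fit(data[sample_idx])
    embedding_rest = reducer.transform(data[rest_idx])
except ImportError:
    embedding_rest = None  # umap-learn not available in this environment
```

The transform step only places new points relative to the fitted embedding; it does not update the model, which is why this is a workaround rather than true incremental fitting.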

stsievert commented 5 years ago

next step is distributed SGD.

https://github.com/dask/dask/issues/3409 is related (and mentioned above). It kept popping up in my Dask distributed SGD implementation (https://github.com/dask/dask-glm/pull/69).

https://github.com/dask/dask/pull/3901 resolves this issue (thanks to @TomAugspurger). Now, shuffling to a distributed Dask array is much quicker (~100x improvement, 13.7s vs 140ms).

teenooy commented 4 years ago

Would really like to see the incremental training feature for UMAP. I work with both image and text data, and UMAP easily generates the best manifold with the least amount of tuning compared to everything else I've tried in both categories.

lmcinnes commented 4 years ago

At present it looks like this is something that might be possible in the v0.5 to v0.6 time-frame. That's not soon, but it may yet happen.

ghost commented 4 years ago

I am writing to mention that I am very interested in this.

ghost commented 4 years ago

A possible method to do a large-scale nearest neighbor search efficiently would be to use something like k-d trees or this. I attempted this a while ago and found that it works for lower-dimensional data like VGG16 features. I tried to use it on some very high-dimensional genomics applications and had no luck.

I’ll try to dig up this code or my notes on this.

galinaalperovich commented 4 years ago

Would be happy to see this incremental UMAP feature!

lmcinnes commented 4 years ago

Depending on what exactly you are looking for, this may now be available (in preliminary form) in the 0.5dev branch. To make it work you would use the new, special, AlignedUMAP class, and the update functionality therein. I have a notebook gist that provides a basic walkthrough of these sorts of features. With this approach if your goal is to incrementally add (batches) of new points to an existing embedding, updating the embedding as you go, this should suffice. It may not meet all of your needs, but hopefully it will solve some problems for some users.
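For anyone trying the AlignedUMAP route, one way to build the relations dictionaries for overlapping batches of a stream might look like the following sketch. The batch and overlap sizes are arbitrary, and the AlignedUMAP call is guarded in case umap-learn 0.5+ is not installed:

```python
import numpy as np

rng = np.random.RandomState(0)
stream = rng.rand(150, 10)

# Split the stream into overlapping batches; the overlap supplies the
# relations AlignedUMAP uses to align consecutive embeddings.
batch_size, overlap = 60, 20
batches, starts = [], []
start = 0
while start < len(stream):
    batches.append(stream[start:start + batch_size])
    starts.append(start)
    start += batch_size - overlap

# relations[i] maps row indices in batch i to row indices of the
# same points in batch i + 1.
relations = []
for i in range(len(batches) - 1):
    shift = starts[i + 1] - starts[i]
    n_shared = min(len(batches[i]) - shift, len(batches[i + 1]))
    relations.append({shift + j: j for j in range(n_shared)})

try:
    import umap  # requires umap-learn >= 0.5

    mapper = umap.AlignedUMAP().fit(batches, relations=relations)
except ImportError:
    mapper = None  # umap-learn not available in this environment
```

The update functionality mentioned above then extends the fitted mapper with further batches; see the linked notebook gist for the full walkthrough.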

fedden commented 3 years ago

Thanks for the early release and consideration. Just wondering if there is an approach to take with respect to the relation dictionaries when successive instances in the dataset do not depend on each other (no overlap) and as such have no relations between successive frames.

E.g. I tried passing in empty dicts in the hope they would be interpreted as no relations between successive instances:

reducer.fit([data], relations=[{} for _ in data][:-1])

But I get an error roughly saying it expects non-zero-length dicts.

lmcinnes commented 3 years ago

That's definitely a bug -- as you say, it should treat that as no relation. I presume you have a dataset you want aligned except at some intermediate slice where you lack relations? I'll have to dig in a little and see if this is easy to fix. Having at least some relation is baked in to the assumptions right now, but I can see that having empty relations occasionally does make sense.

irl-dan commented 1 year ago

Did this ever make it into any released version of the package? Or any plans on doing so? I'm also interested in this.

jmdelahanty commented 1 year ago

I too have an application for which this would potentially be very helpful!

AtomicCactus commented 1 year ago

Just want to +1 this request, an incremental out-of-core fit method would be awesome!

SbstnErhrdt commented 1 year ago

Joining +1

JdbermeoUZH commented 1 year ago

+1!

ECburx commented 1 year ago

+1

mahabenfares commented 1 year ago

Hello, I want to use UMAP for data streams. Do you offer an incremental version of UMAP that I can use, please? I saw that you mentioned AlignedUMAP and I wanted to test it, but I don't know what to put in the relation dictionaries.

lmcinnes commented 1 year ago

I think a lot of this is beyond what the basic UMAP approach can do, as it is essentially transductive. For those who really need this, I would suggest your best option is going to be ParametricUMAP, using new batches to retrain. If you need pure streaming, you can use transform on ParametricUMAP for the streaming data, and store batches for batch-based retraining on a cycle that is reasonable given the retrain cost.
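One illustrative way to schedule the batch-based retraining described above is sketched below. The `StreamingEmbedder` helper is hypothetical, not part of umap-learn; in practice `model_factory` could be `lambda: ParametricUMAP()` (from `umap.parametric_umap`, which requires TensorFlow), but a trivial stand-in model keeps the sketch self-contained and runnable:

```python
import numpy as np

class StreamingEmbedder:
    """Hypothetical helper: transform streaming batches with a fitted
    model, buffering data and refitting every `refit_every` batches."""

    def __init__(self, model_factory, refit_every=5):
        self.model_factory = model_factory
        self.refit_every = refit_every
        self.buffer = []
        self.batches_seen = 0
        self.model = None

    def process(self, batch):
        self.buffer.append(batch)
        self.batches_seen += 1
        if self.model is None or self.batches_seen % self.refit_every == 0:
            # Retrain from scratch on everything buffered so far.
            self.model = self.model_factory().fit(np.vstack(self.buffer))
        return self.model.transform(batch)

class _CenteringStandIn:
    """Trivial stand-in for ParametricUMAP so the sketch runs anywhere."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_

embedder = StreamingEmbedder(_CenteringStandIn, refit_every=3)
rng = np.random.RandomState(0)
outputs = [embedder.process(rng.rand(10, 4)) for _ in range(7)]
```

The retrain cycle (`refit_every`) and the unbounded buffer are the two knobs to tune against retraining cost; a real deployment might cap or subsample the buffer.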

big-o commented 1 year ago

I'm not sure I fully understand all of the issues, but is a scalable nearest neighbour search still a blocker? If so, I've had good success with the FAISS implementation of approximate NN search (went from days with the standard ball-tree approach to seconds), and it can be wrapped fairly trivially in a sklearn-style API. The downside is that it limits you to Euclidean and cosine distance measures only - I'm not sure whether this limitation is inherent to the algorithm or just because other distances haven't been implemented yet.
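A minimal sketch of such a wrapper, with `knn_search` a helper of my own invention: it uses FAISS's exact `IndexFlatL2` when FAISS is installed and falls back to brute-force NumPy otherwise. (FAISS's real speedups come from its approximate indexes such as `IndexIVFFlat` or `IndexHNSWFlat`; the flat index just keeps the sketch simple.)

```python
import numpy as np

def knn_search(data, queries, k):
    """k-nearest-neighbour search via FAISS when available, with a
    brute-force NumPy fallback. Distances are squared L2 in both
    branches, matching faiss.IndexFlatL2."""
    data = np.ascontiguousarray(data, dtype=np.float32)
    queries = np.ascontiguousarray(queries, dtype=np.float32)
    try:
        import faiss

        index = faiss.IndexFlatL2(data.shape[1])  # exact L2 index
        index.add(data)
        dists, idx = index.search(queries, k)
        return dists, idx
    except ImportError:
        # Brute force: squared Euclidean distances, then sort.
        d2 = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(-1)
        idx = np.argsort(d2, axis=1)[:, :k]
        return np.take_along_axis(d2, idx, axis=1), idx

rng = np.random.RandomState(0)
X = rng.rand(500, 16)
dists, idx = knn_search(X, X[:5], k=3)  # each point's nearest is itself
```

Whether UMAP could consume such a wrapper directly is a separate question; internally it relies on PyNNDescent, which supports many more metrics than FAISS.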

lagunabravo commented 1 year ago

Thank you @lmcinnes for your great work. I'm uncertain about how to use retraining on ParametricUMAP as you suggested on Mar 4. I guess what I don't understand is how retraining modifies the embedder with regard to different batches of a large dataset. Should I split the dataset into batches, train/retrain across all of them, and then use the final embedder to apply transform across the whole dataset? Or, alternatively, should I apply transform after retraining for each individual batch?

AlekseyFedorovich commented 1 year ago

I would love this feature!!

123mitnik commented 1 year ago

+1