jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0

Parallelizing the optimization step (redux) #83

Closed · LTLA closed this issue 2 years ago

LTLA commented 3 years ago

To recap for other readers: it is currently possible to parallelize the optimization step by setting the (non-default) n_sgd_threads > 1. However, this is not ideal because the code is subject to race conditions, meaning that the results are not reproducible even when the seed is set. This makes it difficult to recommend - maybe not so bad for visualization, but still pretty bad, especially when you're on a manuscript deadline and you're trying to recreate a figure.

Unfortunately, parallelization of the optimization is not straightforward, as later iterations depend on the results of earlier ones. The current code just ignores these dependencies, resulting in race conditions. Ideally, parallelization would occur transparently, giving the same results as the serial version in less time. I spent a few days this week exploring some approaches based on locking or other forms of synchronization. This was largely disappointing: the overhead of thread management offset (and then some) any benefit from sharing work across threads.

I settled on approximating the optimization step by computing the updates first and then applying them in batch. That is, gradients and deltas are computed for all points before any point's coordinates are changed. In contrast, the current approach computes the deltas for each point and applies them immediately before proceeding to the next point. This batch approach is much more amenable to parallelization, as each point's calculations are now independent.
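To make the contrast concrete, here is a rough sketch of the two strategies for a single epoch (not the actual umappp/uwot code; `Gradient` and the function names are just placeholders):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// grad(i, d, coords) returns the update for dimension d of point i,
// computed from the coordinates in 'coords'.
using Gradient = std::function<double(std::size_t, std::size_t, const std::vector<double>&)>;

// In-place (current behaviour): each point's delta is applied immediately,
// so later points in the same epoch see the already-moved coordinates.
void epoch_in_place(std::vector<double>& embedding, std::size_t n, std::size_t ndim,
                    const Gradient& grad) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t d = 0; d < ndim; ++d) {
            embedding[i * ndim + d] += grad(i, d, embedding);
        }
    }
}

// Batch: all deltas are computed against the coordinates frozen at the start
// of the epoch and only applied at the end, so the per-point work is
// independent and can be distributed across threads.
void epoch_batch(std::vector<double>& embedding, std::size_t n, std::size_t ndim,
                 const Gradient& grad) {
    std::vector<double> updated(embedding); // snapshot to accumulate updates into
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
        for (std::size_t d = 0; d < ndim; ++d) {
            updated[i * ndim + d] += grad(i, d, embedding); // reads old coords only
        }
    }
    embedding.swap(updated); // apply everything at once
}
```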

The actual code is here; I use OpenMP but it should be straightforward to adapt to a different framework. Each parallelizable iteration contains all edges corresponding to a single head point, which simplifies book-keeping. The neatest part is the definition of the seeds in the serial section before the creation of a per-iteration RNG; this enables the reproducible creation of iteration-specific random number streams. In the PCG32 case, you'd just need to combine 2 32-bit ints to create a 64-bit seed.
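Something along these lines (a sketch only; `make_iteration_seeds` is an illustrative helper, not the actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draw two 32-bit values per iteration from a master RNG (serially, so the
// result is independent of thread scheduling) and splice them into a 64-bit
// seed, e.g. for PCG32. Each parallel iteration then constructs its own RNG
// from its seed, giving a reproducible, iteration-specific random stream.
std::vector<std::uint64_t> make_iteration_seeds(std::mt19937& master, std::size_t n_iter) {
    std::vector<std::uint64_t> seeds(n_iter);
    for (auto& s : seeds) {
        const std::uint64_t hi = master() & 0xffffffffu;
        const std::uint64_t lo = master() & 0xffffffffu;
        s = (hi << 32) | lo;
    }
    return seeds;
}
```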

See the (previously mysterious) "umappp batch" figure in #80 for an example of the effect of batching. This approach is similar to how t-SNE performs its updates, and is analogous to the difference between the Hartigan-Wong (immediate) and Lloyd (batch) algorithms for k-means. Based on this analogy, I would speculate that, while batching enables convenient parallelization, it comes at the cost of stability/accuracy of the optimization in the early steps. Nonetheless, it could be a useful mode for getting a quicker reproducible answer for big datasets; my testing indicates that it gives a 3-4-fold speed up with 10 threads.

jlmelville commented 3 years ago

Nonetheless, it could be a useful mode for getting a quicker reproducible answer for big datasets; my testing indicates that it gives a 3-4-fold speed up with 10 threads.

Are you planning on doing any more testing of this? For example, with different numbers of threads and different dataset sizes? Is the speed up you are getting with 10 threads using a default number of iterations? Say I am using the MNIST digits dataset with 4 threads: is there a noticeable change in speed/visualization quality with this approach?

Right now umappp offers the Hogwild!-style asynchronous SGD (the current implementation in uwot and the "real" Python UMAP package) as well as this new batch optimization. Is it planned to keep both modes? It sounds like that's the case; I just want to check.

LTLA commented 3 years ago

Are you planning on doing any more testing of this?

Yes, I'll try to put together something over the week. I need to keep reminding myself to set n_epochs = in uwot::umap to get comparable timings - I kept being puzzled by umap's incredible performance before realizing it was doing 2.5-fold fewer iterations! Anyway, some quick testing on MNIST indicates that the batched method cuts the time from 50 seconds to 25 seconds. (This omits the time spent on the nearest neighbor search, which is the same for all methods.)

Is it planned to keep both modes?

That's right, though I don't parallelize the "standard" implementation; that's strictly serial. I think that's fine as a default: the updates are probably more coherent when applied immediately rather than in batch.

SamGG commented 3 years ago

Great job, Aaron. Thanks on behalf of the community.

jlmelville commented 3 years ago

@LTLA I am trying to formulate a plan for how to refactor towards umappp. I think the first step is to get a version of your batch approach into uwot. For now, I will freely adapt the code rather than import umappp directly. I have a couple of questions about the umappp implementation:

In terms of the main optimization loop, it looks like the main difference is that standard UMAP updates the positive pair coordinates directly (lcopy and rcopy in umappp, head_embedding and tail_embedding in uwot), while batch UMAP updates only the buffer vector, with double the magnitude of the update. From the comment there, the idea is to store (and eventually apply) the update only to the head_embedding. I guess this is to avoid different threads updating parts of the data structure outside their safe data "window"?

Also, for the batch update code there is some clever stuff between iterations that switches the buffer passed to the optimization loop to avoid fully copying.

Finally, the code uses double everywhere. If you recall #46, I switched everything to float to increase reproducibility between 32-bit and 64-bit builds. Any suggestions on handling this? Should I just go back to double?

If there are other gotchas I should be aware of, please let me know.

LTLA commented 3 years ago

I am trying to formulate a plan for how to refactor towards umappp. I think the first step is to get a version of your batch approach into uwot.

Great. I also made a start with a small sandbox to get some timings and results, as suggested above. I actually tried doing this directly in my uwot fork but the C++ code isn't quite a plug-and-play replacement for the existing uwot code, so I just took an easier route.

I guess this is to avoid different threads updating parts of the data structure outside their safe data "window"?

Yep, that's right. In batched mode, I believe the update should be symmetric (i.e., A->B's force is the same as B->A's force if the coordinates are not changed in the meantime). So I just double the update to avoid touching the memory of the tail observation when computing for the head observation.
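In code, the asymmetric attractive update looks roughly like this (a sketch of the idea, not the actual umappp implementation; the names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Only the head's buffered coordinates are touched, by twice the usual amount.
// Because the force is (approximately) symmetric when coordinates are frozen
// for the epoch, the total displacement over the epoch works out the same,
// and no thread ever writes to the tail observation's memory.
inline void attractive_update(std::vector<double>& head_buffer,
                              std::size_t head, std::size_t ndim,
                              const std::vector<double>& grad) {
    for (std::size_t d = 0; d < ndim; ++d) {
        head_buffer[head * ndim + d] += 2.0 * grad[d];
        // the symmetric version would instead also apply -grad[d] to the tail
    }
}
```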

Any suggestions on handling this? Should I just go back to double?

Mostly my use of double is driven by my muscle memory. In fact, I remember thinking to myself, "I'll template this later", which is what I do for some other libraries, e.g., knncolle. I guess this would be a good time.

LTLA commented 3 years ago

I finally got around to making some timings on the MNIST dataset.

The batched one looks a little wonky, though. For uwot:

[image: uwot embedding of MNIST]

For umappp without batching:

[image: umappp embedding of MNIST, no batching]

For umappp with batching:

[image: umappp embedding of MNIST, with batching]

The source code is here, and I get similar conclusions from the fashion MNIST.

jlmelville commented 3 years ago

Hmm, I wonder if batching needs a different learning rate (alpha) or more epochs.

jlmelville commented 3 years ago

Another possible source of wonk could be the asymmetric update. @LTLA if you go back to updating both sides of the positive edges (and run it with one thread) does that have an effect? That would at least decouple the effect of the asymmetric attractive update from the effect of not updating the coordinates within the epoch.

jlmelville commented 3 years ago

Here are two images from my own independent implementation, each with 200 epochs and single-threaded.

The "in-place" one is the way uwot currently works (i.e. updating the coordinates immediately). The "batch" result calculates the gradients using the previous epoch's coordinates, so should be equivalent to how the umappp batch would work with one thread and if the update rule for positive pairs hadn't been altered to only apply to one point.

[images: in-place (left) and batch (right)]

The wonk is clearly present in my batch case as well, so at least it's reassuring to know that we get the same sort of effect with two different implementations, and that the effect is due to the coordinate update strategy (nothing to do with only moving one side of the positive pair).

Increasing the number of epochs by a factor of ten (n_epochs = 2000) in the batch case gives:

[image: batch2000 - batch mode with n_epochs = 2000]

This is much more normal-looking and in line with what I would expect a better-converged-than-typical-defaults UMAP of the MNIST digits to look like. So almost certainly what we are dealing with here is a mechanism similar to the observation that stochastic gradient descent tends to converge faster than full-batch gradient descent. Some potential strategies: scale down the learning rate and increase the number of epochs to compensate, or move to a mini-batch update.

Anyway, I will continue to chip away at implementing batch mode in uwot at my usual snail's pace.

LTLA commented 3 years ago

Huh, I wasn't able to get that with num_epochs = 2000 in my code:

[image: umappp batch result with num_epochs = 2000]

Bit of a puzzle. I'll have to stare at it for a little bit.

More generally - hmm - I was hoping that the batch update would be more of a plug-and-play substitute. It worked okay in my tests, but I guess it misses more subtle structure. Maybe it's just not possible to parallelize the optimization effectively.

FWIW the minibatch approach (i.e., your last suggestion) sounds the most promising to me.

jlmelville commented 3 years ago

I changed the attractive update to be asymmetric as in umappp and the results seem to be closer to umappp:

[image: batch2ka - batch mode with the asymmetric attractive update]

The effect seems most pronounced on the elongated cluster (of the digit '1' IIRC).

A worst-case scenario for symmetric updates would be for each thread to accumulate its updates in a local list, with a subsequent single-threaded step iterating over the lists and applying them. If the gradient calculation is a lot slower than applying the update, it might still be a win (at the potentially substantial cost of the extra storage).

LTLA commented 3 years ago

Just spent an evening trying to parallelize the tight inner loops (i.e., where we iterate over the dimensions) with SIMD autovectorization. Not very promising, I'm afraid: it only sliced 1 second off a total runtime of 15 seconds in my test case. Probably not worth it, given that we would have to compile with -march=native, and the resulting binaries wouldn't work across different architectures, which would basically make it unusable for HPCs with heterogeneous compute nodes.

(Rather annoyingly, -O3 -march=native is pretty impressive, cutting 5 seconds off the runtime. Annoying because we can't really set it - only the user can, otherwise we end up with problems above.)
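For reference, the kind of inner loop I was trying to vectorize looks something like this (illustrative only; the real loops live inside the gradient calculations):

```cpp
#include <cstddef>

// Tight per-dimension loop of a coordinate update. '#pragma omp simd' asks
// the compiler to vectorize it, but in practice most of the gain came from
// -O3 -march=native, which can't be forced on users' builds.
inline void add_scaled(double* head, const double* diff, double scale, std::size_t ndim) {
    #pragma omp simd
    for (std::size_t d = 0; d < ndim; ++d) {
        head[d] += scale * diff[d];
    }
}
```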

Well, I'm at a loss. It really does seem that the only way to get closer to the canonical UMAP result is to compromise on performance in some manner that decreases the benefit of parallelizing in the first place. I was hoping that the difference would be negligible, and to be fair, the results are very qualitatively similar... but it does look different enough that I couldn't recommend it as an unqualified replacement for the standard approach.

IMO it's still better than the irreproducible parallelization scheme that uwot currently has. But at this point, maybe it's better to just not offer any parallelization of the optimization step at all.

jlmelville commented 3 years ago

Minor update: I added the asymmetric update to the Python UMAP (just commenting out the tail_embedding update and multiplying the attractive head_embedding update by 2) and I get similar results (a banana-shaped '1' cluster), so it is less likely to be a bug in uwot or umappp (not that I thought it was) and really does seem to be a side effect of doubling the magnitude of the update:

[image: a_e200_lr1 - asymmetric update, 200 epochs]

More epochs does sort it out:

[image: a_e2000_lr01 - asymmetric update, 2000 epochs]

Maybe the more vigorous application of a presumably poor estimate of the gradient (whatever the true gradient actually is) in one go, as in the asymmetric update, can lead to initial distortions that take longer to ameliorate? In that case, mini-batching would help. Or scaling down the learning rate and increasing the number of epochs to account for it. Then batching would only be a win over single-threaded SGD above some threshold number of threads, but if that threshold is small enough (say, 4 threads) it would still be useful.

LTLA commented 3 years ago

Very interesting. Yes, your hypothesis makes sense; I was also thinking that the initial iterations would be more unstable, so the algorithm probably ends up in a strange state that needs more epochs to sort out.

Maybe ... we could do single-threaded for the first x iterations, and then switch to batch mode for the remaining iterations. Sort of like how t-SNE has an early exaggeration phase to get the overall structure right.

The later iterations don't do much anyway - check this out:

[animation: MNIST embedding evolving over the epochs]

jlmelville commented 3 years ago

@LTLA that's a very neat gif. Is there currently (or planned to be) that kind of functionality in umappp? Adding a callback functionality to the C++ part of uwot after each epoch to plot this kind of thing would be interesting (to me at least).

Anyway, I have been prodding at my batch implementation a bit more, and I have a question: how does umappp organize its data structure and parallelization for the batch optimization, specifically the edges? uwot stores the "head" nodes of each directed edge in the positive_head vector, with the tails in the equivalent locations in positive_tail. Parallelization is over these edge lists, and there are varying numbers of neighbors for each node due to the symmetrization step, so there is no guarantee about which (or how many) data windows/threads a node will appear in (adopting the strategy of only processing the head nodes can't fix this issue).

This means that for my batch implementation I am currently forced to store each gradient update in ~a vector of vectors~ edit: its own gradient-sized contiguous chunk of a larger vector (of overall size n_threads * embedding.size()), then, in a subsequent non-parallel step, accumulate them into a single gradient vector of size n_nodes, and finally apply the gradient descent update. Apart from being slower (and a potential issue with different rounding errors if the results are summed in a different order with different numbers of threads), there is also the issue that the per-thread PRNG seed will result in different negative samples with different numbers of threads. This means that results are not reproducible as n_sgd_threads is varied. Is all of this a non-issue in umappp?
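For concreteness, the accumulation I have at the moment looks roughly like this (the names and the sign convention are illustrative, not the actual branch code):

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Each thread writes gradients into its own embedding-sized chunk of one big
// buffer, then a serial pass sums the chunks and applies the result.
// Summation order still depends on n_threads, so results can differ slightly
// across thread counts due to rounding.
void batch_epoch(std::vector<double>& embedding, std::size_t n_threads) {
    const std::size_t len = embedding.size();
    std::vector<double> per_thread(n_threads * len, 0.0);

    #pragma omp parallel num_threads(static_cast<int>(n_threads))
    {
    #ifdef _OPENMP
        const std::size_t t = static_cast<std::size_t>(omp_get_thread_num());
    #else
        const std::size_t t = 0;
    #endif
        double* chunk = per_thread.data() + t * len;
        // ... each thread adds its edges' gradient contributions into 'chunk' ...
        (void)chunk;
    }

    // Serial reduction into a single gradient vector, then the descent step.
    std::vector<double> grad(len, 0.0);
    for (std::size_t t = 0; t < n_threads; ++t) {
        for (std::size_t i = 0; i < len; ++i) {
            grad[i] += per_thread[t * len + i];
        }
    }
    const double alpha = 1.0; // learning rate (placeholder)
    for (std::size_t i = 0; i < len; ++i) {
        embedding[i] += alpha * grad[i];
    }
}
```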

LTLA commented 3 years ago

@LTLA that's a very neat gif. Is there currently (or planned to be) that kind of functionality in umappp? Adding a callback functionality to the C++ part of uwot after each epoch to plot this kind of thing would be interesting (to me at least).

Sure, that's the main reason for the existence of the epoch_limit argument in the various umappp::Umap::run() methods. The idea is to run to the specified epoch number and then return so that we can save the embedding at that state. Then restart the algorithm by calling run() with a higher epoch limit, capture the embedding again, and repeat until you get to the max epoch number. yaumap has some demo code, returning to R every ticks epochs to capture the progression of the algorithm.

The GIF generation code is in the animations.Rmd vignette, if you want to see what happens on the R side. Note that the GIF rendering (not the UMAP, just the rendering) takes forever; the MNIST example must have taken 10 minutes. Not sure what I'm doing wrong there; maybe there's a smarter way to do it.

I must say, the UMAP animations are pretty sedate. The t-SNE ones are much wilder, lots of action going on there.

how does umappp organize its data structure and parallelization for the batch optimization, specifically the edges?

The physical storage is pretty much inherited from uwot, i.e., a vector of head values (*), another of tail values, plus epochs_per_sample. The key is that I don't loop across the edges directly. Instead I loop across the unique head values, and for each head, I have an inner loop that processes all of the edges associated with that head value. The outer loop is parallelized, so each head value is only ever operated on by one thread at a time. Then, if an asymmetric update is used for the attractive forces, the coordinates of a head value will only ever be modified in the outer iteration for that head, making it all thread-safe.

The PRNG seeding is handled by generating a vector of seeds at the start of each epoch. One seed is generated for each head value, and it is used to seed another PRNG inside each outer iteration. This gives me a reproducible stream of random numbers that is (more or less) different for each head value. Technically, this approach would be even better with the PCG32 generator, as it's easier to explore the full state space (e.g., generate two 32-bit numbers and slap them together to create a 64-bit seed for PCG32) and it's safer to generate truly head-specific streams (set the stream to the index of the outer loop).

(*): I just noticed that I think my head is your tail and vice versa. Oops. But seeing as how the edges are symmetric, it shouldn't matter in terms of correctness, but it does explain why the final plots are different between umappp and uwot.
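Putting the loop structure and the seeding together, an epoch might be sketched like this (illustrative only, not the actual umappp code; `offsets`/`tail` are CSR-style edge storage and std::mt19937 stands in for the real PRNG):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Edges for head value h live in [offsets[h], offsets[h+1]) of 'tail'.
// One seed per head is drawn serially, then the parallel loop over heads
// seeds a local RNG from it, so the negative sampling is reproducible
// regardless of how many threads run the loop.
void run_epoch(const std::vector<std::size_t>& offsets,
               const std::vector<std::size_t>& tail,
               std::mt19937& master) {
    const std::size_t n_heads = offsets.size() - 1;

    std::vector<std::uint32_t> seeds(n_heads);
    for (auto& s : seeds) s = static_cast<std::uint32_t>(master()); // serial

    #pragma omp parallel for
    for (std::ptrdiff_t h = 0; h < static_cast<std::ptrdiff_t>(n_heads); ++h) {
        std::mt19937 rng(seeds[h]); // per-head stream (PCG32 in the real thing)
        for (std::size_t e = offsets[h]; e < offsets[h + 1]; ++e) {
            const std::size_t t = tail[e];
            // ... attractive update for edge (h, t); negative samples drawn from rng ...
            (void)t;
        }
        (void)rng;
    }
}
```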

jlmelville commented 3 years ago

Thank you for the detailed answer, which I will mull over at length. I may stick with my current memory-hogging alternative while I investigate whether a more sophisticated optimization method or mini-batches will bring the number of epochs down to a level comparable with the in-place update, but your way seems a lot better.

I must say, the UMAP animations are pretty sedate. The t-SNE ones are much wilder, lots of action going on there.

Yes, I've seen that too: presumably you observe it early on? It's because of the early exaggeration phase. The spectral initialization of UMAP should approximate that part of the t-SNE optimization.

(*): I just noticed that I think my head is your tail and vice versa. Oops. But seeing as how the edges are symmetric, it shouldn't matter in terms of correctness, but it does explain why the final plots are different between umappp and uwot.

~I think this is my oops, as I recently checked and uwot is the wrong way round compared to the Python UMAP. I suspect catching all the places where this matters will be a real pain to fix so it might never happen, although I would like to be consistent.~

Edit 3: I managed to oops my oops. I just double-checked and uwot is the same way round as Python UMAP. Where it's the "wrong" way round is when you transform new data. In that case, Python UMAP has the head be the ordered indices of the new data (whose coordinates will move) and the tail be the unordered indices of the original data (which will not move). uwot correctly has the head data be the new nodes, but they are ordered. But I may have done that on purpose back in the mists of time to get one optimization function moving the right coordinates. And for parallelizing over the head nodes only to work, it would need to be this way around. I think.

Edit: also, does your threading scheme rely on the structure of head/tail in any way (e.g. assuming indices are ordered in one of the arrays) or does every thread iterate over the entire node list checking if the head or tail index is inside its window, on the assumption that this is always fast compared to doing the actual work of calculating the gradient and updating the embedding?

Edit edit: to answer my own question, as the tail is always sorted (at least in uwot) you can iterate over an array of pointers in the style of a sparse compressed matrix. Duh.
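For the record, building those pointers from the sorted index vector is just a counting pass and a prefix sum (a hypothetical helper, not code from uwot):

```cpp
#include <cstddef>
#include <vector>

// Given the sorted node indices of one side of the edge list, build CSR-style
// row pointers so that the edges belonging to node i occupy [ptr[i], ptr[i+1]).
// Each parallel outer iteration can then grab exactly its own node's edges
// without scanning the whole edge list.
std::vector<std::size_t> build_pointers(const std::vector<std::size_t>& sorted_nodes,
                                        std::size_t n_nodes) {
    std::vector<std::size_t> ptr(n_nodes + 1, 0);
    for (std::size_t node : sorted_nodes) ++ptr[node + 1];          // count edges per node
    for (std::size_t i = 0; i < n_nodes; ++i) ptr[i + 1] += ptr[i]; // prefix sum
    return ptr;
}
```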

jlmelville commented 3 years ago

Forgot to mention that the batch branch now contains a version of @LTLA's batch approach. This required a fairly hefty re-write of the optimization code and lacks any of the simplicity or clarity of the umappp implementation, but it does use the same techniques and achieves the desired effect of allowing multiple threads and reproducibility.

It does also currently require more epochs to optimize, so it will stay on a branch until I work out a suitable optimization scheme (and then undo my typically over-enthusiastic application of templates which has finally brought the Rtools compiler to its knees and hugely increased compilation times).

Getting there, though.

LTLA commented 3 years ago

Sounds like progress.

For a completely left-field idea: I managed to "parallelize" my broader workflow by running the UMAP iterations concurrently with other stuff that I needed to do. Specifically, I would run the neighbor detection in parallel, and then I would give that information to a separate thread that would perform the iterations. On another thread, I would perform other steps of my workflow, e.g., clustering and t-SNE. In this manner, I managed to keep all threads occupied, even if the UMAP itself was not parallelized.

Yeah, kind of cheating. But it does avoid all the headaches of trying to modify UMAP's logic, and this should be implementable with uwot right now, given that umap can already take a neighbor list as input. In my case, I was doing this via OpenMP, but that's only because I was already operating in C++. I believe R users/developers would be able to implement this fairly easily by calling parallel::clusterApply() with a FORK cluster (on non-Windows, otherwise a socket cluster).
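As a rough illustration of the OpenMP version of that idea (every function name here is a hypothetical stand-in for a real workflow step):

```cpp
// Once the shared nearest-neighbor search is done, the serial UMAP
// optimization runs on one thread while other steps of the analysis
// (t-SNE, clustering, ...) run concurrently on the others.
struct NeighborList { /* indices + distances per point */ };

static NeighborList find_nearest_neighbors() { return {}; } // itself parallelized
static void run_umap_epochs(const NeighborList&) {}          // serial optimizer
static void run_tsne(const NeighborList&) {}
static void run_clustering(const NeighborList&) {}

void run_workflow() {
    NeighborList nn = find_nearest_neighbors();

    #pragma omp parallel sections
    {
        #pragma omp section
        { run_umap_epochs(nn); }

        #pragma omp section
        { run_tsne(nn); }

        #pragma omp section
        { run_clustering(nn); }
    }
}
```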

In fact, uwot could even provide a function for parallelized parameter sweeps. It seems that quite a few people will run umap multiple times with different seeds/parameters, just to make sure they have a representative view of the dataset. It could be fairly valuable + straightforward to parallelize that, which would provide the benefits without the uncertainties around the batch mode.

jlmelville commented 2 years ago

Like a zombie, slasher movie antagonist or undead monster of your choice, this issue thread has risen from the grave and is plodding inexorably forward.

The quite-similar-to-UMAP method PaCMAP uses Adam as its optimizer and seems to give good results, so it's likely I will use that. However, the MNIST digits are a pretty easy case, so I have been spending a lot of my free time benchmarking various parameter settings and datasets. For cases where batch+Adam with defaults introduces distortions or splits clusters that aren't split by the current UMAP optimization, I would like to know what (if anything) can be done to ameliorate the situation.

Also I got sucked into reading the dizzyingly large number of papers advocating various tweaks to Adam.

@LTLA I suspect parameter sweeps will go on my long list of things to get around to eventually, but I don't know exactly what would be helpful for uwot to provide versus someone just writing a function to do it themselves. Probably worth a separate issue, but I also don't want to waste your time documenting something when I don't know when I will get around to it.

LTLA commented 2 years ago

Very cool. I'll admit I don't quite understand the subtleties of the parametrization of the gradient descent, but if we can get a speed-up without our ad hoc batch mode, I'm all for that. I'll probably just piggy-back off your C++ implementation again when the time comes, but no rush; I think the current code in uwot/umappp is pretty fast as it is.

I'll do the honors and drive a stake through the heart of this thread.

jlmelville commented 2 years ago

The issue refuses to die: the optimizer will be used with the batch branch. The question at this point is whether there is a good set of parameters for it.

jlmelville commented 2 years ago

The above PR will be merged shortly. As discussed above, it also introduces batch mode and uses Adam for optimization, similarly to PaCMAP. I recommend using at least 500 epochs with batch mode, so larger datasets and those which need longer optimizations will see the most benefit from combining it with multiple threads.
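For reference, the per-coordinate Adam update is the standard one (Kingma & Ba); this is a generic sketch, not uwot's exact implementation or default hyperparameters:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Generic Adam update, applied once per epoch to the batched gradient.
struct Adam {
    double alpha = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
    std::vector<double> m, v; // first/second moment estimates
    long t = 0;

    void step(std::vector<double>& params, const std::vector<double>& grad) {
        if (m.empty()) { m.assign(params.size(), 0.0); v.assign(params.size(), 0.0); }
        ++t;
        const double bc1 = 1.0 - std::pow(beta1, t); // bias corrections
        const double bc2 = 1.0 - std::pow(beta2, t);
        for (std::size_t i = 0; i < params.size(); ++i) {
            m[i] = beta1 * m[i] + (1.0 - beta1) * grad[i];
            v[i] = beta2 * v[i] + (1.0 - beta2) * grad[i] * grad[i];
            const double mhat = m[i] / bc1;
            const double vhat = v[i] / bc2;
            params[i] -= alpha * mhat / (std::sqrt(vhat) + eps);
        }
    }
};
```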

Additionally, an epoch_callback parameter has been added. This is yet another innovation from umappp.

There will not be a new CRAN submission for this functionality until I have experimented with it a bit more, extended it to work with umap_transform, and then the usual assorted documentation and testing.

Apologies to @LTLA for saying I would try to get uwot integrated with umappp and then completely failing to do so. But at least, functionality-wise, they are now better aligned for a future effort.

jlmelville commented 2 years ago

Forgot to mention: I did also look at mini-batches, but they introduced some stability issues when used with momentum. There are any number of potential causes and solutions, but mini-batches already introduce a trade-off against thread utilization. Rather than go wandering down yet another rabbit-hole for another couple of months, I have declared batch bankruptcy: there is a lot of room for improvement, but it may have benefits for some users as it is. Hence the merge.

LTLA commented 2 years ago

Sweet - sorry for the late reply - this looks great. Looking forward to playing around with it. What's the ballpark % speedup?

jlmelville commented 2 years ago

Sweet - sorry for the late reply - this looks great. Looking forward to playing around with it. What's the ballpark % speedup?

For the MNIST digits and n_epochs = 500: with n_sgd_threads = 0, batch = FALSE umap takes about 130 seconds on my hexcore (8th gen Intel) laptop. With n_sgd_threads = 4, batch = TRUE it goes down to 37 seconds.

For the tumap method the relative times are 28 seconds and 9 seconds.

So a speed-up of ~3x with 4 threads. Arguably the non-batch mode is better optimized for any given number of epochs, but there's no good way for me to measure an equivalent optimization state between the two modes. I would expect the speed-up to be more impressive with umappp.

Caveat: I wouldn't be surprised if I have introduced a memory leak or other misfortune due to some sporadic strange behavior I have witnessed in RStudio on Windows. Sadly it wasn't the usual immediate R Session explosion variety that makes tracking things down easier. I will be attempting to valgrind it a bit before this gets anywhere close to a CRAN submission.

jlmelville commented 2 years ago

Caveat: I wouldn't be surprised if I have introduced a memory leak or other misfortune due to some sporadic strange behavior I have witnessed in RStudio on Windows. Sadly it wasn't the usual immediate R Session explosion variety that makes tracking things down easier. I will be attempting to valgrind it a bit before this gets anywhere close to a CRAN submission.

Did a bit of valgrinding and couldn't get it to report any errors, so it might be a false alarm.