pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License
1.45k stars, 158 forks

FFTW vs numpy.FFT speed gain #128

Closed dkobak closed 4 years ago

dkobak commented 4 years ago

The README says:

To squeeze the most out of openTSNE, you may also consider installing FFTW3 prior to installation. FFTW3 implements the Fast Fourier Transform, which is heavily used in openTSNE. If FFTW3 is not available, openTSNE will use numpy’s implementation of the FFT, which is slightly slower than FFTW. The difference is only noticeable with large data sets containing millions of data points.

I have not tried openTSNE with FFTW. Is there no noticeable difference on e.g. MNIST? I see FIt-SNE being almost 2x faster than openTSNE on MNIST (n=70000) with most of the speed-up taken by the gradient descent, not the affinity matrix construction. I'd expect Cython to be almost as fast as compiled C++, and not 2x slower. But of course openTSNE has to switch back and forth between Cython and numpy, so I could imagine that could cause the slowdown. Would it be avoided by using FFTW?
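For context on why the FFT matters here: the repulsive forces in the FIt-SNE-style gradient are computed by convolving a smooth kernel with point masses on an interpolation grid, and that convolution is exactly what the FFT accelerates. A minimal numpy sketch of the trick (illustrative only, not openTSNE's actual code):

```python
import numpy as np

def fft_convolve_2d(kernel, signal):
    """Circular 2D convolution via the convolution theorem:
    convolution in real space is multiplication in Fourier space."""
    return np.real(np.fft.ifft2(np.fft.fft2(kernel) * np.fft.fft2(signal)))

# Toy example: smooth kernel convolved with random "point masses" on a grid.
n = 64
x = np.linspace(-3, 3, n)
kernel = np.exp(-(x[:, None] ** 2 + x[None, :] ** 2))
masses = np.random.default_rng(0).random((n, n))
field = fft_convolve_2d(kernel, masses)  # O(n^2 log n) instead of O(n^4)
```

Since the embedding is optimized over hundreds of iterations and each iteration needs such a convolution, even a modest per-call difference between FFT backends adds up.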

On my laptop FIt-SNE takes ~90 seconds, openTSNE with n_jobs=-1 takes ~180 seconds. That's with 50-dimensional input, after PCA.
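Timings like these are easiest to compare with a small best-of-N harness; the helper below is illustrative (the function name and repeat count are made up), and the openTSNE or FIt-SNE call to be timed is assumed:

```python
import time

def time_run(fn, repeats=3):
    """Call fn several times and keep the best wall-clock time;
    taking the minimum damps interference from background processes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

For example, something like time_run(lambda: TSNE(n_jobs=-1).fit(X)), with TSNE being openTSNE's estimator, would produce the kind of measurement quoted above.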

dkobak commented 4 years ago

I went ahead and tried it out: I installed fftw from conda-forge, recompiled openTSNE (it said FFTW3 header files found. Using FFTW implementation of FFT) and reran my code. I got ~120 seconds.

Looks like a big speed-up!

Given that fftw is available on conda-forge, what's the benefit of not depending on it by default?

In any case, I would suggest stating more explicitly in the README that one can install via

conda install --channel conda-forge fftw
conda install --channel conda-forge opentsne

to make it much faster.

pavlin-policar commented 4 years ago

That's quite significant indeed. I suppose I never really benchmarked this without FFTW, and in that case FFTW should be the default. The timing is a bit curious though. In my benchmarks, I found openTSNE to actually be marginally faster than FIt-SNE when using 8 cores and slower when using 1 core. I haven't figured out why yet. So it's interesting to see that openTSNE is slower than FIt-SNE using multiple cores on your end. What CPU, OS, and Python version are you using, if you don't mind me asking? I ran all the benchmarks on an Intel Xeon, which is probably different from laptop CPUs.

Would it be avoided by using FFTW?

Yes, if it uses FFTW there's no need to go back into Python-land; I call the FFTW libraries directly.

conda install --channel conda-forge opentsne

wouldn't be the correct thing to do here, I think, because the conda packages come precompiled. FFTW needs to be available at compile-time, so on the conda-forge servers. The correct thing to do would probably be to compile openTSNE from source using python setup.py install.

This is something that can probably be done, but I realised after the release yesterday that Annoy doesn't provide wheels, and it doesn't provide Windows conda builds. So that's going to have to take priority, since I can't release a new version onto conda-forge until that's taken care of.

dkobak commented 4 years ago

In my benchmarks, I found openTSNE to actually be marginally faster than FIt-SNE when using 8 cores and slower when using 1 core. I haven't figured out why yet. So it's interesting to see that openTSNE is slower than FIt-SNE using multiple cores on your end. What CPU, OS, and Python version are you using, if you don't mind me asking? I ran all the benchmarks on an Intel Xeon, which is probably different from laptop CPUs.

Right now I am running everything on my not-very-powerful laptop running Ubuntu and Python 3.7. But my recollection is that FIt-SNE has always been faster than openTSNE on my lab computer, which is a powerful workstation (also Ubuntu). So that's strange.

because the conda packages come precompiled. FFTW needs to be available at compile-time, so on the conda-forge servers.

I don't really understand how all this works, but somebody put FIt-SNE on conda, and it seems to work, even though it uses FFTW.

I realised after the release yesterday that Annoy doesn't provide wheels, and it doesn't provide Windows conda builds.

Oops. Again, I am not sure what this means exactly, but it sounds like a mess. Is it something you can work around?

pavlin-policar commented 4 years ago

The speed difference I noticed was marginal, but it was consistent. This was run with FFTW. FIt-SNE is faster on a single core. Again, this was on a Xeon processor, so I'd like to run this on laptop computers as well before making any definitive statements. And to figure out why this might be happening...

[benchmark plot]

When you make a Python package that includes C code, you can distribute it in two ways:

1. a source distribution (sdist), a tar file containing the source code, which gets compiled on your computer when you run pip install x;
2. wheels, which are compiled ahead of time somewhere else, so you get a binary you can use straight away.

Why wheels? Because some people (mainly Windows folks) don't have a C compiler installed, or they might not have the appropriate libraries. Without a compiler, the sdist is useless to them, but a wheel will run straight away. Conda is similar to wheels in that it builds the package on its own servers: when you run conda install x you get the prebuilt binary, which you can use straight away without needing to compile locally.
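The two formats are easy to tell apart by filename. A small illustrative helper (not part of any packaging API, and the filenames below are made up):

```python
def dist_kind(filename):
    """Classify a Python distribution file as a wheel (prebuilt binary)
    or an sdist (source that must be compiled on install)."""
    if filename.endswith(".whl"):
        return "wheel"
    if filename.endswith((".tar.gz", ".zip")):
        return "sdist"
    return "unknown"

print(dist_kind("openTSNE-0.1.0-cp37-cp37m-win_amd64.whl"))  # wheel
print(dist_kind("openTSNE-0.1.0.tar.gz"))                    # sdist
```

The wheel filename also encodes the Python version and platform it was built for, which is why a separate wheel has to be built for each supported combination.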

In both FIt-SNE and openTSNE, FFTW needs to be available at compile-time, when the binaries are being built. So in your example above, running conda install fftw would pull down the binary for fftw, and conda install openTSNE would pull down the binary for openTSNE. The latter was already compiled somewhere else, so having FFTW installed locally doesn't do anything. However, if you compiled from source, i.e. python setup.py install or even pip install openTSNE (assuming it pulls down the sdist and not a wheel), after conda install fftw, then compilation would occur locally on your computer and could use the FFTW headers.
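A hedged sketch of what "available at compile-time" means in practice: a build script probing for the FFTW header before deciding which FFT backend to compile in. The function name and search paths here are illustrative, not openTSNE's actual setup.py:

```python
import os

def has_fftw_headers(include_dirs=("/usr/include", "/usr/local/include")):
    """Return True if fftw3.h is visible in any of the given include
    directories. A build script would use a check like this to decide
    whether to compile against FFTW or fall back to numpy's FFT."""
    return any(os.path.isfile(os.path.join(d, "fftw3.h")) for d in include_dirs)

# The build could then toggle a compile-time macro accordingly:
macros = [("USE_FFTW", "1")] if has_fftw_headers() else []
```

Since this check runs wherever compilation happens, installing FFTW locally only helps if the package is actually built locally, which is the point being made above.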

Hopefully, that clears things up a bit. Don't hold me to all the details, but conceptually, that's how it works 😄

dkobak commented 4 years ago

Again, this was on a Xeon processor, so I'd like to run this on laptop computers as well before making any definitive statements.

Interesting. Definitely run it on your laptop, and if it still disagrees with what I see on my laptop, we can try to dig into it deeper. My laptop processor is an Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz.

Hopefully, that clears things up a bit.

Yes it does! Very helpful, thanks.

So for FIt-SNE package on conda, the FFTW has been built in at compilation time on conda servers? Could you use the same approach for your conda package then? Then anybody installing openTSNE from conda would get it with FFTW, which would be a good thing?

Annoy doesn't provide wheels, and it doesn't provide Windows conda builds.

So it's not possible to install Annoy from conda under Windows? Which means that it's currently not possible to install openTSNE from conda under Windows? What could be a solution to that? I think FIt-SNE includes all the necessary Annoy .lib files for Windows within its source code... That said, the conda package for FIt-SNE currently does not work under Windows :-/

pavlin-policar commented 4 years ago

So for FIt-SNE package on conda, the FFTW has been built in at compilation time on conda servers?

Yes, exactly: the build command is specified in build.sh in the FIt-SNE feedstock, which is run on the conda-forge servers.

Could you use the same approach for your conda package then?

Yes, this would be the goal. I think I actually tried this at some point but forgot why I abandoned it.

So it's not possible to install Annoy from conda under Windows?

It's possible; it's just going to use the old version, v0.3.12, because I didn't want to merge something that doesn't work on Windows. Pip pulls from PyPI, which has v0.4, while conda has the old version. I will try to get this resolved ASAP.

I'll probably end up packaging Annoy into openTSNE, like FIt-SNE does, and make my own bindings to it. That's probably the fastest solution. I've opened up an issue on Annoy, and they seem open to doing it, but my experience is that these types of things don't tend to move that quickly. Setting the build process up can be a real pain. Ideally, Annoy would include a conda binary for Windows (which it currently doesn't), as well as wheels (which it currently doesn't).

pavlin-policar commented 4 years ago

Ok, so I just tested MNIST on my Macbook pro which has an Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz.

openTSNE (FFTW): 119.76 sec
openTSNE (numpy): 113.29 sec
FIt-SNE: 57.38 sec

I think the two openTSNE runs are within the error bounds of each other, and after several runs they even out. I definitely don't see the drastic difference you saw earlier. FIt-SNE is faster here on 8 cores, unlike on the Xeon processor. Maybe it has something to do with the CPU architecture or something... In any case, I need to benchmark things on other CPUs.

dkobak commented 4 years ago

That's weird, because it looks like it's basically the same processor. My runtimes are reported here: https://twitter.com/hippopedoid/status/1257673387184394240. (That's after I recompiled openTSNE with FFTW. Before it was just over 3 minutes.)

pavlin-policar commented 4 years ago

Here are the benchmarks from the server. These are just the results of a single run, but it should give a rough idea of how things behave.

                 1000   100000   250000   500000   750000   1000000
FFTW (1 core)      16      342      896     1878     2838      3814
numpy (1 core)     29      358      924     1921     2904      3922
FFTW (8 core)      14       97      252      547      860      1185
numpy (8 core)     25      108      265      578

(top row indicates data set size, numbers in cells indicate time in seconds)

So it definitely looks like there's about a 20-30 sec difference between the FFTW implementation and numpy implementation, which is consistent with what you were seeing. The fact that it's a pretty constant difference regardless of data set size supports this theory. Benchmarks on personal laptops are always shaky because who knows what else is running in the background. In any case, it's clear that numpy is slower, so I'll try to get everything to use FFTW.
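For reference, the per-size gaps between the two backends can be read directly off the single-core rows of the table; a stdlib-only sketch using the numbers above:

```python
# Single-core timings (seconds) copied from the benchmark table above.
sizes = [1_000, 100_000, 250_000, 500_000, 750_000, 1_000_000]
fftw_s = [16, 342, 896, 1878, 2838, 3814]
numpy_s = [29, 358, 924, 1921, 2904, 3922]

# Absolute numpy-minus-FFTW gap at each data set size.
gaps = {n: t_np - t_fftw for n, t_fftw, t_np in zip(sizes, fftw_s, numpy_s)}
print(gaps)
```

Laying the gaps out like this makes it easy to judge how the backend difference behaves as the data set grows.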

pavlin-policar commented 4 years ago

So I ran the MNIST benchmark again on my MacBook:

openTSNE (FFTW): 76.02 sec
openTSNE (numpy): 80.09 sec
FIt-SNE: 55.04 sec

The difference between FIt-SNE and openTSNE is not as large as it was the first time I ran it. I haven't really changed anything in any meaningful way, other than moving annoy to the repo, so I don't know what was going on in the benchmark I ran two days ago.

dkobak commented 4 years ago

In any case, FIt-SNE is the fastest, and openTSNE+FFTW is a little faster than openTSNE+numpy, so this qualitatively agrees with what I got on my laptop, even though for me the speed-up from FFTW was much higher. But who knows, maybe something was wrong with my test.

pavlin-policar commented 4 years ago

Either way, I am running the same set of benchmarks on another laptop with an i7, which is comparable to what most people have in their laptops. Xeon processors are super optimized and have a lot of enhancements and different instruction sets specifically for math-type operations, which consumer-grade CPUs don't have.

I will update the README so that it better describes the speed difference. Otherwise, I'm closing this, since there doesn't seem to be anything left to fix. I did try to get FFTW into the wheels and conda packages today, but I was unsuccessful, so this will have to wait a bit.

dkobak commented 4 years ago

I did try to get FFTW into the wheels and conda packages today, but I was unsuccessful, so this will have to wait a bit.

Too bad!

I will update the readme so that it better describes the speed difference.

Yeah, I think it basically needs a more explicit instruction for how to install it with FFTW (e.g. mention that FFTW can be installed from conda), and maybe also a note that the speed increase on some systems can be noticeable well before "millions" of points.