lvdmaaten / bhtsne

Barnes-Hut t-SNE

running on multiple cores? #18

Open joernroeder opened 8 years ago

joernroeder commented 8 years ago

Hey, this is more a question than an actual issue: I'm mapping a dataset of 900,000 items with 32 dimensions using t-SNE on a multi-core machine, but since t-SNE is single-threaded I'm only using one core. Do you have any tips or tricks for splitting the dataset to parallelize the computation? Thanks in advance!

lvdmaaten commented 8 years ago

There are basically two important loops that should be straightforward to parallelize:

https://github.com/lvdmaaten/bhtsne/blob/master/sptree.cpp#L385 https://github.com/lvdmaaten/bhtsne/blob/master/tsne.cpp#L215

The first loop is actually embarrassingly parallel, so parallelizing it should be completely trivial. The second loop may be somewhat trickier because each iteration may access the same nodes of the tree, so the speed-ups you can get there are less predictable.
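To make that a bit more concrete, here is a rough sketch of the OpenMP pattern involved. The per-point work below is a simplified exact O(N^2) repulsion rather than the actual code at the linked lines, and the function and variable names are illustrative only.

```cpp
// Rough sketch, not the code at the linked lines: the inner work is a simplified
// exact O(N^2) repulsion instead of the Barnes-Hut tree traversal, and all names are
// illustrative. The relevant parts are the pragma and the reduction on the shared
// accumulator. Compile with -fopenmp (gcc/clang) or /openmp (MSVC).
#include <cstring>

void compute_repulsive_forces(int N, int D, const double* Y,
                              double* neg_f, double& sum_Q) {
    std::memset(neg_f, 0, (size_t)N * D * sizeof(double));
    double local_sum_Q = 0.0;
    // Iteration n writes only to neg_f[n*D .. n*D + D - 1], so iterations are
    // independent; the single shared accumulator gets an OpenMP reduction.
    #pragma omp parallel for reduction(+ : local_sum_Q) schedule(static)
    for (int n = 0; n < N; n++) {
        for (int m = 0; m < N; m++) {
            if (m == n) continue;
            double dist2 = 0.0;
            for (int d = 0; d < D; d++) {
                double diff = Y[n * D + d] - Y[m * D + d];
                dist2 += diff * diff;
            }
            double q = 1.0 / (1.0 + dist2);   // unnormalized Student-t similarity
            local_sum_Q += q;
            for (int d = 0; d < D; d++) {
                neg_f[n * D + d] += q * q * (Y[n * D + d] - Y[m * D + d]);
            }
        }
    }
    sum_Q = local_sum_Q;
}
```

For a loop with no shared accumulator at all (the embarrassingly parallel case), the reduction clause can simply be dropped.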

joernroeder commented 8 years ago

@lvdmaaten thanks for the infos. I'll look a bit deeper into the loops and play around with it as soon as i find some spare time for it :)

maximsch2 commented 8 years ago

I have an OpenMP-based version here: https://github.com/maximsch2/bhtsne. I don't like the binary file interface, so I'm also modifying it to build as a shared library and expose a simple C API. An extra pair of eyes is always useful when writing parallel code, so if anyone else is interested, go ahead and try it out!
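For the curious, here is a purely hypothetical sketch of what such a C API could look like; the function name, parameters, and error convention below are illustrative, not the actual interface in the fork.

```cpp
// Hypothetical sketch only: the function name, parameters, and error convention are
// illustrative and not the actual API exposed by the fork.
#ifdef __cplusplus
extern "C" {
#endif

/* Run Barnes-Hut t-SNE on N points of dimensionality D (row-major in X), writing the
 * N x out_dims embedding into the caller-allocated buffer Y.
 * Returns 0 on success and a nonzero error code otherwise. */
int bhtsne_run(const double* X, int N, int D,
               double* Y, int out_dims,
               double perplexity, double theta,
               int max_iter, unsigned int random_seed);

#ifdef __cplusplus
}
#endif
```

The appeal over the binary file interface is that callers can pass buffers directly instead of writing and reading files.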

lvdmaaten commented 8 years ago

I'm no OpenMP expert, but it looks good to me. What kind of speed-ups are you seeing compared to the non-OpenMP code?

maximsch2 commented 8 years ago

I haven't benchmarked it on big datasets yet, but I think I get around 1.3-1.5x on two cores on a smallish dataset (takes a couple of seconds to build). My main motivation was integrating it into a visualization system where I wanted to generate t-SNE graphs on the fly. I think making it available as a library (as opposed to writing and reading files), and encoding the output dimension as a template parameter (allowing the compiler to do fewer allocations and hopefully unroll all those loops over the dimensions, especially when out_dims=2,3), gave me at least a 3-4x improvement in performance.
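Concretely, the template trick looks roughly like the minimal sketch below; the names are illustrative and not the fork's actual code.

```cpp
// Minimal sketch (illustrative names): making the embedding dimensionality a
// compile-time constant lets the compiler fully unroll the innermost loops and keep
// small per-point buffers on the stack instead of heap-allocating them.
template <int NDIMS>
double squared_distance(const double* a, const double* b) {
    double dist2 = 0.0;
    for (int d = 0; d < NDIMS; d++) {  // trip count known at compile time -> unrollable
        double diff = a[d] - b[d];
        dist2 += diff * diff;
    }
    return dist2;
}

// The caller dispatches once on the requested dimensionality, e.g.:
//   if (out_dims == 2)      dist2 = squared_distance<2>(y_n, y_m);
//   else if (out_dims == 3) dist2 = squared_distance<3>(y_n, y_m);
```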

lvdmaaten commented 8 years ago

Nice! In an earlier version of the code, I had hardcoded the output dimension. Indeed, that was a lot faster than the version that is currently in the repo.

iraykhel commented 7 years ago

@maximsch2 Would it be complicated to provide Windows build support for your OpenMP implementation? Thanks :)

maximsch2 commented 7 years ago

I don't have easy access to a Windows machine, but I don't think I've added anything Unix-specific there. I think MinGW supports OpenMP, so you should be able to build it with gcc just as you would on Linux. Are you running into any specific issues on Windows?

EDIT: I've just checked, and apparently there is even a description of how to build it on Windows. I haven't updated the Makefile.win though, so it might be a bit broken... Can you try building just tsne_bin.cpp with Visual Studio to get a binary? Just that file; there's no need to add anything else.

EDIT2: I've pushed a version with updated Makefile.win, which has a higher chance of working.

maximsch2 commented 7 years ago

@lvdmaaten Answering your earlier question about speed-ups: I went ahead and benchmarked it on a 25000x50 dataset on a quad-core CPU (with hyper-threading, so it presents itself as 8 cores). Results:

OMP_NUM_THREADS=1
real    2m23.372s
user    2m23.269s
sys 0m0.091s

OMP_NUM_THREADS=2
real    1m49.877s
user    2m31.483s
sys 0m0.103s

OMP_NUM_THREADS=4
real    1m27.637s
user    2m34.543s
sys 0m0.139s

OMP_NUM_THREADS=8
real    1m24.935s
user    3m39.806s
sys 0m0.159s

This is the total time, including reading the file, building the tree (currently not parallelized; it takes around 25 seconds in this case), and doing 1000 iterations of embedding learning. The speed-up is not perfect, but some scaling is there.
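As a rough sanity check on these numbers (assuming the ~25-second tree build is the only serial part): the single-threaded run takes about 143 s, so the parallel portion is about 118 s, and Amdahl's law bounds the achievable speed-up at

25 + 118/4 ≈ 55 s  →  at most ~2.6x on 4 threads (observed ~1.6x)
25 + 118/8 ≈ 40 s  →  at most ~3.6x on 8 threads (observed ~1.7x)

so a fair amount is also going to parallel overhead, which the growing user times hint at.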

iraykhel commented 7 years ago

Thanks, seems to be working :)

lvdmaaten commented 7 years ago

Nice!

baobabKoodaa commented 7 years ago

I was able to get this version working on Windows, but the multicore version by maximsch2 just returns with a "non-zero return code" pretty fast. It doesn't give any more information even though verbose=True.

maximsch2 commented 7 years ago

@baobabKoodaa If you want, I can try running your script/data here on Linux to see if this is a Windows-specific issue.

baobabKoodaa commented 7 years ago

@maximsch2 Thank you for the idea and for the kind offer. I will try running it on a Linux machine at some point; for now I can make do with the single-core version.

kylemcdonald commented 7 years ago

Since it hasn't been mentioned yet, see https://github.com/DmitryUlyanov/Multicore-TSNE

rappdw commented 7 years ago

Based on the above discussion, we've made some modifications to the code in this fork and see significant performance improvements when running on multiple cores. (We've documented the performance tests and the resulting improvements here.)