kwikteam / klustakwik2

Fast software for high-dimensional cluster analysis using the masked EM algorithm for Gaussian mixtures
BSD 3-Clause "New" or "Revised" License

[WIP] Distributed processing #60

Open thesamovar opened 9 years ago

thesamovar commented 9 years ago

@nippoo @rossant I have now implemented distributed processing using IPython.parallel in this branch. It won't be highly efficient yet because I took a few shortcuts to get it working at all; those should be straightforward to improve later. Also, try_splits is not distributed yet, so this isn't any faster at the moment. However, it should work for normal iterations now, and I think that even with these inefficiencies a large data set will run much faster when distributed across multiple machines. Could you try it out? The IPython notebook in dev can be used; you need to start up an appropriate set of engines before running it. It may also be the case that KK2 needs to be installed on each machine it runs on? Not sure how IPython handles that.
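
Roughly the kind of setup I mean (a sketch only, not the code in dev; it assumes a local `ipcluster` is running and just checks that each engine can import KK2):

```python
# Rough sketch: connect to a running IPython.parallel cluster and confirm
# that klustakwik2 is importable on every engine.
# Start engines first with something like:  ipcluster start -n 4
from IPython.parallel import Client  # `import ipyparallel` on newer IPython

client = Client()   # connect to the default controller
view = client[:]    # direct view over all engines

def check_kk2():
    # Runs on each engine; raises ImportError if KK2 isn't installed there
    import klustakwik2
    return klustakwik2.__file__

print(view.apply_sync(check_kk2))  # one path per engine
```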

rossant commented 9 years ago

thanks @thesamovar we'll have a look! yeah KK2 probably needs to be installed on every machine first. I'm wondering if we can use docker to simplify deployment on many machines -- there should not be a significant performance hit, but that remains to be checked

nippoo commented 9 years ago

@rossant there's no need to build images or deploy them: assuming the hardware is the same, just share a folder on the PATH over NFS or similar. This is exactly how I share KK1/SD2 across our servers. Legion and every computing cluster run the same software on all nodes and share a directory over NFS. It's simple!

rossant commented 9 years ago

@nippoo neat!

rossant commented 9 years ago

@nippoo wait, this will only work for our code, not for conda dependencies..?

nippoo commented 9 years ago

It works for everything - just make sure the shared miniconda installation / venv is in the PATH...
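
As a quick sanity check, something along these lines (a sketch, assuming an IPython.parallel cluster is already up and a shared miniconda/venv is on each node's PATH) would confirm that every engine resolves to the same interpreter and picks up KK2 from the shared install:

```python
# Sketch: verify all engines use the shared interpreter and klustakwik2 install.
from IPython.parallel import Client

view = Client()[:]  # direct view over all engines

def where_am_i():
    # Runs on each engine; reports its Python interpreter and KK2 location
    import sys
    import klustakwik2
    return sys.executable, klustakwik2.__file__

for engine_id, (exe, kk2_path) in zip(view.targets, view.apply_sync(where_am_i)):
    print(engine_id, exe, kk2_path)
```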