kwikteam / klustakwik2

Fast software for high-dimensional cluster analysis using the masked EM algorithm for Gaussian mixtures
BSD 3-Clause "New" or "Revised" License

Consider trying Numba before C/Cython #3

Open rossant opened 9 years ago

rossant commented 9 years ago

If we find out that we can't achieve good performance in pure Python/NumPy, then I'd suggest trying Numba before trying C/Cython. I recently tried the latest version (0.17) and the performance improvements look very interesting in nopython mode. There are many constraints though, including:

That being said, Python loops over arrays are fast in Numba. Also, NumPy ufuncs can be created from pure Python kernels.
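To make those two points concrete, here is a minimal sketch (not code from KlustaKwik2): an explicit loop over an array compiled in nopython mode, and a ufunc built from a scalar Python kernel. The function names and kernels are made up for illustration, and the `try`/`except` fallbacks are only there so the snippet also runs, slowly, where Numba is not installed.

```python
import numpy as np

try:
    from numba import jit, vectorize
except ImportError:
    # Illustration-only fallbacks so the sketch runs without Numba.
    def jit(*args, **kwargs):
        return lambda func: func

    def vectorize(*args, **kwargs):
        return lambda func: np.vectorize(func)


@jit(nopython=True)
def sum_of_squares(x):
    # Hypothetical kernel: a plain Python loop over a 1-D array.
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total


@vectorize(["float64(float64, float64)"])
def squared_diff(a, b):
    # Scalar kernel; the decorator turns it into an elementwise ufunc.
    return (a - b) ** 2


x = np.arange(5, dtype=np.float64)
print(sum_of_squares(x))
print(squared_diff(x, 1.0))
```

The loop is written exactly as it would be in pure Python; only the decorator changes how it is executed.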

Finally, some advantages of Numba over C/Cython:

thesamovar commented 9 years ago

I'd love to use numba rather than C/C++ if it is good and I'll definitely give it a try.

I'm mildly concerned that it ties us into the Anaconda distribution though?

rossant commented 9 years ago

Yes that's a potential downside. I've just checked:

Otherwise installing it from source is probably hard.

That being said, if someone doesn't want to use the free Anaconda distrib or another scientific Python distribution with a package manager, that's their problem... I assume these people will be power users who will be okay with compiling Numba from source. And I expect the program to still work (slower) without Numba.

In all cases, I think it's worth trying and benchmarking an algorithm in Numba before moving to C/C++/Cython.
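The "still work (slower) without Numba" idea could look roughly like the sketch below. `optional_jit`, `HAVE_NUMBA` and `count_above` are hypothetical names chosen for illustration, not part of KlustaKwik2 or Numba; the only Numba call used is `jit(nopython=True)`.

```python
import numpy as np

try:
    from numba import jit
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False


def optional_jit(func):
    # Hypothetical helper: compile with Numba when it is installed,
    # otherwise return the pure Python function unchanged.
    if HAVE_NUMBA:
        return jit(nopython=True)(func)
    return func


@optional_jit
def count_above(x, threshold):
    # Toy kernel: count the entries of a 1-D array above a threshold.
    n = 0
    for i in range(x.shape[0]):
        if x[i] > threshold:
            n += 1
    return n


data = np.array([0.1, 0.7, 0.3, 0.9])
print(HAVE_NUMBA, count_above(data, 0.5))
```

With this layout Numba stays a soft dependency: the same kernel can be timed with and without it, which also gives a simple way to benchmark before deciding on C/C++/Cython.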

thesamovar commented 9 years ago

I don't agree that if they don't want to install Anaconda or similar it's their problem. I think if we don't make it easy to install and compatible with many different distributions it's our problem. However, maybe it will get easier to install over time. I do agree that it's worth trying it out though.

rossant commented 9 years ago

> I don't agree that if they don't want to install Anaconda or similar it's their problem. I think if we don't make it easy to install and compatible with many different distributions it's our problem.

Can you be more specific? In what case would someone not want to use Anaconda, Canopy, WinPython, or Chris Gohlke's installers?

rossant commented 9 years ago

A bit of background: KlustaViewa has always been horribly difficult to install. It was so bad that we ended up shipping our own portable Python distribution for Windows users (based on WinPython). Windows users had a one-click installer and the problem was solved. The problem wasn't solved for Linux and Mac users.

Eventually we forced our users to move to Anaconda. Now there are virtually no more installation problems.

thesamovar commented 9 years ago

Well I always used to use Python(x,y), for example, and you say that Canopy doesn't have it in the free version so essentially it doesn't have it at all (since who wants to pay $200/yr for free software?). I know people who just use the standard Python distribution too. At least for Windows 32 bit, installing numpy, scipy, etc. is fairly easy in this case (just double clicking on .exe installers). And on Linux people will probably be using all sorts of other package managers that may find numba difficult (not sure about this).

I don't know enough about what the problems were for KlustaViewa, but at least for Brian even though we have been tempted to go down the route of requiring specific distributions, we always managed to find a way to make it more cross-platform in the end. It's definitely more work, but I think it's worth it.

I'd rather strongly recommend a particular distribution (as we do with Brian) than require it.

thesamovar commented 9 years ago

Hey, @mstimberg: this isn't your project but do you feel like weighing in one way or the other?

rossant commented 9 years ago

> Well I always used to use Python(x,y), for example

I forgot this one -- looks like they don't have Numba. :( might be worth sending them a patch! ;)

> Canopy doesn't have it in the free version

It looks like they have it in the full version which is free for academics

> At least for Windows 32 bit, installing numpy, scipy, etc. is fairly easy in this case (just double clicking on .exe installers)

It looks like Chris Gohlke has 32-bit and 64-bit versions of Numba 0.17 for Windows, so it should also be a double-click install on any Python distrib for Windows (I just noticed that the installers are not .exe anymore, but wheels..?)

> And on Linux people will probably be using all sorts of other package managers that may find numba difficult (not sure about this).

I think the worst case scenario for Linux is not too bad: it should be straightforward to build Numba there. I'm more worried about Mac users but I don't know enough about this platform to tell.

> we always managed to find a way to make it more cross-platform in the end. It's definitely more work, but I think it's worth it.

Agreed

> I'd rather strongly recommend a particular distribution (as we do with Brian) than require it.

Agreed, and that's what I expect to do with KK2/KV etc. We'll put manual install instructions for power users on the website for sure. Anaconda or another Python distrib will be strongly recommended but never required for the software to work.

thesamovar commented 9 years ago

Python(x,y) has it on their issue list but anyway I think this distribution is getting a bit less relevant now. Good about Canopy being free for academics. I hadn't realised Chris Gohlke had 32 bit installers as well as 64 bit, that's nice. OK, so actually maybe it's not so bad after all.

Mac is always a problem - nothing we can do about that. ;)

rossant commented 9 years ago

> Mac is always a problem - nothing we can do about that. ;)

Yeah, but in our experience Anaconda solves it quite nicely! That's the only way our Mac users have been able to use KlustaViewa.

There's another point to make: using Numba instead of C/C++/Cython puts the installation/building/packaging burden on them (Continuum Analytics) rather than on us. I think it's more trouble to compile and provide binaries for all platforms ourselves than to depend on a more mainstream package like Numba.

mstimberg commented 9 years ago

Hi there, I tend to agree that numba is the way to go; keeping your library free of C/Cython extensions is a huge plus. If you go this route, you can also easily upload a conda package to binstar so that people can install it via conda install. (Installing from pypi with pip can otherwise be annoying, as dependencies are downloaded and built from source instead of using the conda packages.) Something else to keep in mind: numbapro is free for academics, so a little wrapper around the vectorize decorator could allow you to support both standard numba and numbapro (which could for example target the GPU).
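A "little wrapper around the vectorize decorator" might look something like the sketch below. `portable_vectorize` and `weighted_sq` are hypothetical names. The sketch relies only on the `target` argument of Numba's own `vectorize` (values such as "cpu" or "parallel"); how a NumbaPro or GPU backend would actually be selected is an assumption and is not exercised here.

```python
import numpy as np

try:
    from numba import vectorize as _numba_vectorize
except ImportError:
    _numba_vectorize = None


def portable_vectorize(signatures, target="cpu"):
    # Hypothetical wrapper: one decorator, several possible backends.
    def decorator(func):
        if _numba_vectorize is None:
            # Slow pure-Python fallback when Numba is not installed.
            return np.vectorize(func)
        return _numba_vectorize(signatures, target=target)(func)
    return decorator


@portable_vectorize(["float64(float64, float64)"])
def weighted_sq(x, w):
    # Scalar kernel applied elementwise by whichever backend is chosen.
    return w * x * x


values = np.array([1.0, 2.0, 3.0])
print(weighted_sq(values, 0.5))
```

Because the kernels only ever see the wrapper, switching backends would be a one-line change in `portable_vectorize` rather than an edit to every kernel.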

datnamer commented 9 years ago

fyi, it looks like nopython array expressions, allocation and loop fusion have been merged.

thesamovar commented 9 years ago

What does that mean exactly?

datnamer commented 9 years ago

It means you can write very fast numba code (on the order of C or close to it) while coding almost as if you were writing a regular Python function.