pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem
https://sparse.pydata.org
BSD 3-Clause "New" or "Revised" License

Use Cython, Numba, or C/C++ for algorithmic code #126

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

It seems likely that we'll want the ability to write fast numeric code in a low-level-ish language. There are a few options:

  1. Cython
  2. Numba
  3. C/C++, wrapped with a variety of other options

We should make a decision here before investing serious time in one path or the other.

There are, I think, a few categories of concerns:

  1. Speed
  2. Community adoptability (as a dependency)
  3. Barriers to attracting new developers
  4. Barriers to maintenance
mrocklin commented 6 years ago

My personal thoughts

The default today seems to be Cython. This would be a fine choice. It seems to be fine for speed and community adoptability. I do think that it imposes a serious barrier to attracting new developers (I'd say 80% of the candidate pool will avoid Cython), and it will force a more complex build-and-deploy process. For example, we can no longer just point people at master to try things out and get quick feedback, and we'll get questions from people on how to install from source on Windows, etc.

C/C++ is, I think, a little cleaner than Cython in terms of attracting new developers. I think that C tends to be more often within people's comfort zone.

Numba is great for new developers and maintenance, but has issues in community adoptability (folks aren't accustomed to depending on it; thoughts here). Numba also has issues if we're going to be doing a lot with dynamic data structures and want to use std::map, std::vector and friends. Having some familiarity with both Numba and Cython, I personally also find the development cycle to be much faster with Numba than with Cython.

Suggestion

If I were doing this work I would stick to Numba until numba became an issue (either due to a need for dynamic data structures or downstream libraries being unwilling to depend on it) and then I would switch to something else. The cost of starting with Numba to me seems to be zero. We don't have to update our release mechanism, we don't have to have special documentation or field extra compilation help requests. The cost to switch to Cython if necessary is likely very small. We'll likely be writing exactly the same code we would write for Cython, just with less stuff around it.

Numba is also likely to be faster without us thinking as much, which I think is useful when experimenting with new algorithms, like in #125.
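As a rough illustration, here's the kind of kernel I mean (hypothetical, not code from this library); delete the decorator and it's ordinary Python you can run and debug as-is:

```python
import numba
import numpy as np

# Hypothetical kernel: count indices present in both sorted 1-D
# coordinate arrays, via a merge-style scan.
@numba.jit(nopython=True)
def count_matches(a, b):
    i = j = n = 0
    while i < a.shape[0] and j < b.shape[0]:
        if a[i] == b[j]:
            n += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return n

count_matches(np.array([0, 2, 5]), np.array([2, 3, 5]))  # -> 2
```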

However

However, I'm unlikely to do any of this work in the near future, and the preferences of those doing the work should have precedence.

mrocklin commented 6 years ago

cc @rgommers @ogrisel @stefanv @shoyer for thoughts and hopefully critical feedback. In what system would you like to see low-level code in the pydata/sparse library written?

hameerabbasi commented 6 years ago

In favor of Cython

  1. Speed is near-native C++ with the benefits of Python.
  2. Community adoptability: We will only need it for development, not as a hard dependency, so it has zero impact.
  3. Barriers to attracting new developers: Only a few algorithms need to be in Cython/wrapped C++; the rest will be pure Python. Developers will have the choice of either C++ wrapped with Cython or pure Cython.
  4. Wrapping logic is really trivial. BLAS and friends have wrappers in SciPy, we could use those.
  5. Making DOK faster is a priority for me for batch assignments (i.e. ndarray and SparseArray assignments); this needs std::unordered_map (see the sketch after this list).
  6. Switching can be a somewhat expensive process down the line.
  7. @shoyer's algorithm for elemwise discussed in #1 will come in useful for CSD. Since the passes in that algorithm are likely to be slow, Numba could be almost twice as slow as std::vector. Other uses are very likely to pop up sooner or later.
  8. Debugging is better. This can be critical with complex logic.
  9. Interleaving code is relatively painless.
  10. We could re-use some things such as for radix sort.
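To make item 5 concrete, here is a rough sketch of the DOK batch-assignment hot loop (a hypothetical simplification, not our actual implementation). Each element pays for a Python-level tuple construction and a dict insert; std::unordered_map would cut that cost:

```python
def dok_batch_assign(data, coords, values):
    # data: dict mapping index tuples to scalars (the "dict of keys")
    # coords: (n, ndim) integer array of indices; values: (n,) array
    for k in range(len(values)):
        data[tuple(coords[k])] = values[k]
```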

In favor of Numba

  1. Pure Python. This could be useful for attracting devs.
  2. Faster performance overall.
  3. Iteration of N-dimensional arrays is possible without complex logic. This will be useful for advanced indexing.
  4. GPU support (?)

My Concerns

I guess it comes down to exactly two questions for me: Will we have to switch at some point (which I would like to avoid outright), or will Numba devs be willing to add support for accelerated dict and list operations, and interleaving? If the answers to both of these questions are in favor of Numba, then it's a no-brainer for me.

If, on the other hand, the answer to even one of these questions is "no"... then my vote is for a mixture of Cython and Cython-wrapped C++.

I guess my main worries with Numba come down to the following:

mrocklin commented 6 years ago

Some thoughts:

  1. I would not expect accelerated list/dict implementations from Numba. This is probably out of scope for them.
  2. Where possible I would like to see us avoid dynamic data structures for performance reasons
  3. I would not expect std::map to be significantly faster than Python's dict. At best I would hope for 2-3x improvements, not 10x.
  4. I find debugging to be much nicer in numba: I just remove numba.jit. I'd be curious to learn your debugging technique for Cython.
  5. Can you expand what you mean by interleaving?
  6. I think that the presence of Cython in the codebase is itself a barrier to adoption. People won't be able to run from master without passing through a compilation step.
  7. I'm surprised to hear that switching from Numba to Cython seems expensive to you. To me it seems like the only cost is the work saved in the first place. In my experience going the opposite direction (Cython -> Numba), code typically works after I strip out all of the annotations, rename the file from pyx to py, and put numba.jit on the function. For numeric for-loopy code they seem to be more or less equivalent to me.
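As a concrete example of point 7 (hypothetical kernel): the Numba version below becomes the Cython version mainly by moving it into a .pyx file, dropping the decorator, and adding declarations:

```python
import numba

# The Cython equivalent is this same loop with declarations added,
# e.g. `double[:] x, double[:] y` in the signature plus
# `cdef double s` and `cdef Py_ssize_t i` in the body.
@numba.jit(nopython=True)
def dot(x, y):
    s = 0.0
    for i in range(x.shape[0]):
        s += x[i] * y[i]
    return s
```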

However, mostly this is me trying to convince you not to adopt Cython prematurely. At this point you're doing most of the work and should probably decide.

As a co-maintainer, though, I would appreciate it if, prior to actually writing algorithmic code in Cython, you first set up the build process, packaging support (we should consider conda-forge), and documentation. I don't expect this to be particularly difficult, and I don't think we'll need any external libraries to speak of, but it's best to understand the costs ahead of time and build in solutions in case you leave this project in the future.

mrocklin commented 6 years ago

FWIW while I'm probably biased towards numba due to professional association I also have a lot of respect for Cython. As someone who has used both in different projects though (numba for personal things, cython for geopandas) and as someone who maintains a number of projects, the added cost of maintaining a Cython project in terms of person-hours is high, and that cost tends to be focused on fewer maintainers. I think I tend to optimize these days more to reduce and diffuse maintenance costs than most other things.

I personally would probably stop spending as much personal time fielding maintenance questions or build/release issues if we go for Cython. This is something that the community would need to pick up long term.

hameerabbasi commented 6 years ago

I would not expect std::map to be significantly faster than Python's dict. At best I would hope for 2-3x improvements, not 10x.

There is also std::unordered_map, which is 2x-3x faster than std::map and is the closest equivalent to dict; std::map is an ordered binary search tree rather than a hash table.

I'd be curious to learn your debugging technique for Cython

GDB supports Python natively, and Cython has extensions for GDB. It's on the command line, though (I've tried and failed to find a good IDE solution), so it is a bit of a pain compared to your solution. Also, it ties you to a specific interpreter build, so a conda env is a no-no.

Can you expand what you mean by interleaving?

Interleaving pure-Python and optimized code. Another problem is calling unoptimized code from optimized code.

People won't be able to run from master without passing through a compilation step

In my experiences going the opposite direction (Cython -> Numba) code typically works after I strip out all of the annotations, rename the file from pyx to py, and put numba.jit on the function.

There are also costs in terms of alienated developers we may have accrued that are used to Numba.

However, mostly this is me trying to convince you not to adopt Cython prematurely.

Of course, I understand the costs, and if the consensus is on Cython, then I would build these solutions in, along with docs, right at the very first check-in. I believe properly > quickly.

After your convincing and experience, I'm not too opposed to Numba, but would certainly like to wait for thoughts of others (particularly Scipy devs and whether they would consider it to be a blocker) before committing to it.

At this point you're doing most of the work and should probably decide.

My vote is still slightly in favor of Cython, but I'm open to Numba as an option. I don't believe I should make the decision, I would really wait for community consensus.

albop commented 6 years ago

I have started watching this project only recently, and although I was super happy to finally see a sparse multidimensional tensor lib that I could contribute to, I'm just an observer here. Funnily enough, I spent some time yesterday trying to implement specialized tensor products for DOK with numba, and found out that dicts are not a recognized type yet. I used ordered lists of tuples instead of dicts everywhere, which seems to be supported well, but @mrocklin must know the details better than I do.

Here is one argument in favour of numba, which I haven't seen listed above: in other projects I have found very useful the ability to generate code, based for instance on the number of dimensions, and jit-compile it into a fast, memory-savvy function. There are several instances where I found this approach to be much faster than using numpy's generic functions, and I wouldn't know how to do the same with Cython or C/C++ for that matter, as dimension combinations are not known at compile time.

hameerabbasi commented 6 years ago

At this point, if Scipy devs came out and said they would be okay to depend on this project if Numba was used, I'd go for Numba. I just had a look at the docs and they're significantly better than Cython's (which alone makes Numba worth considering).

It'd be nice to have another contributor, @albop. :-)

albop commented 6 years ago

@hameerabbasi : I was considering saying something about me contributing to the project, but wouldn't like to disappoint if it takes me a bit too long to start. I'll try to create some issues first ;-) (with the usecase I have in mind...)

hameerabbasi commented 6 years ago

@albop If you have any questions about where something is handled, or how to implement something, don't hesitate to raise an issue and cc me on it. :-)

shoyer commented 6 years ago

Numba actually does currently accelerate list and set. I don't know if accelerating dict is on their roadmap or not.

ogrisel commented 6 years ago

C++ is fine as long as the wrapper tool is modern enough to make it easy to package as a wheel (e.g. Cython or pybind11, no boost python please). I am fine with numba as a dependency as well. I can't speak for scipy developers.

hameerabbasi commented 6 years ago

They do, it's even been labelled a high priority. xref numba/numba#2096

Numba actually does currently accelerate list and set. I don't know if accelerating dict is on their roadmap or not.

This makes me much more comfortable using Numba.

rgommers commented 6 years ago

I can't speak for all SciPy developers, but if Numba continues to improve and be maintained (not sure of its status now with all the changes at Anaconda?) then at some point we'll likely adopt it. The concerns regarding build dependencies etc. are minor compared to all the C/C++/Fortran we have already. If you want me to start that discussion now on scipy-dev, I'd be happy to do so.

Re debugging: it depends on what you're debugging. If it's an issue with the logic of your code, then Numba is much better (just remove @jit). If it's a problem with Numba or Cython behavior though, then Cython's debugging support is far better than Numba's. The latter case is probably less common, but not that uncommon - both due to changes in new releases and things that don't work identically to plain Python code in the first place. I'd say that as a contributor I would prefer Numba, and as a library maintainer I'd prefer Cython.
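One way to make the "just remove @jit" workflow systematic is a module-level switch (a minimal sketch; the SPARSE_NO_JIT name is made up, and I believe Numba also ships a global NUMBA_DISABLE_JIT environment variable):

```python
import os
import numba

# Hypothetical flag: when set, kernels stay plain Python and can be
# stepped through with pdb.
if os.environ.get("SPARSE_NO_JIT"):
    jit = lambda f: f                  # identity decorator
else:
    jit = numba.jit(nopython=True)     # numba.jit with options returns a decorator

@jit
def kernel(x):
    return 2 * x
```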

We recently got a similar question from Pythran, and I came up with a list of factors for the decision at https://mail.python.org/pipermail/scipy-dev/2018-January/022334.html. The ones I'm not sure about for Numba are:

datnamer commented 6 years ago

How will the sparse array object be represented in numba? The JIT class capability is currently very limited.

hameerabbasi commented 6 years ago

@datnamer This is a discussion on how we will adopt Numba for our internal algorithms, to make our own library faster for users. It isn't much of a discussion about how to make our classes JIT-compatible.

hameerabbasi commented 6 years ago

If you want me to start that discussion now on scipy-dev, I'd be happy to do so.

That'd be really appreciated. 😃

hameerabbasi commented 6 years ago

cc @woodmd, since you're planning to contribute, you should get a say. Also, @nils-werner.

stefanv commented 6 years ago

As long as things go well, numba has an edge over Cython. But once they start to misbehave, I'm not sure a) how to identify that it is misbehaving in numba or b) how to debug numba and steer it in the right direction. We have good answers to these questions for Cython because we've been using it for a long time all over the scientific Python ecosystem, but if you can find similar answers for numba it would be tempting to use that instead (the code will almost certainly be simpler to implement and debug).

hameerabbasi commented 6 years ago

One question at large regarding Cython, which will be very useful later: How hard is it to make it work for general N-dimensional arrays? From what I can tell, only arrays of a specific dimension can be an input to a function.

I personally would probably stop spending as much personal time fielding maintenance questions or build/release issues if we go for Cython. This is something that the community would need to pick up long term.

Would you be willing to contribute code (for the pure Python parts) and do code reviews (for both parts)? FWIW, I only intend to move the bottlenecks (currently only indexing, sorting, matching) to Cython/Numba/C++, keeping all else as-is. I don't intend to write the entire thing in another language; Numpy already does a lot of that for us. I also care a lot about maintainability and attracting devs, and I'm sorry if it came across otherwise. Maintainability and community > a few ms shaved off an operation.

If building wheels is an issue, I have access to all three major OSs on my personal machine, and can easily build wheels for all of them (although I believe CircleCI/AppVeyor/Travis do this already).

You're a big part of this project at this stage, losing that could be a fatal blow to it.

nils-werner commented 6 years ago

Re nD data structures: All internal data of COO arrays are 1D or 2D, are they not?

hameerabbasi commented 6 years ago

There are a few use cases for N-D iteration in our code that I can think of at this point:

  1. Batch DOK assignment.
  2. Advanced indexing.
hameerabbasi commented 6 years ago

portability (exotic stuff like AIX, Raspberry Pi, etc.)

It depends on LLVM via llvmlite, which is quite portable, I think.

maintenance status

Last release was 25 days ago, code frequency seems to be good. Not sure about the future, though. @mrocklin might have insights on that.

mrocklin commented 6 years ago

I've raised an issue highlighting debugging issues in Numba here: https://github.com/numba/numba/issues/2788

Engagement there would be welcome.

mrocklin commented 6 years ago

I do not have any major concerns about the longevity of Numba. Folks probably shouldn't rely on my opinion here due to the conflict of interest. I'd be happy to chat in more depth with folks if desired.

hameerabbasi commented 6 years ago

this optimization will allow to compile inner loops in nopython mode regardless of what code surrounds those inner loops.

This was what I was talking about when I spoke of interleaving, so that concern can be put aside. Not too long ago, this wasn't supported IIRC. It also means templates and custom dtypes are supported.
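A minimal sketch of that pattern (hypothetical functions): arbitrary Python at the boundary, with only the hot loop compiled in nopython mode:

```python
import numba

@numba.njit
def inner_sum(x):
    # the hot loop: compiled in nopython mode
    s = 0.0
    for i in range(x.shape[0]):
        s += x[i]
    return s

def outer(arrays):
    # plain Python at the boundary: dicts, exceptions, logging, anything goes
    return {name: inner_sum(arr) for name, arr in arrays.items()}
```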

shoyer commented 6 years ago

One question at large regarding Cython, which will be very useful later: How hard is it to make it work for general N-dimensional arrays? From what I can tell, only arrays of a specific dimension can be an input to a function.

Numba has vectorize and guvectorize (which are great!) but in my (outdated) experience it was still easy to run into performance problems when trying to write code that works for any number of dimensions.

Using Cython alone may make this a little trickier since you need to know the number of dimensions at compile time, but there's always the option of using C++ libraries like xtensor which supports dynamic numbers of dimensions.
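To make the contrast concrete, a minimal sketch (assuming Numba's .flat iteration suffices for the kernel at hand): the single definition below is specialized per dtype/ndim at call time, whereas a Cython memoryview signature such as double[:, :] fixes the number of dimensions when the module is compiled.

```python
import numba
import numpy as np

@numba.njit
def total(a):
    # one source definition covers any ndim; Numba compiles
    # a specialization per input type
    s = 0.0
    for v in a.flat:
        s += v
    return s

total(np.ones(5))          # compiles a 1-D specialization
total(np.ones((3, 4, 2)))  # compiles a 3-D specialization from the same source
```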

hameerabbasi commented 6 years ago

Numba has vectorize and guvectorize (which are great!) but in my experience it's still easy to run into performance problems when trying to write code that works for any number of dimensions.

I see the last commit to that library was three years ago. Would you be kind enough to re-run those benchmarks? Numba may have changed significantly.

shoyer commented 6 years ago

I see the last commit to that library was three years ago. Would you be kind enough to re-run those benchmarks? Numba may have changed significantly.

Indeed, I'm sure things have improved since then -- certainly numba fixed many of the issues I raised a few years ago.

Unfortunately I don't have a benchmark suite available to run again. I had some notebooks, but those were lost on an old work computer.

It's still fair to say that writing N-dimensional code will be somewhat easier with Numba than Cython.

hameerabbasi commented 6 years ago

I'm sold on Numba. The only things missing now are:

  1. Some support in @jitclass comparable to Cython's cdef classes, for fast dicts (since the dict needs to live in a class, not just a function).
  2. dict acceleration.
  3. SciPy dev opinions.
rgommers commented 6 years ago

SciPy dev opinions.

Discussion started at https://mail.python.org/pipermail/scipy-dev/2018-March/022576.html

teoliphant commented 6 years ago

The changes at Anaconda will have only a net-positive impact on Numba. The Numba developers are well-supported there and in addition, my new company Quansight will also be looking to continue to support Numba. I'm all for improvements to Numba that make it easier to use it as a dependency.

jakevdp commented 6 years ago

It's great to see this discussion!

I've not done a lot with numba since https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/ (geez, has it been that long?) but I've been pleasantly surprised to see some of my core concerns addressed in that time, namely: ease of installation and difficulty debugging/troubleshooting.

If we're going to wait for numba to be as mature as Cython to adopt it widely, we'll be waiting forever — that's like a kid asking how long it will take for him to be the same age as his older brother.

I'd personally be pleased if the community started using numba more centrally, even if there were some growing pains attached to it. Having that many developers working with the project could only improve it, I suspect, particularly if numba's current development team has the bandwidth to support the additional users!

jakevdp commented 6 years ago

If the community as a whole is leaning toward this, it might be useful to organize folks who are willing to do some "case studies" – come up with a set of core algorithms that devs will implement in numba, to get a feel for the ease-of-use and pitfalls of using modern numba in practice. A few things come to mind:

  • pieces of the sparse library
  • core portions of the CSR/CSC matrix algorithms
  • a metric tree (KD Tree or Ball Tree) accepting JIT-compiled user-specified distance metrics
  • a few optimization algorithms accepting JIT-compiled functions to optimize
  • a few numerical integration algorithms accepting JIT-compiled integrands

If we collectively undertook a half dozen case studies like this, and found that (1) performance is as-good as or better than cython versions, and (2) debugging and trouble-shooting is not an issue, then I think it would be a clear win.

mrocklin commented 6 years ago

cc @gforsyth


hameerabbasi commented 6 years ago

If we're going to wait for numba to be as mature as Cython to adopt it widely, we'll be waiting forever — that's like a kid asking how long it will take for him to be the same age as his older brother.

Numba has come a long way even without wide community adoption. Adoption could get it more attention and developers and very well help it overtake Cython in most areas, which I would be happy to see given its ease of use.

pieces of the sparse library

That's the plan so far. The only (foreseeable) things missing are listed in https://github.com/pydata/sparse/issues/126#issuecomment-370534201

core portions of the CSR/CSC matrix algorithms

I don't plan to re-implement algorithms already in BLAS, LAPACK, ARPACK, etc. But elementary operations such as element-wise ops and reductions, definitely. However, I plan to use Numba only if things become bottlenecks or optimized implementations aren't possible with the tools in Numpy.

a metric tree (KD Tree or Ball Tree) accepting JIT-compiled user-specified distance metrics

This will need better support for @jitclass than Numba currently provides, I assume.

  • a few optimization algorithms accepting JIT-compiled functions to optimize
  • a few numerical integration algorithms accepting JIT-compiled integrands

Does Numba compile Python lambdas automatically, or attempt to? If so, these are trivial.

datnamer commented 6 years ago

Another point in favor of Cython is that it allows meaningful compile-time integration with type hints: https://github.com/cython/cython/issues/1672

This matters given that numpy is moving this way: https://github.com/numpy/numpy_stubs

ecurtin2 commented 6 years ago

Numba also has type annotation inference in the pipeline it seems:

https://github.com/numba/numba/issues/2334

I've played around a bit with Numba recently (development seems to be accelerating dramatically, with several new features landing monthly).

I tested out @guvectorize on some simple stuff while my labmate (who is experienced in CUDA-C) wrote the equivalent CUDA-C code, and the timings were identical. In more advanced use cases, you can use the lower-level CUDA Python API to do what you need, and IIRC it all compiles down to the same CUDA LLVM IR anyway. This is a big win in my opinion.
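For reference, the shape of such a kernel on the CPU (a sketch; the CUDA test added target='cuda' to the decorator):

```python
from numba import guvectorize, float64

# broadcasts elementwise addition over any leading dimensions;
# the (n)-sized core loop is what gets compiled
@guvectorize([(float64[:], float64[:], float64[:])], '(n),(n)->(n)')
def vec_add(a, b, out):
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
```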

I've never tried this in Cython, but I guarantee it's not as easy.

Numba also now supports creating C-callable functions (http://numba.pydata.org/numba-doc/dev/user/cfunc.html) which work with Scipy's integration routines as-is. You can also pass in a normal jit function, but that comes with Python-interpreter overhead.
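Roughly what that looks like, following the linked docs (the integrand here is arbitrary):

```python
from math import exp

from numba import cfunc
from scipy import integrate

@cfunc("float64(float64)")
def integrand(x):
    return exp(-x) / x ** 2

# quad calls the compiled function pointer directly, with no
# per-evaluation Python-interpreter overhead
result, error = integrate.quad(integrand.ctypes, 1.0, 100.0)
```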

So after playing around my assessment is as follows:

Disclaimer: I'm not a library dev I just really like numba.

Note: numba also supports ahead-of-time compilation now too.

datnamer commented 6 years ago

@ecurtin2, per @sklam 's comment:

I don't want to encourage users to annotate function types. I think the current numba type system is very limiting and it should be replaced.

It doesn't seem like the team is leaning towards using user-supplied type constraints, which I think are very useful. If this is the case, it will leave sparse out of sync with numpy (and pytorch, but that's not as important).

Also sounds like there will be significant churn in the type system at some point.

rgommers commented 6 years ago

Interesting discussion, but type annotations are not approved for use in NumPy or SciPy, and not particularly relevant for the decision.

shoyer commented 6 years ago

Indeed, type annotations for NumPy are still very new/experimental (more contributions would be welcome!). I'm sure we'll get there eventually, but Numba is hardly out of sync yet.

datnamer commented 6 years ago

@rgommers, at the moment, but it seems that they might be:

from https://github.com/numpy/numpy_stubs :

Eventually, once development has stabilized, we expect to merge these type stubs into the main NumPy repository.

Also regarding

The things that are hard to do in Numba are also nontrivial in Cython and basically require you to write C/C++ anyway. IMO Pybind11 dominates this domain.

I think Cython has a clear advantage in custom types. cdef classes can implement the iterator protocol, for example.

Not saying these should disqualify numba, just making sure they are raised.

Edit: Just saw @shoyer 's comment. I guess that clinches the issue re type hints for now.

tylerjereddy commented 6 years ago

Some thoughts from computational geometry / SciPy side:

hameerabbasi commented 6 years ago

does either approach have a clear advantage with "ragged" data structures (arrays where the rows have differing numbers of columns -- iterating over polygons, etc.)?

Since, in numba, list is accelerated, it might be that list[list] is accelerated; I'm not sure, I haven't tested it. Passing it outside accelerated code might be an issue though.
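One layout that I believe does work today (a sketch): represent the ragged rows as a list of 1-D arrays of differing lengths.

```python
import numba
import numpy as np

@numba.njit
def row_sums(rows):
    # rows: a list of 1-D arrays of differing lengths (ragged data)
    out = np.empty(len(rows))
    for i in range(len(rows)):
        s = 0.0
        for j in range(rows[i].shape[0]):
            s += rows[i][j]
        out[i] = s
    return out

row_sums([np.ones(3), np.ones(5), np.ones(2)])  # -> array([3., 5., 2.])
```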

numba/numba#2788 might be a better place to put this.

shoyer commented 6 years ago

I mentioned this in the SciPy mailing list discussion, but it would be interesting if Numba's ahead of time compilation suffices for our use cases. This would require specifying the number of array dimensions for all functions ahead of time, but if we can distribute pre-compiled extension modules that eliminates one of the largest challenges for Numba (the new runtime dependency).

shoyer commented 6 years ago

@datnamer perhaps we worded the README for numpy_stubs too strongly. It would be more accurate to say that we hope rather than expect to merge the stubs into numpy proper.

hameerabbasi commented 6 years ago

I mentioned this in the SciPy mailing list discussion, but it would be interesting if Numba's ahead of time compilation suffices for our use cases.

We lose out on potential performance and ease of use (we would have to specify not just the dimensions, but also the memory order and dtype of each array, etc.). That doesn't seem like an ideal solution to me; I'd rather have the dependency. Not to mention that for N-d it may not be ideal, even with Numpy's 32-dimension limit.

AOT compilation only allows for regular functions, not ufuncs.

Welp. That throws it out the window.

seibert commented 6 years ago

Hi, trying to catch up on this very fast moving discussion thread, so let me try to shotgun answer a bunch of questions I'm seeing:

For the 1.x series, we would make the commitment not to break any APIs we indicate as stable or break any working compiled code. (We can change the behavior of non-working code, or make more code work, of course.)

OK, apologies for the length of this. I think this answers a number of questions I saw, but feel free to prod me if I missed a question.

hameerabbasi commented 6 years ago

@seibert Thanks for the important insights into Numba. Two questions: Is accelerating list[tuple[int]] supported (given that the tuple size will be fixed throughout the list)? I know dicts are upcoming; will they support dict[tuple[int]] acceleration?

sklam commented 6 years ago

@hameerabbasi

Yes and yes. Numba treats tuple[int] as an immutable value type. Numba's list/set can take any element of non-reference type, which usually means numbers and tuples of numbers. But the fixed size of tuples may make using them in a list awkward.
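A quick sketch of what that looks like (hypothetical function; the tuple shape stays fixed throughout):

```python
import numba
import numpy as np

@numba.njit
def nonzero_coords_2d(a):
    out = []                       # element type inferred from the appends below
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if a[i, j] != 0:
                out.append((i, j))  # fixed-size (int, int) tuple as a value type
    return out

nonzero_coords_2d(np.eye(3))  # -> [(0, 0), (1, 1), (2, 2)]
```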