piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.68k stars 4.38k forks source link

Get rid of scipy #557

Open piskvorky opened 8 years ago

piskvorky commented 8 years ago

Scipy in gensim is only needed for its scipy.sparse toolbox for handling sparse matrices, which is a huge overkill. We'd like to finally get rid of scipy as a dependency, keeping only numpy.

Now that we have gensim wheels for Windows (@tmylk), even a compiled gensim module to replace scipy.sparse might no longer be a show stopper.

Or maybe get the scipy people to split out scipy.sparse into its own, independent package... that would be the best and most flexible (and easiest for us).

@honnibal How do you deal with sparse matrices in spaCy?

gojomo commented 8 years ago

Scipy is also used to get the BLAS operations in the word2vec/doc2vec cython code. (Maybe there's a way to use the BLAS directly, but right now it's via Scipy.)

honnibal commented 8 years ago

spaCy hasn't needed complicated vector maths, because I've just been using Averaged Perceptron. The disadvantage of this is that it's been harder to say, try Adagrad instead, or plug in the group lasso, etc. But on the other hand I've been able to write code I can read easily, and performance is pretty good.

Here's my sparse array: https://github.com/honnibal/thinc/blob/master/thinc/sparse.pyx https://github.com/honnibal/thinc/blob/master/thinc/sparse.pxd

The idea is that the cdef class wrapper is stateless, and everything is contained in the SparseArrayC struct. There are then static methods to do all manipulations. These could be isolated functions, but it gives me better feels to collect them on a class. They're inlined due to a limitation in Cython --- it only understands the staticmethod decorator on inlined methods. I've been meaning to report this bug, but haven't gotten around to it. These arrays are stored in a hash table, to make up the sparse matrix.

The truth is I've always found the traditional sparse matrix descriptions confusing, and the scipy.sparse docs quite opaque. So I've never really understood the established data structures for this, and I figure my problem has fairly extreme values. So it made sense to me to just write some code.

piskvorky commented 8 years ago

Great, thanks! Nothing beats the flexibility and control of rolling your own solution :) I like your approach.

To be honest, I've had my own ideas around sparse matrices as well. There are some code paths in gensim where having to know the final shape complicates things. I believe the concept of shape is often irrelevant, even misleading. Because we don't need the full array API, with all its flexibility and indexing -- just treat the shape as infinite (it's all zeros!) and support those matrix operations we really need, in the specific online algorithms.

With scipy deprecating its sparsetools suite #462 , it may be worth merging the two tickets and rolling our own "streamed sparse matrix" implementation.

@gojomo good point. We could hopefully get away with just NumPy's BLAS bindings, somehow. Asking users to link BLAS manually sounds like a pain (for them and for us).

maedoc commented 7 years ago

In recent versions of NumPy, you can roll sparse operations with e.g. numpy.add.reduceat in a fairly efficient manner. It'd be parallelizable too. For example, csr matrix-vector dot product looks something like

mvprod = np.add.reduceat(data * vec[col_idx], row_idx)

there's some overhead with the vec[col_idx] expansion, but if vec isn't changing often, it can be cached. In any case, it's nice to have a NumPy fallback which isn't hideously slow.

The main disadvantage is needed to know some of the sparse foramt details, e.g the above assumes CSR format.

ghost commented 6 years ago

A GraphBLAS matrix is a CSC matrix, so the way to get rid of Python may be to use ctypes to interface with GraphBLAS and extricate the minimum amount of CSC-related code from scipy.sparse as required. We can pass pointers to the underlying data in the CSC matrices to GraphBLAS.

Gensim is using CSR matrices in some situations..