piskvorky opened this issue 8 years ago
Scipy is also used to get the BLAS operations in the word2vec/doc2vec cython code. (Maybe there's a way to use the BLAS directly, but right now it's via Scipy.)
spaCy hasn't needed complicated vector maths, because I've just been using Averaged Perceptron. The disadvantage of this is that it's been harder to, say, try Adagrad instead, or plug in the group lasso, etc. But on the other hand I've been able to write code I can read easily, and performance is pretty good.
Here's my sparse array: https://github.com/honnibal/thinc/blob/master/thinc/sparse.pyx https://github.com/honnibal/thinc/blob/master/thinc/sparse.pxd
The idea is that the cdef class wrapper is stateless, and everything is contained in the SparseArrayC struct. There are then static methods to do all manipulations. These could be isolated functions, but it gives me better feels to collect them on a class. They're inlined due to a limitation in Cython --- it only understands the staticmethod decorator on inlined methods. I've been meaning to report this bug, but haven't gotten around to it. These arrays are stored in a hash table, to make up the sparse matrix.
The truth is I've always found the traditional sparse matrix descriptions confusing, and the scipy.sparse docs quite opaque. So I've never really understood the established data structures for this, and I figure my problem has fairly extreme values. So it made sense to me to just write some code.
Great, thanks! Nothing beats the flexibility and control of rolling your own solution :) I like your approach.
To be honest, I've had my own ideas around sparse matrices as well. There are some code paths in gensim where having to know the final shape complicates things. I believe the concept of shape is often irrelevant, even misleading. Because we don't need the full array API, with all its flexibility and indexing -- just treat the shape as infinite (it's all zeros!) and support those matrix operations we really need, in the specific online algorithms.
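A minimal sketch of that "shape is infinite" idea (the class and its names are hypothetical, not gensim code): no declared size, every absent coordinate is implicitly zero, and only the operations actually needed by an online algorithm are supported.

```python
from collections import defaultdict


class InfiniteSparseVector:
    """Sparse vector with no declared shape: absent coordinates are
    implicitly zero, so no final size ever needs to be known."""

    def __init__(self):
        self.entries = defaultdict(float)

    def add(self, index, value):
        # Online update: the "shape" grows implicitly with the data.
        self.entries[index] += value

    def dot(self, other):
        # Iterate over the smaller index set; zeros never materialise.
        small, big = sorted((self.entries, other.entries), key=len)
        return sum(v * big[i] for i, v in small.items() if i in big)


a = InfiniteSparseVector()
a.add(3, 2.0)
a.add(10**9, 5.0)   # a huge index costs nothing extra
b = InfiniteSparseVector()
b.add(3, 4.0)
```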
With scipy deprecating its sparsetools suite #462 , it may be worth merging the two tickets and rolling our own "streamed sparse matrix" implementation.
@gojomo good point. We could hopefully get away with just NumPy's BLAS bindings, somehow. Asking users to link BLAS manually sounds like a pain (for them and for us).
In recent versions of NumPy, you can roll sparse operations with e.g. `numpy.add.reduceat` in a fairly efficient manner. It'd be parallelizable too. For example, a CSR matrix-vector dot product looks something like

`mvprod = np.add.reduceat(data * vec[col_idx], row_idx)`

There's some overhead with the `vec[col_idx]` expansion, but if `vec` isn't changing often, it can be cached. In any case, it's nice to have a NumPy fallback which isn't hideously slow.
The main disadvantage is needing to know some of the sparse format details, e.g. the above assumes CSR format.
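Spelled out, the `reduceat` trick above looks like this (a runnable sketch; note the caveat that `np.add.reduceat` mishandles repeated offsets, so this assumes every row has at least one stored element):

```python
import numpy as np

# CSR representation of
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
data = np.array([1., 2., 3., 4., 5.])
col_idx = np.array([0, 2, 1, 0, 2])
indptr = np.array([0, 2, 3, 5])   # row start offsets

vec = np.array([1., 2., 3.])

# Multiply each stored value by the matching vector entry, then sum
# within each row's slice of the data array.
mvprod = np.add.reduceat(data * vec[col_idx], indptr[:-1])
# -> [7., 6., 19.], i.e. the same as the dense matrix @ vec
```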
A GraphBLAS matrix is a CSC matrix, so the way to get rid of Python may be to use ctypes to interface with GraphBLAS and extricate the minimum amount of CSC-related code from scipy.sparse as required. We can pass pointers to the underlying data in the CSC matrices to GraphBLAS.
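The pointer-passing part is straightforward, since a scipy CSC matrix is just three flat arrays. A sketch of exposing them to a C library via ctypes (no actual GraphBLAS call is made; the `import_csc` binding named in the comment is hypothetical):

```python
import ctypes
import numpy as np
from scipy.sparse import csc_matrix

m = csc_matrix(np.array([[1., 0.],
                         [0., 2.],
                         [3., 0.]]))

# Raw pointers to the three underlying CSC arrays; these are what a
# C library such as GraphBLAS would consume.
data_ptr = m.data.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
idx_type = np.ctypeslib.as_ctypes_type(m.indices.dtype)
indices_ptr = m.indices.ctypes.data_as(ctypes.POINTER(idx_type))
indptr_ptr = m.indptr.ctypes.data_as(ctypes.POINTER(idx_type))

# A hypothetical binding would then be called roughly like:
# grb.import_csc(m.shape[0], m.shape[1], indptr_ptr, indices_ptr,
#                data_ptr, m.nnz)
```

The arrays stay owned by scipy, so the matrix must outlive any C-side use of the pointers.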
Gensim is using CSR matrices in some situations.
Scipy in gensim is only needed for its scipy.sparse toolbox for handling sparse matrices, which is huge overkill. We'd like to finally get rid of scipy as a dependency, keeping only numpy.

Now that we have gensim wheels for Windows (@tmylk), even a compiled gensim module to replace scipy.sparse might no longer be a show stopper.

Or maybe get the scipy people to split scipy.sparse out into its own, independent package... that would be the best and most flexible option (and easiest for us).

@honnibal How do you deal with sparse matrices in spaCy?