nschneid opened this issue 10 years ago
scipy sparse matrices are the only game in town in the numpy world. The documentation is a little disappointing, though.
Going by the scikit-learn docs, I think dropping the .toarray() call at https://github.com/redpony/creg/blob/2a3563b1e1d95cf9c68ef40a4afb67a9279be241/creg2/creg2.py#L40 will leave the data as a sparse matrix. Does that work? @as1986, it might be worth trying to see whether it saves memory and speeds things up.
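I haven't checked the exact feature-extraction call in creg2, but the general pattern would look something like this (the vectorizer choice and toy data here are illustrative, not creg2's actual code):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy feature dicts, just for illustration.
feats = [{'word=the': 1.0, 'suffix=he': 1.0},
         {'word=cat': 1.0, 'suffix=at': 1.0}]

vec = DictVectorizer()
X = vec.fit_transform(feats)   # scipy.sparse matrix (CSR)
# X = X.toarray()              # <-- dropping this line keeps X sparse
clf = LogisticRegression().fit(X, [0, 1])  # sklearn estimators accept sparse X
```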
I have something that is hopefully a step in the direction of a working implementation with sparse matrices: https://gist.github.com/nschneid/9748235 (it includes the non-sparse code for comparison).
The sparse code is much slower than the original code on the iris dataset, which, after all, has dense instances. I have not tested it on a text dataset, and there are probably inefficiencies: e.g., I am constructing a sparse matrix for Hsqrt just to get the division to work out, even though it is really dense, and there is less reuse of matrix instances due to limitations in the APIs, though there may be better workarounds (one possibility is sketched below).
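For the Hsqrt division, one workaround (a sketch, assuming Hsqrt is a dense, strictly positive per-feature vector) is to right-multiply by a sparse diagonal of reciprocals instead of dividing:

```python
import numpy as np
from scipy import sparse

# Toy sparse design matrix; shapes and values are illustrative.
X = sparse.csr_matrix(np.array([[0., 2., 0., 1.],
                                [1., 0., 0., 0.],
                                [0., 0., 3., 0.]]))
Hsqrt = np.array([1., 2., 4., 8.])  # dense per-feature scale, all positive

# Dividing a sparse matrix by a dense vector densifies the result; scaling
# each column by 1/Hsqrt via a diagonal matrix keeps everything sparse.
result = X.dot(sparse.diags(1.0 / Hsqrt))
```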
For testing purposes, I have modified the code to only look at the first 10 training examples rather than sampling a different minibatch for each iteration. On the iris dataset, the loss values that the sparse code prints are close to, but do not exactly match, those of the original code. It is unclear whether this is a bug or a numerical-precision issue.
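As a quick sanity check, one could log the per-iteration losses from both versions and compare them at a loose tolerance (the numbers here are made up):

```python
import numpy as np

# Hypothetical per-iteration losses from the two implementations.
dense_losses = np.array([0.6931, 0.5214, 0.4302])
sparse_losses = np.array([0.6931, 0.5214, 0.4305])

# Passing at a loose tolerance suggests floating-point noise;
# failing suggests an actual bug.
print(np.allclose(dense_losses, sparse_losses, rtol=1e-3))
```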
This code does not implement regularization. For sparse L2 regularization there is a trick from Alex Smola's blog that I explain in my features document: apply the L2 decay lazily, catching up on a weight's accumulated shrinkage only when its feature next fires, so each update costs O(active features) rather than O(all features).
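A minimal sketch of that lazy-update idea (my reconstruction, not creg2's code; a fixed learning rate is assumed):

```python
import numpy as np

def lazy_l2_sgd_step(w, last_step, t, active, grads, eta, lam):
    """One SGD step touching only the active features of a sparse example.

    w         -- dense weight vector
    last_step -- step at which each weight was last fully updated
    t         -- current step number
    active    -- indices of features firing in this example
    grads     -- loss gradients for those features
    eta, lam  -- learning rate and L2 strength
    """
    for j, g in zip(active, grads):
        # Catch up on the decay skipped while feature j was inactive:
        # each plain step would have shrunk w[j] by (1 - eta*lam).
        w[j] *= (1.0 - eta * lam) ** (t - last_step[j])
        last_step[j] = t
        w[j] -= eta * g  # ordinary gradient step on the loss term

# Illustrative usage:
w = np.zeros(5)
last_step = np.zeros(5, dtype=int)
lazy_l2_sgd_step(w, last_step, t=3, active=[0, 4], grads=[0.5, -0.2],
                 eta=0.1, lam=0.01)
```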
Perhaps with scipy data structures: http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.dok_matrix.html#scipy.sparse.dok_matrix
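dok_matrix supports cheap random writes and can then be converted to CSR for arithmetic; a small sketch (shapes and values illustrative):

```python
from scipy.sparse import dok_matrix

M = dok_matrix((1000, 1000))  # dict-of-keys: efficient incremental writes
M[3, 17] = 2.5
M[3, 17] += 1.0               # in-place updates also work
X = M.tocsr()                 # convert to CSR for fast products/slicing
```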