Closed daslu closed 10 years ago
Here is some basic information on the memory consumption of Clojure hash maps: http://nyeggen.com/blog/2013/02/03/space-efficiency-of-clojures-collections/
This would be a pretty substantial change. My gut feeling is to prefer protocols over multimethods.
However, I'd like to get @mikera's opinion about this sort of thing, since I imagine he has dealt with similar issues as part of core.matrix and his machine learning work. Any thoughts, Mike?
It certainly makes sense to allow for efficient encodings of feature vectors.
In most of my stuff the input to the actual algorithm is a big Clojure vector where each element is a core.matrix / vectorz-clj dense 1D feature vector (for both xs and ys). I normally need random access into the training data, so sequences of vectors aren't a good idea, but I think sequences would be OK for SVM (it's single pass, right?)
Obviously, raw data doesn't come in precisely this format, so I'm designing a mini-library to do some of this kind of stuff (still in the design phase, so sorry, nothing to share right now). The idea would be to allow a translation of (Arbitrary Clojure data -> Fixed length core.matrix (Vectorz) 1D feature vector).
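To make the idea of such a translation concrete, here is a minimal sketch in plain Clojure. It is not the design of the mini-library mentioned above, which has not been published. The name `make-encoder` and the shape of `spec` are assumptions made up for illustration, and the output is an ordinary Clojure vector of doubles rather than a core.matrix / Vectorz vector.

```clojure
;; Hypothetical helper, for illustration only.
;; spec is a vector of [key categories] pairs:
;;   categories = nil   -> numeric field, copied through as one double
;;   categories = [...] -> categorical field, one-hot encoded over that list
(defn make-encoder
  "Returns a function from a record (map) to a fixed-length vector of doubles."
  [spec]
  (fn [record]
    (vec (mapcat (fn [[k categories]]
                   (let [v (get record k)]
                     (if categories
                       ;; one slot per known category, 1.0 where it matches
                       (map #(if (= % v) 1.0 0.0) categories)
                       ;; a missing numeric value is encoded as 0.0
                       [(double (or v 0.0))])))
                 spec))))

(def encode
  (make-encoder [[:age nil]
                 [:color [:red :green :blue]]]))

(encode {:age 30 :color :green})
```

Every record maps to a vector of the same length (here 4), whichever keys it happens to contain, which is the property a fixed-length feature vector needs.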
I think protocols would be sufficient for what is needed here. Although multimethod overhead probably wouldn't hurt too much, since this isn't in the inner loop.
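A rough sketch of what a protocol-based approach could look like. The protocol name `FeatureVector` and the function `nonzero-entries` are hypothetical and are not part of the library's actual API. The sketch also assumes features are identified by integer indices.

```clojure
;; Hypothetical protocol, for illustration only.
(defprotocol FeatureVector
  (nonzero-entries [x]
    "Returns a seq of [index value] pairs for the non-zero features of x."))

(extend-protocol FeatureVector
  ;; Map of feature index -> value
  clojure.lang.IPersistentMap
  (nonzero-entries [m]
    (sort-by first (seq m)))

  ;; Set of active feature indices, each with an implicit value of 1.0
  clojure.lang.IPersistentSet
  (nonzero-entries [s]
    (map (fn [i] [i 1.0]) (sort s)))

  ;; Dense Clojure vector: the position is the feature index
  clojure.lang.IPersistentVector
  (nonzero-entries [v]
    (keep-indexed (fn [i x] (when-not (zero? x) [i x])) v)))
```

With something like this, the training code would only call `nonzero-entries` on each example, and a new representation could be supported by extending the protocol to its type without touching the training code.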
P.S. if you are doing sparse vector stuff, vectorz-clj now has some semi-decent support for sparse vectors.
The bulk of this project was written almost 4 years ago, and to fix this issue (and others) I'd need to do a total rewrite. A total rewrite is low on my priority list right now, and likely will be until:
@daslu unless the latter is a possibility, your best bet for reducing memory usage is to just call liblinear directly with their sparse vector classes.
Many thanks for your remarks, @lynaghk and @mikera. You are so right, protocols are more appropriate here.
I am testing some possible use of protocols at my code (not a large change). I will let you know if it turns out useful.
... at my code ... I meant, at my fork.
Thank you for this very nice library.
Currently, 'train expects xs to be a sequence of objects which are either maps or sets.
For cheaper use of CPU and memory with large non-sparse datasets, it may be important to allow for other representations of the 'xs input to 'train.
Here are some possible types for xs:
I suggest making 'train a multimethod, so that it can accept more than one type of xs. I would be happy to implement this change. Any comments?
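As a rough illustration of the multimethod idea, the sketch below dispatches on the representation of the first example in xs. The name `xs->pairs` and the dispatch keywords are hypothetical. It is a stand-in that only normalizes the input lazily and does not call liblinear or reproduce the real 'train. Integer feature indices are assumed for the dense case.

```clojure
;; Hypothetical stand-in for a multimethod front end to 'train.
(defmulti xs->pairs
  "Returns, for each example in xs, a seq of [feature value] pairs."
  (fn [xs]
    (let [x (first xs)]
      (cond
        (map? x)    :maps
        (set? x)    :sets
        (vector? x) :dense-vectors
        :else       :unknown))))

;; Maps already hold feature -> value entries
(defmethod xs->pairs :maps [xs]
  (map seq xs))

;; Sets list the active features, each with an implicit value of 1.0
(defmethod xs->pairs :sets [xs]
  (map (fn [s] (map (fn [f] [f 1.0]) s)) xs))

;; Dense vectors: position is the feature index, zeros are skipped
(defmethod xs->pairs :dense-vectors [xs]
  (map (fn [v] (keep-indexed (fn [i x] (when-not (zero? x) [i x])) v)) xs))

;; Anything else (including an empty xs) is rejected
(defmethod xs->pairs :unknown [xs]
  (throw (IllegalArgumentException. "Unsupported representation of xs")))
```

Because `map` is lazy, each example is converted only when it is consumed, so a dense dataset does not have to be copied into maps up front.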