more efficient representation of xs

daslu commented 10 years ago

Thank you for this very nice library.

Currently, 'train expects xs to be a sequence of objects which are either maps or sets.

For cheaper use of CPU and memory with large non-sparse datasets, it may be important to allow for other representations of the 'xs input to 'train.

Here are some possible types for xs:

A map from feature name to column (represented as a Java array, for example).
A more general protocol, e.g. a core.matrix Dataset: https://github.com/mikera/core.matrix/blob/develop/src/main/clojure/clojure/core/matrix/impl/dataset.clj

I suggest making 'train a multimethod so that it can accept more than one type of xs. I would be happy to implement this change. Any comments?
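
To make the proposal concrete, here is a minimal sketch of multimethod dispatch on the representation of xs. The names `xs-kind` and `rows-of` are hypothetical and not part of clj-liblinear. The sketch assumes a columnar xs is a non-empty map from feature name to a `double` array, with all arrays the same length.

```clojure
;; Hypothetical sketch: none of these names exist in clj-liblinear.

(defn xs-kind
  "Dispatch function: classify the representation of xs."
  [xs]
  (cond
    (map? xs)        :columns   ; {feature-name -> double-array}
    (sequential? xs) :rows      ; sequence of maps or sets (current format)
    :else            :unknown))

;; Normalize any supported representation into a sequence of rows.
(defmulti rows-of xs-kind)

;; The current format is already a sequence of rows.
(defmethod rows-of :rows [xs] xs)

;; Columnar format: build each row lazily, so only the column arrays
;; need to stay in memory. Assumes at least one column.
(defmethod rows-of :columns [xs]
  (let [n (alength ^doubles (val (first xs)))]
    (for [i (range n)]
      (into {} (map (fn [[k col]] [k (aget ^doubles col i)])) xs))))
```

With this shape, a 'train that calls `rows-of` on its input would accept either format, and new representations could be added with further `defmethod` forms.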

daslu commented 10 years ago

Here is some basic information regarding the memory consumption of Clojure hash maps: http://nyeggen.com/blog/2013/02/03/space-efficiency-of-clojures-collections/

lynaghk commented 10 years ago

This would be a pretty substantial change. My gut feeling is to prefer protocols over multimethods.

However, I'd like to get @mikera's opinion about this sort of thing, since I imagine he has dealt with similar issues as part of core.matrix and his machine learning work. Any thoughts, Mike?

mikera commented 10 years ago

It certainly makes sense to allow for efficient encodings of feature vectors.

In most of my stuff the input to the actual algorithm is a big Clojure vector where each element is a dense 1D core.matrix / vectorz-clj feature vector (for both xs and ys). I normally need random access into the training data, so sequences of vectors aren't a good idea, but I think sequences would be OK for SVM (it's single pass, right?)

Obviously, raw data doesn't come in precisely this format. So I'm actually designing a mini-library to do some of this kind of stuff (still in the design phase, so sorry, nothing to share right now). The idea would be to allow a translation of (arbitrary Clojure data -> fixed-length core.matrix (Vectorz) 1D feature vector).

I think protocols would be sufficient for what is needed here, although multimethod overhead probably wouldn't hurt too much, since this isn't in the inner loop.

P.S. if you are doing sparse vector stuff, vectorz-clj now has some semi-decent support for sparse vectors.
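
For comparison, the protocol route mentioned above could look roughly like the following sketch. `FeatureSource`, `row-count`, and `feature-value` are hypothetical names, not an existing clj-liblinear or core.matrix API. The row-oriented case here handles rows that are maps only (not sets), and the columnar case again assumes a non-empty map from feature name to a `double` array.

```clojure
;; Hypothetical sketch: this protocol is an illustration only.

(defprotocol FeatureSource
  (row-count [xs] "Number of training examples in xs.")
  (feature-value [xs i feature]
    "Value of feature in example i, or 0.0 if it is absent."))

(extend-protocol FeatureSource
  ;; Row-oriented: a sequential collection of maps.
  ;; nth on a vector is effectively constant time; on a lazy seq it is linear.
  clojure.lang.Sequential
  (row-count [xs] (count xs))
  (feature-value [xs i feature]
    (double (get (nth xs i) feature 0.0)))

  ;; Column-oriented: {feature-name -> double-array}.
  clojure.lang.IPersistentMap
  (row-count [xs] (alength ^doubles (val (first xs))))
  (feature-value [xs i feature]
    (if-let [col (get xs feature)]
      (aget ^doubles col i)
      0.0)))
```

Because protocols dispatch on the type of the first argument, this only works when each representation has its own type or interface; two representations that are both plain maps would need a wrapper type or a multimethod instead.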

lynaghk commented 10 years ago

The bulk of this project was written almost 4 years ago, and to fix this issue (and others) I'd need to do a total rewrite. A total rewrite is low on my priority list right now, and likely will be until:

I need to use this library again (unlikely anytime soon)
A client at my work can support a rewrite during a consulting engagement

@daslu unless the latter is a possibility, your best bet for reducing memory usage is to just call liblinear directly with their sparse vector classes.

daslu commented 10 years ago

Many thanks for your remarks, @lynaghk and @mikera. You are so right, protocols are more appropriate here.

I am testing some possible use of protocols at my code (not a large change). I will let you know if it turns out useful.

daslu commented 10 years ago

... at my code ... I meant, at my fork.

lynaghk / clj-liblinear

more efficient representation of xs #6