rjagerman / glint

Glint: High performance scala parameter server
MIT License

Can glint support discrete feature id? #45

Open cstur4 opened 8 years ago

cstur4 commented 8 years ago

I have millions of discrete 64-bit feature ids, and remapping them to continuous ids is expensive. Can Glint support that?

rjagerman commented 8 years ago

Although remapping is expensive, it's currently the only feasible thing to do with Glint since all matrices and vectors are stored as dense arrays.

One of my goals right now is to write a roadmap that outlines several things we want to implement in the near future. It includes sparse matrices and vectors. With sparsity it will be possible to create very large feature spaces with little/no memory overhead. However, because everything is indexed with Long values, it would still be limited to a size of 2^63.
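To sketch the idea (this is not Glint's actual API; `SparseVector`, `pull` and `push` are just placeholder names), a sparse, Long-indexed vector would only store entries for keys that have actually been written, so arbitrary 64-bit feature ids could be used directly without remapping:

```scala
import scala.collection.mutable

// Placeholder names, not Glint's API: a sparse vector keyed by arbitrary Long ids.
class SparseVector(default: Double = 0.0) {
  private val entries = mutable.LongMap.empty[Double]

  // Read a weight; unknown keys fall back to the default value.
  def pull(key: Long): Double = entries.getOrElse(key, default)

  // Accumulate a gradient update for a key.
  def push(key: Long, gradient: Double): Unit =
    entries(key) = entries.getOrElse(key, default) + gradient
}

object SparseVectorExample {
  def main(args: Array[String]): Unit = {
    val weights = new SparseVector()
    // Discrete / hashed 64-bit feature ids are used directly, no remapping needed.
    weights.push(-8765432109876543210L, 0.1)
    weights.push(42L, -0.05)
    println(weights.pull(42L)) // -0.05
  }
}
```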

cstur4 commented 8 years ago

How long will it take? I am eager to integrate Glint with YARN using Long-keyed features. I have a preliminary version, but it takes a long time on bigger data.

rjagerman commented 8 years ago

I can't give an accurate time estimate at the moment. In terms of importance, I prioritize implementing fault tolerance over other features right now, so it could be on the order of months before I get to sparsity.

And even then, I'm not sure how well this will work... Sparse data structures typically come with considerable memory overhead (one or more objects per key/value pair), which the JVM garbage collector unfortunately does not like. I'm considering using something like debox or scala-offheap to bypass this garbage collection problem, but both are rather experimental.
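For illustration only (this is neither debox nor scala-offheap, just a toy sketch), a map backed by primitive arrays keeps keys and values in flat arrays, so the garbage collector sees a handful of large objects instead of one entry object per key/value pair:

```scala
// Toy open-addressing map over primitive arrays (not debox / scala-offheap).
// Resizing is omitted for brevity, so it must stay well below `capacity` entries.
class PrimitiveLongDoubleMap(capacity: Int = 1 << 20) {
  require((capacity & (capacity - 1)) == 0, "capacity must be a power of two")
  private val keys   = new Array[Long](capacity)
  private val values = new Array[Double](capacity)
  private val used   = new Array[Boolean](capacity)

  // Linear probing: find the slot of `key`, or the first free slot on its probe path.
  private def slot(key: Long): Int = {
    var i = (key ^ (key >>> 32)).toInt & (capacity - 1)
    while (used(i) && keys(i) != key) i = (i + 1) & (capacity - 1)
    i
  }

  def update(key: Long, value: Double): Unit = {
    val i = slot(key)
    keys(i) = key; values(i) = value; used(i) = true
  }

  def getOrElse(key: Long, default: Double): Double = {
    val i = slot(key)
    if (used(i)) values(i) else default
  }
}
```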

cstur4 commented 8 years ago

I used debox in my algorithm, and it helps a lot. For now I transform the ids to continuous ones, and I look forward to getting your help with integrating Glint with Spark. Thanks a lot.

cstur4 commented 8 years ago

I implemented a key-value based partitioner to avoid remapping feature ids. A hash-based version may be more scalable for big data. I would be glad to send a pull request if there is a need; a rough sketch of the hash-based idea is below.
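The sketch below is placeholder code only (the names and the exact partitioning scheme are not the actual implementation): a hash-based partitioner spreads arbitrary 64-bit feature ids over servers instead of range-partitioning a contiguous [0, size) index space.

```scala
// Placeholder sketch, not the actual implementation from the proposed pull request.
trait KeyPartitioner {
  def partition(key: Long): Int
}

// Hash-based partitioner: maps arbitrary 64-bit feature ids to a server partition.
class HashPartitioner(numPartitions: Int) extends KeyPartitioner {
  override def partition(key: Long): Int = {
    val h = (key ^ (key >>> 32)).toInt      // fold the Long into an Int hash
    Math.floorMod(h, numPartitions)         // non-negative partition index
  }
}

object HashPartitionerExample extends App {
  val p = new HashPartitioner(numPartitions = 16)
  println(p.partition(-8765432109876543210L)) // some index in [0, 16)
  println(p.partition(42L))
}
```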

MLnick commented 8 years ago

@cstur4 I'd like to see that PR - for the sparse models I'm looking at, it would be very important for scalability.