cstur4 opened this issue 8 years ago
Although remapping is expensive, it is currently the only feasible approach with Glint, since all matrices and vectors are stored as dense arrays.
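To illustrate the remapping step being discussed, here is a minimal sketch using plain Spark; the name `rawFeatures` is illustrative and nothing here is part of Glint's API. It assigns each distinct feature id a contiguous Long index so the features can live in a dense array:

```scala
import org.apache.spark.rdd.RDD

// Assign every distinct (sparse) feature id a contiguous Long index.
// zipWithIndex runs a job over the whole data set, which is the expensive
// part for very large feature spaces.
def remapToContiguous(rawFeatures: RDD[Long]): RDD[(Long, Long)] =
  rawFeatures.distinct().zipWithIndex() // (original id, contiguous index)
```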
One of my goals right now is to write a roadmap that outlines several things we want to implement in the near future, including sparse matrices and vectors. With sparsity it will be possible to create very large feature spaces with little to no memory overhead. However, because everything is indexed with Long values, it would still be limited to a size of 2^63.
How long will it take? I am eager to integrate Glint with YARN using Long-keyed features. I have a preliminary version, but it takes a long time on bigger data.
I can't give an accurate time estimate at the moment. In terms of importance, I prioritize implementing fault tolerance over other features right now, so it could be on the order of months before I get to sparsity.
And even then, I'm not sure how well this will work... Sparse data structures typically come with considerable memory overhead (one or more objects per key/value pair), which the JVM garbage collector unfortunately does not like. I'm considering using something like debox or scala-offheap to bypass this garbage collection problem, but both are rather experimental.
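For context on why debox is attractive here, below is a minimal sketch of its primitive-specialized map as I understand the library: keys and values are kept in flat primitive arrays, so there is no boxed wrapper object per entry for the garbage collector to track. This only shows basic debox usage, not anything inside Glint:

```scala
import debox.{Map => DMap}

// Keys and values live in primitive arrays; no per-entry object allocation.
val weights = DMap.empty[Long, Double]
weights(42L) = 0.5              // insert without boxing the Long or Double
weights(123456789L) = 1.5
println(weights.get(42L))       // Some(0.5); missing keys return None
```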
I used debox in my algorithm, and it helps a lot. For now I transform ids to continuous ones, and I look forward to getting your help with integrating Glint with Spark. Thanks a lot.
I implemented a key-value based partitioner to avoid remapping feature ids. A hash-based version may scale better to big data (see the sketch below). I am glad to send a pull request if necessary.
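A rough sketch of the hash-based idea, assuming the goal is to route an arbitrary Long feature id directly to one of `numServers` partitions so that no global remapping pass is needed. This is not Glint's actual partitioner interface, just an illustration:

```scala
// Route a raw Long feature id to a partition by hashing instead of
// range-partitioning a contiguous index space.
final class HashKeyPartitioner(numServers: Int) {
  def partitionFor(key: Long): Int = {
    // Mix the bits first so clustered ids (e.g. hashed strings with common
    // prefixes) still spread evenly across servers.
    val mixed = key * 0x9E3779B97F4A7C15L
    Math.floorMod(mixed, numServers.toLong).toInt
  }
}

// Usage: keys go straight to a server, no zipWithIndex pass required.
val p = new HashKeyPartitioner(numServers = 8)
val server = p.partitionFor(-1234567890123L) // always in [0, 8)
```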
@cstur4 I'd like to see that PR - for the sparse models I'm looking at, it would be very important for scalability.
I have millions of discrete 64-byte features, and remapping them to continuous ids is expensive. Can Glint support that?