medallia / Word2VecJava

Word2Vec Java Port
MIT License

Makes sure we don't pull the whole corpus into memory when training #23

Open dirkgr opened 9 years ago

dirkgr commented 9 years ago

Explanation in the comments.
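(For readers skimming the thread: the patch is not shown here, but the general technique it describes can be sketched. The class below is a hypothetical illustration, not the actual diff: it exposes a corpus file as a re-iterable stream of lines, so each training pass reads lazily from disk instead of holding every sentence in memory. The name `StreamingCorpus` is invented for this sketch.)

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.stream.Stream;

/**
 * Hypothetical sketch (not the actual patch): expose the corpus as an
 * Iterable that re-opens the file on every call to iterator(), so each
 * training pass streams lines from disk rather than keeping the whole
 * corpus in a List in memory.
 */
public class StreamingCorpus implements Iterable<String> {
    private final Path corpusFile;

    public StreamingCorpus(Path corpusFile) {
        this.corpusFile = corpusFile;
    }

    @Override
    public Iterator<String> iterator() {
        try {
            // Files.lines reads lazily; only a buffered chunk is resident.
            Stream<String> lines = Files.lines(corpusFile);
            return lines.iterator();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because `iterator()` re-opens the file, the trainer can make multiple passes over the data while memory use stays roughly constant regardless of corpus size.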

dirkgr commented 9 years ago

Ping?

dirkgr commented 9 years ago

This might be a fix for #20.

Hronom commented 9 years ago

Any info about when this pull request will be accepted? This change lets me train on 2.4 GB of data...

Iakovenko-Oleksandr commented 9 years ago

The fix is really useful! It took us 70+ GB of RAM to train a model without it. Now it's only about 10 GB. I wonder why such an essential improvement hasn't been merged to master yet?

dirkgr commented 9 years ago

@wko27 had some concerns about the quality of the resulting vectors. @Hronom, @Iakovenko-Oleksandr, do you have any problems with your results?

Iakovenko-Oleksandr commented 9 years ago

What kind of problems? The results do feel different, but we still don't have any tools to evaluate the adequacy of the model... The closest vectors look more or less fine.
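(A minimal way to do the "closest vectors look fine" spot-check mentioned above is cosine similarity between word vectors. The helper below is a generic sketch, not part of this library's API; the class name `CosineCheck` is invented for illustration.)

```java
/**
 * Hypothetical sanity check (not part of the library's API): rank words
 * by cosine similarity to a query vector, to spot-check that nearest
 * neighbours still look reasonable after the streaming change.
 */
public class CosineCheck {
    /** Cosine similarity of two equal-length vectors, in [-1, 1]. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Comparing the top-N neighbours of a few common words before and after the patch is a cheap regression test when no formal evaluation set is available.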