Open menshikh-iv opened 7 years ago
@menshikh-iv Kaggle has a bunch of datasets too: https://www.kaggle.com/datasets?sortBy=relevance&group=all&search=text
The primary added value of the gensim API is being able to download and open the corpus directly and easily (not just the data, there are a lot of dataset storage repos already). Plus the pre-trained plug-n-play models of course.
So that's how we present this dataset feature -- that's what's different.
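For illustration, a minimal sketch of what "download and open directly" could look like from the user's side, assuming the `gensim.downloader` module that gensim ships for the gensim-data repository and assuming `text8` as the dataset name. The helper names `load_by_name` and `head` are hypothetical, introduced only for this example.

```python
from itertools import islice


def load_by_name(name):
    """Download (on first use) and open a gensim-data entry by its name."""
    # Imported lazily so that `head` below works even without gensim installed.
    import gensim.downloader as api

    # api.load fetches the entry into a local cache and returns it ready to use:
    # an iterable corpus for datasets, a loaded model for models.
    return api.load(name)


def head(corpus, n=3):
    """Return the first n documents of any iterable corpus as a list."""
    return list(islice(corpus, n))
```

Under these assumptions, `head(load_by_name("text8"), 1)` would download the corpus on the first call (network access required) and return its first document, with no manual download or unpacking step.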
@menshikh-iv So, the models should be available via a direct download link, right? What format should they follow? Gzipped Gensim pickle, word2vec text/binary files, or...?
@akutuzov for now, all data must be gzipped (.gz), whatever the original filetype (.txt, .bin, etc).
There is only one limitation: each model must be a single file (because we store all models here and don't store any links; this is generated automatically). If necessary, we can change this logic. It's simple: add a links section to lists.json and change the logic on the gensim side a little bit.
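As a rough sketch of the "one gzipped file" requirement above, this is one way a contributor might package a model file before submitting it. It uses only the Python standard library. The function name `gzip_single_file` and the choice to append `.gz` to the original file name are assumptions for illustration, not a naming rule from the gensim-data repository.

```python
import gzip
import shutil
from pathlib import Path


def gzip_single_file(src, dst=None):
    """Compress one file (of any type) into a single .gz archive.

    Returns the path of the archive. By default the archive sits next to
    the source file, named like the source with ".gz" appended.
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".gz")
    # Stream the bytes so that large model files are never fully loaded in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gzip_single_file("model.bin")` would write `model.bin.gz` alongside the original and leave the original untouched.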
@akutuzov we can host the models, and they can be pretty much anything / anywhere. I'd say the technical side is secondary.
The primary concerns/requirements are:
The next gensim release will contain a new data/model storage feature. For this reason, we want to add models/datasets to make our users' lives simpler :cat2:. The data repository is https://github.com/RaRe-Technologies/gensim-data.
This issue consolidates all previous issues about the dataset/model list: https://github.com/RaRe-Technologies/gensim/issues/1453, https://github.com/RaRe-Technologies/gensim/issues/717, https://github.com/RaRe-Technologies/gensim/issues/746.
If you want to help, follow the instructions at https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model
If you want to see which models are already available, look at https://github.com/RaRe-Technologies/gensim-data#available-data
What would be nice to add:
docs/notebooks + update notebooks (using the new API)