Add more datasets/models to gensim-data

menshikh-iv commented 7 years ago

The next gensim release will contain a new data/model storage feature. For this reason, we want to add models/datasets to make our users' lives simpler :cat2:. The data repository is https://github.com/RaRe-Technologies/gensim-data.

This issue consolidates all previous issues about the dataset/model list: https://github.com/RaRe-Technologies/gensim/issues/1453, https://github.com/RaRe-Technologies/gensim/issues/717, https://github.com/RaRe-Technologies/gensim/issues/746.

If you want to help, follow the instructions at https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model

If you want to see which models are already available, look at https://github.com/RaRe-Technologies/gensim-data#available-data

What would be nice to add:

[ ] All datasets used in docs/notebooks + update notebooks (using new API)
[ ] Trained models available in gensim (HARD): typically a small model (toy example, English only) and a big model (based on Wikipedia) for different languages (en, de, ru, es)
[ ] Word vectors (English and Russian), thanks @akutuzov
[ ] NSF Research Award Abstracts 1990-2003, thanks @macks22
[ ] Reuters-21578 Text Categorization Collection
[ ] Reuters Corpus Volume I (RCV1) v2
[ ] WebKB -- original here
[ ] USPTO Patent Grant Full Text subsets
[ ] PubMed corpus, thanks @piskvorky
[ ] USPTO patents

piskvorky commented 7 years ago

@menshikh-iv Kaggle has a bunch of datasets too: https://www.kaggle.com/datasets?sortBy=relevance&group=all&search=text

The primary added value of the gensim API is being able to download and open the corpus directly and easily (not just the data, there are a lot of dataset storage repos already). Plus the pre-trained plug-n-play models of course.

So that's how we present this dataset feature -- that's what's different.

akutuzov commented 7 years ago

@menshikh-iv So, the models should be available via a direct download link, right? What format should they follow? Gzipped Gensim pickle, word2vec text/binary files, or...?

menshikh-iv commented 7 years ago

@akutuzov for now, all data must be gzipped (.gz); the underlying file type can be anything (.txt, .bin, etc). The only limitation is that each model must be a single file (because we store all models here and don't store any links; this is generated automatically). If necessary, we can change this logic; it's simple (add a links section to lists.json and change the logic on the gensim side a little bit).
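
A minimal sketch of the "one gzipped file, any file type" requirement, using only the Python standard library. The helper name `gzip_single_file` and the example file name are hypothetical; they are not part of gensim or gensim-data.

```python
import gzip
import shutil
from pathlib import Path


def gzip_single_file(src, dst=None):
    """Compress one file (of any type) into a single .gz archive.

    Hypothetical helper for illustration. By default the output is
    written next to the source with ".gz" appended to the file name.
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".gz")
    # Stream the bytes so large model files are never fully loaded into memory.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gzip_single_file("vectors.txt")` would produce `vectors.txt.gz` in the same directory.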

piskvorky commented 7 years ago

@akutuzov we can host the models, and they can be pretty much anything / anywhere. I'd say the technical side is secondary.

The primary concerns/requirements are:

dataset must be relevant to gensim (unsupervised text analysis, topic modeling, embeddings etc)
the proposal must contain a clear use case: 1) concrete code for loading up the dataset in Python, 2) using it for some task that the users understand (otherwise the dataset has no impact, nobody knows what to do with it -- we don't want to be a repository of random data)
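
As a rough illustration of such a use case, here is a sketch that loads a corpus and runs a task anyone can follow (counting the most frequent tokens). It assumes the `gensim.downloader` module as shipped in later gensim releases and the dataset name `text8`, neither of which is named in this thread; `top_tokens` and `load_corpus` are hypothetical helper names.

```python
from collections import Counter


def top_tokens(corpus, n=10):
    """Return the n most frequent tokens in an iterable of token lists."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc)
    return counts.most_common(n)


def load_corpus(name):
    """Download (on first use) and open a corpus by name.

    Assumption: gensim is installed and provides gensim.downloader.
    The import is lazy so top_tokens works without gensim.
    """
    import gensim.downloader as api
    return api.load(name)


# Usage (left commented out because it downloads data on first call):
# corpus = load_corpus("text8")   # "text8" is an assumed dataset name
# print(top_tokens(corpus, n=5))
```

The point of the sketch is the shape of a proposal: one line to load the data, a few lines to do something recognizable with it.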

Add more datasets/models to gensim-data #1717