piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Add more datasets/models to gensim-data #1717

Open menshikh-iv opened 7 years ago

menshikh-iv commented 7 years ago

The next gensim release will contain a new data/model storage feature. For this reason, we want to add models/datasets to make our users' lives simpler :cat2:. The data repository is https://github.com/RaRe-Technologies/gensim-data.

This issue consolidates all previous issues about the dataset/model list: https://github.com/RaRe-Technologies/gensim/issues/1453, https://github.com/RaRe-Technologies/gensim/issues/717, https://github.com/RaRe-Technologies/gensim/issues/746.

If you want to help, follow the instructions at https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model

If you want to see which models are already available, see https://github.com/RaRe-Technologies/gensim-data#available-data

What would be nice to add:

piskvorky commented 7 years ago

@menshikh-iv Kaggle has a bunch of datasets too: https://www.kaggle.com/datasets?sortBy=relevance&group=all&search=text

The primary added value of the gensim API is being able to download and open the corpus directly and easily (not just to store the data; there are a lot of dataset storage repos already). Plus the pre-trained plug-n-play models, of course.

So that's how we present this dataset feature -- that's what's different.
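
As a rough illustration of "download and open directly": in released gensim versions this feature is exposed as the `gensim.downloader` module, and the sketch below assumes that module is installed and that the `text8` entry is still listed in gensim-data. The wrapper `open_dataset` and its `loader` parameter are hypothetical names introduced here only so the example can be exercised without network access.

```python
from typing import Any, Callable, Optional


def open_dataset(name: str, loader: Optional[Callable[[str], Any]] = None) -> Any:
    """Fetch a gensim-data entry by name and return it ready to use.

    `loader` is a hypothetical hook for this sketch; by default it falls
    back to gensim.downloader.load, which downloads the entry on first use
    and reads it from the local cache afterwards.
    """
    if loader is None:
        # Imported lazily so the function can be defined without gensim installed.
        import gensim.downloader as api
        loader = api.load
    return loader(name)


# Typical use (requires gensim and network access on the first call):
#   corpus = open_dataset("text8")
#   for tokens in corpus:
#       ...  # each item is a list of tokens
```

The point of the design is that the caller names a dataset or model and gets a usable object back, rather than a file to unpack and parse.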

akutuzov commented 7 years ago

@menshikh-iv So, the models should be available via a direct download link, right? What format should they follow? Gzipped Gensim pickle, word2vec text/binary files, or...?

menshikh-iv commented 7 years ago

@akutuzov for now, all data must be gzipped (.gz); the underlying file type can be anything (.txt, .bin, etc.). Only one limitation exists: each model or dataset must be a single file (because we store all models here and don't store any links; this is generated automatically). If necessary, we can change this logic; it's simple (add a links section to lists.json and change the logic on the gensim side a little bit).
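
For contributors, the "single .gz file" requirement could be met with something like the sketch below. It uses only the Python standard library; the helper name `gzip_single_file` and the file name `vectors.txt` are illustrative assumptions, not part of gensim or gensim-data.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gzip_single_file(src: Path, dst: Optional[Path] = None) -> Path:
    """Compress one file (any type: .txt, .bin, ...) into a single .gz archive.

    If `dst` is omitted, ".gz" is appended to the source file name.
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_name(src.name + ".gz")
    # Stream in chunks so large model files never have to fit in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Example (hypothetical file name):
#   gzip_single_file(Path("vectors.txt"))  # writes vectors.txt.gz next to it
```

A model that consists of several files would not fit this scheme as described; it would need the change to the storage logic mentioned above.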

piskvorky commented 7 years ago

@akutuzov we can host the models, and they can be pretty much anything / anywhere. I'd say the technical side is secondary.

The primary concerns/requirements are: