piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Update descriptions for data #4

Closed: menshikh-iv closed this issue 6 years ago

menshikh-iv commented 6 years ago

What's done:

piskvorky commented 6 years ago

Missing dataset facts:

Dataset parameters like the number of records or file size should go into the description, or even better, into a dedicated list.json field so we can display these parameters to users. It's important to know how large a dataset is before trying to download it.
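A rough sketch of what such a machine-readable entry could look like (field names like "num_records" and "file_size" are placeholders here, not a decided schema):

```python
import json

# Hypothetical list.json entry illustrating the size/record fields proposed
# above; the field names are placeholders, not a final schema.
entry = {
    "20-newsgroups": {
        "num_records": 18846,
        "file_size": 14483581,  # bytes, ~14 MB
        "license": "not found",
        "description": "18846 posts from 20 Usenet newsgroups",
    }
}

print(json.dumps(entry, indent=4))
```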

menshikh-iv commented 6 years ago

@piskvorky A domain-specific dataset isn't a missing fact; if someone suggests a domain-specific dataset, I'll add it. About parameters: this field is only for models (it doesn't exist for datasets, for obvious reasons). About "Extracted Wikipedia dump from October 2017": read the next sentence, the extraction process is described there. About the other items: agreed, let me create a table with the needed information.

menshikh-iv commented 6 years ago

Datasets

20-newsgroups

the number of rows: 18846
size: 14483581 (~14 MB)
row format: dict
links: http://qwone.com/~jason/20Newsgroups/
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/__init__.py
license: not found
fields:

fake-news

the number of rows: 12999
size: 20102776 (~20 MB)
row format: dict
links: https://www.kaggle.com/mrisdal/fake-news
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py
license: https://creativecommons.org/publicdomain/zero/1.0/
fields:

text8

the number of rows: 1701
size: 33182058 (~32 MB)
row format: list of str (tokens)
links: http://mattmahoney.net/dc/textdata.html
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py
license: not found
description: I'm not sure what to write here; the original link gives only minimal information about text8.

quora-duplicate-questions

the number of rows: 404290
size: 21684784 (~21 MB)
row format: dict
links: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/quora-duplicate-questions/__init__.py
license: probably https://www.quora.com/about/tos (the same terms as for the whole Quora site)
fields:

wiki-english-20171001

the number of rows: 4924894
size: 6516051717 (~6.1 GB)
row format: dict
links: https://dumps.wikimedia.org/enwiki/20171001/
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py
license: https://dumps.wikimedia.org/legal.html
fields:
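
For reference, any dataset from this list can be inspected and streamed with the downloader API, roughly like this (a minimal sketch, assuming the gensim.downloader module described in the blog post linked above):

```python
import gensim.downloader as api

# Show the metadata being assembled here (record counts, sizes, licenses).
print(api.info("text8"))

# Download (if not cached) and stream the corpus; each item is one record
# in the "row format" listed above -- for text8, a list of str tokens.
corpus = api.load("text8")
for i, tokens in enumerate(corpus):
    print(len(tokens))
    if i == 2:
        break
```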

Models

glove-wiki-gigaword-50

type: glove
number of vectors: 400000
size: 69182535 (~66 MB)
dimensions: 50
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-wiki-gigaword-100

type: glove
number of vectors: 400000
size: 134300434 (~129 MB)
dimensions: 100
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-100/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-wiki-gigaword-200

type: glove
number of vectors: 400000
size: 264336934 (~253 MB)
dimensions: 200
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-200/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-wiki-gigaword-300

type: glove
number of vectors: 400000
size: 394362229 (~377 MB)
dimensions: 300
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-300/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-twitter-25

type: glove
number of vectors: 1193514
size: 109885004 (~105 MB)
dimensions: 25
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-25/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

glove-twitter-50

type: glove
number of vectors: 1193514
size: 209216938 (~200 MB)
dimensions: 50
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-50/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

glove-twitter-100

type: glove
number of vectors: 1193514
size: 405932991 (~388 MB)
dimensions: 100
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-100/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

glove-twitter-200

type: glove
number of vectors: 1193514
size: 795373100 (~759 MB)
dimensions: 200
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-200/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

word2vec-google-news-300

type: word2vec
number of vectors: 3000000
size: 1743563840 (~1.7 GB)
dimensions: 300
links: https://code.google.com/archive/p/word2vec/
papers: https://arxiv.org/abs/1301.3781, https://arxiv.org/abs/1310.4546, https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/__init__.py
license: not found
base_dataset: Google News (about 100 billion words)
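
As a quick usage reference, any of these models can be pulled down and used directly (a sketch, again assuming the gensim.downloader API):

```python
import gensim.downloader as api

# Download (if not cached) the smallest GloVe model above (~66 MB) and get
# back pretrained word vectors (a KeyedVectors instance).
vectors = api.load("glove-wiki-gigaword-50")

print(vectors["language"][:5])                   # first 5 of 50 dimensions
print(vectors.most_similar("language", topn=3))  # nearest neighbours
```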

piskvorky commented 6 years ago

14Mb => 14MB (unless you really mean megabits instead of megabytes, which doesn't sound like a good idea)

I don't see any fundamental difference between corpora and models; both should have parameters IMO (both are created using specific parameter choices).

For "license: not found": that means we have no rights whatsoever (by default). Can you ask the authors for permission?

menshikh-iv commented 6 years ago

@piskvorky Everybody uses and shares 20-newsgroups and text8 (they are very popular small text datasets), so I think we can too (no need to ask in this concrete case, though in general I agree we should ask).

piskvorky commented 6 years ago

OK.

One more request: please add a link to the "reader" implementation (link to the concrete github source code) to the metadata, as a new field.

It will serve both as an "example" to make the format clearer (so we don't have to describe it ourselves -- Python is "executable pseudocode") and as a reminder that we always want a reader implementation.
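
For illustration only, such a reader could look something like the following (a hypothetical sketch; the load_data() entry point, file name, and directory layout are assumptions, not the actual gensim-data contract):

```python
# __init__.py -- hypothetical minimal reader for a line-oriented JSON dataset.
# The load_data() entry point and the file layout below are assumptions made
# for illustration.
import json
import os

base_dir = os.path.expanduser("~/gensim-data/fake-news")


def load_data():
    """Yield one record (dict) per line of the downloaded .jsonl file."""
    path = os.path.join(base_dir, "fake-news.jsonl")
    with open(path, "r", encoding="utf-8") as infile:
        for line in infile:
            yield json.loads(line)
```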

piskvorky commented 6 years ago

Looks good, thanks!

Some of the readers have broken formatting (trailing whitespace, extra newlines at the end). Plus run a code style check (same guidelines as gensim).

And let's add a very clear and prominent warning to the README that each dataset comes with its own original license, which users should study before using the dataset. We are not the copyright holders and are not responsible for any potential license breaches by users.