piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Update descriptions for data #4

Closed: menshikh-iv closed this issue 6 years ago

menshikh-iv commented 6 years ago

What's done:

piskvorky commented 6 years ago

Missing dataset facts:

Dataset parameters like the number of records or file size should go into the description, or even better, into a dedicated list.json field so we can display these parameters to users. It's important to know how large a dataset is before trying to download it.
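A rough sketch of what such a machine-readable entry could look like (field names like "num_records" and "file_size" are placeholders here, not a decided schema):

```python
import json

# Hypothetical list.json entry illustrating the size/record fields proposed
# above; the field names are placeholders, not a final schema.
entry = {
    "20-newsgroups": {
        "num_records": 18846,
        "file_size": 14483581,  # bytes, ~14 MB
        "license": "not found",
        "description": "18846 posts from 20 Usenet newsgroups",
    }
}

print(json.dumps(entry, indent=4))
```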

menshikh-iv commented 6 years ago

@piskvorky A domain-specific dataset isn't a missing fact; if someone suggests a domain-specific dataset, I'll add it. About parameters: this field is only for models (it doesn't exist for datasets, for obvious reasons). About "Extracted Wikipedia dump from October 2017": read the next sentence, the extraction process is described there. About the other items: agreed, let me create a table with the needed information.

menshikh-iv commented 6 years ago

Datasets

20-newsgroups

the number of rows: 18846
size: 14483581 (~14 MB)
row format: dict
links: http://qwone.com/~jason/20Newsgroups/
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/__init__.py
license: not found
fields:

fake-news

the number of rows: 12999
size: 20102776 (~20 MB)
row format: dict
links: https://www.kaggle.com/mrisdal/fake-news
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py
license: https://creativecommons.org/publicdomain/zero/1.0/
fields:

text8

the number of rows: 1701
size: 33182058 (~32 MB)
row format: list of str (tokens)
links: http://mattmahoney.net/dc/textdata.html
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py
license: not found
description: I'm not sure what to write here; the original link gives only minimal information about text8.

quora-duplicate-questions

the number of rows: 404290
size: 21684784 (~21 MB)
row format: dict
links: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/quora-duplicate-questions/__init__.py
license: probably https://www.quora.com/about/tos (the same terms as for the whole Quora site)
fields:

wiki-english-20171001

the number of rows: 4924894
size: 6516051717 (~6.1 GB)
row format: dict
links: https://dumps.wikimedia.org/enwiki/20171001/
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py
license: https://dumps.wikimedia.org/legal.html
fields:
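
For reference, any dataset from this list can be inspected and streamed with the downloader API, roughly like this (a minimal sketch, assuming the gensim.downloader module described in the blog post linked above):

```python
import gensim.downloader as api

# Show the metadata being assembled here (record counts, sizes, licenses).
print(api.info("text8"))

# Download (if not cached) and stream the corpus; each item is one record
# in the "row format" listed above -- for text8, a list of str tokens.
corpus = api.load("text8")
for i, tokens in enumerate(corpus):
    print(len(tokens))
    if i == 2:
        break
```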

Models

glove-wiki-gigaword-50

type: glove
number of vectors: 400000
size: 69182535 (~66 MB)
dimensions: 50
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-wiki-gigaword-100

type: glove
number of vectors: 400000
size: 134300434 (~129 MB)
dimensions: 100
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-100/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-wiki-gigaword-200

type: glove
number of vectors: 400000
size: 264336934 (~253 MB)
dimensions: 200
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-200/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-wiki-gigaword-300

type: glove
number of vectors: 400000
size: 394362229 (~377 MB)
dimensions: 300
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-300/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

glove-twitter-25

type: glove
number of vectors: 1193514
size: 109885004 (~105 MB)
dimensions: 25
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-25/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

glove-twitter-50

type: glove
number of vectors: 1193514
size: 209216938 (~200 MB)
dimensions: 50
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-50/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

glove-twitter-100

type: glove
number of vectors: 1193514
size: 405932991 (~388 MB)
dimensions: 100
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-100/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

glove-twitter-200

type: glove
number of vectors: 1193514
size: 795373100 (~759 MB)
dimensions: 200
links: https://nlp.stanford.edu/projects/glove/
papers: https://nlp.stanford.edu/pubs/glove.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-200/__init__.py
license: http://opendatacommons.org/licenses/pddl/
base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

word2vec-google-news-300

type: word2vec
number of vectors: 3000000
size: 1743563840 (~1.7 GB)
dimensions: 300
links: https://code.google.com/archive/p/word2vec/
papers: https://arxiv.org/abs/1301.3781, https://arxiv.org/abs/1310.4546, https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf
reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/__init__.py
license: not found
base_dataset: Google News (about 100 billion words)
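
As a quick usage reference, any of these models can be pulled down and used directly (a sketch, again assuming the gensim.downloader API):

```python
import gensim.downloader as api

# Download (if not cached) the smallest GloVe model above (~66 MB) and get
# back pretrained word vectors (a KeyedVectors instance).
vectors = api.load("glove-wiki-gigaword-50")

print(vectors["language"][:5])                   # first 5 of 50 dimensions
print(vectors.most_similar("language", topn=3))  # nearest neighbours
```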

piskvorky commented 6 years ago

14Mb => 14MB (unless you really mean megabits instead of megabytes, which doesn't sound like a good idea)

I don't see any fundamental difference between corpora and models; both should have parameters IMO (both are created using specific parameter choices).

For "license: not found": that means we have no rights whatsoever (by default). Can you ask the authors for permission?

menshikh-iv commented 6 years ago

@piskvorky Everybody uses and shares 20-newsgroups and text8 (they are very popular small text datasets), so I think we can too (no need to ask in this concrete case, though in general I agree we should ask).

piskvorky commented 6 years ago

OK.

One more request: please add a link to the "reader" implementation (link to the concrete github source code) to the metadata, as a new field.

It will serve both as an "example" to make the format clearer (so we don't have to describe it ourselves -- Python is "executable pseudocode") and as a reminder that we always want a reader implementation.
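
For illustration only, such a reader could look something like the following (a hypothetical sketch; the load_data() entry point, file name, and directory layout are assumptions, not the actual gensim-data contract):

```python
# __init__.py -- hypothetical minimal reader for a line-oriented JSON dataset.
# The load_data() entry point and the file layout below are assumptions made
# for illustration.
import json
import os

base_dir = os.path.expanduser("~/gensim-data/fake-news")


def load_data():
    """Yield one record (dict) per line of the downloaded .jsonl file."""
    path = os.path.join(base_dir, "fake-news.jsonl")
    with open(path, "r", encoding="utf-8") as infile:
        for line in infile:
            yield json.loads(line)
```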

piskvorky commented 6 years ago

Looks good, thanks!

Some of the readers have broken formatting (trailing whitespace, extra newlines at the end). Plus run a code style check (same guidelines as gensim).

And let's add a very clear and prominent warning to the README that each dataset comes with its own original license, which users should study before using the dataset. We are not the copyright holders and are not responsible for any potential license breaches by users.