Closed: menshikh-iv closed this issue 6 years ago
Missing dataset facts:

- The size of each dataset, exact or approximate (e.g. for `quora-duplicate-questions.gz`).
- Meaningful descriptions: `Cleaned small sample from Wikipedia.` is useless -- which Wikipedia? Date, language? Cleaned how? Why small? Same with `Extracted Wikipedia dump from October 2017` -- extracted what? Why would I use this? How?
- The `parameters` field: some datasets have it (GloVe), others not (text8, wikipedia). Dataset parameters like the number of records or file size should go into the description, or even better, into a clear field in the json, so we can display these parameters to users. It's important to know how large a dataset is before trying to download it.
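For illustration, such a machine-readable entry might look like the sketch below. The field names here are assumptions, not a final schema; the numbers are the quora-duplicate-questions figures quoted elsewhere in this thread:

```python
import json

# Hypothetical metadata entry -- field names are illustrative, not a final schema.
entry = {
    "quora-duplicate-questions": {
        "num_records": 404290,  # number of rows
        "file_size": 21684784,  # bytes (~21 MB), known *before* downloading
        "description": "Over 400,000 Quora question pairs, labelled as duplicate or not.",
        "license": "https://www.quora.com/about/tos",
    }
}

# A UI or CLI can display this without fetching the dataset itself.
print(json.dumps(entry["quora-duplicate-questions"], indent=2))
```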
@piskvorky A domain-specific dataset isn't a "missing fact"; if someone suggests a domain dataset, I'll add it.
About `parameters`: this field is only for models (it doesn't exist for datasets, for obvious reasons).
About `Extracted Wikipedia dump from October 2017`: read the next sentence, the extraction is described there.
About the other items: agreed, let me create a table with the needed information.
**20-newsgroups**
- number of rows: 18846
- size: 14483581 (~14 MB)
- row format: dict
- links: http://qwone.com/~jason/20Newsgroups/
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/__init__.py
- license: not found
- fields:

**fake-news**
- number of rows: 12999
- size: 20102776 (~20 MB)
- row format: dict
- links: https://www.kaggle.com/mrisdal/fake-news
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py
- license: https://creativecommons.org/publicdomain/zero/1.0/
- fields:

**text8**
- number of rows: 1701
- size: 33182058 (~32 MB)
- row format: list of str (tokens)
- links: http://mattmahoney.net/dc/textdata.html
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py
- license: not found
- description: I have no idea what to write here; the original link gives only minimal information about text8.

**quora-duplicate-questions**
- number of rows: 404290
- size: 21684784 (~21 MB)
- row format: dict
- links: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/quora-duplicate-questions/__init__.py
- license: probably https://www.quora.com/about/tos (same as for the whole Quora site)
- fields:

**wiki-english-20171001**
- number of rows: 4924894
- size: 6516051717 (~6.1 GB)
- row format: dict
- links: https://dumps.wikimedia.org/enwiki/20171001/
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py
- license: https://dumps.wikimedia.org/legal.html
- fields:
**glove-wiki-gigaword-50**
- type: glove
- number of vectors: 400000
- size: 69182535 (~66 MB)
- dimensions: 50
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

**glove-wiki-gigaword-100**
- type: glove
- number of vectors: 400000
- size: 134300434 (~129 MB)
- dimensions: 100
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-100/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

**glove-wiki-gigaword-200**
- type: glove
- number of vectors: 400000
- size: 264336934 (~253 MB)
- dimensions: 200
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-200/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

**glove-wiki-gigaword-300**
- type: glove
- number of vectors: 400000
- size: 394362229 (~377 MB)
- dimensions: 300
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-300/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)

**glove-twitter-25**
- type: glove
- number of vectors: 1193514
- size: 109885004 (~105 MB)
- dimensions: 25
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-25/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

**glove-twitter-50**
- type: glove
- number of vectors: 1193514
- size: 209216938 (~200 MB)
- dimensions: 50
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-50/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

**glove-twitter-100**
- type: glove
- number of vectors: 1193514
- size: 405932991 (~388 MB)
- dimensions: 100
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-100/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

**glove-twitter-200**
- type: glove
- number of vectors: 1193514
- size: 795373100 (~759 MB)
- dimensions: 200
- links: https://nlp.stanford.edu/projects/glove/
- papers: https://nlp.stanford.edu/pubs/glove.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-twitter-200/__init__.py
- license: http://opendatacommons.org/licenses/pddl/
- base_dataset: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)

**word2vec-google-news-300**
- type: word2vec
- number of vectors: 3000000
- size: 1743563840 (~1.7 GB)
- dimensions: 300
- links: https://code.google.com/archive/p/word2vec/
- papers: https://arxiv.org/abs/1301.3781, https://arxiv.org/abs/1310.4546, https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf
- reader: https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/__init__.py
- license: not found
- base_dataset: Google News (about 100 billion words)
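These sizes are exactly what a user needs in order to decide what to download. As a hypothetical sketch (the helper below is not part of gensim-data), picking the largest GloVe Wikipedia/Gigaword model that fits a download budget:

```python
# Download sizes in bytes, copied from the glove-wiki-gigaword-* metadata above.
GLOVE_WIKI_SIZES = {
    50: 69182535,
    100: 134300434,
    200: 264336934,
    300: 394362229,
}

def largest_fitting(sizes, budget_bytes):
    """Return the highest dimensionality whose download fits the budget, or None."""
    fitting = [dims for dims, size in sizes.items() if size <= budget_bytes]
    return max(fitting) if fitting else None

print(largest_fitting(GLOVE_WIKI_SIZES, 300 * 1024 ** 2))  # prints 200
```

With the sizes exposed in the metadata, this kind of decision can happen before any bytes are transferred.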
`14Mb` => `14MB` (unless you really mean megabits instead of megabytes, which doesn't sound like a good idea).
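The distinction matters when rendering raw byte counts for users. A small helper (hypothetical, not something in the repo) producing the `~14MB`-style figures used above:

```python
def human_size(num_bytes):
    """Format a byte count with binary prefixes -- MB meaning megabytes, not megabits."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return "~%.0f%s" % (size, unit)
        size /= 1024

print(human_size(14483581))  # the 20-newsgroups archive: ~14MB
```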
I don't see any fundamental difference between corpora and models; both should have `parameters` IMO (both are created using specific parameter choices).
For `license: not found`: that means we have no rights whatsoever (by default). Can you ask the authors for permission?
@piskvorky Everyone uses & shares 20-newsgroups and text8 (they're very popular small text datasets), so I think we can too (no need to ask in this concrete case, but in general I agree we should ask).
OK.
One more request: please add a link to the "reader" implementation (link to the concrete github source code) to the metadata, as a new field.
It will serve both as an "example" to make the format clearer (so we don't have to describe it ourselves -- Python is "executable pseudocode"), and as a reminder that we always want a reader implementation.
Looks good, thanks!
Some of the readers have broken formatting (trailing whitespace, extra newlines at the end). Please also run a code style check (same guidelines as gensim).
And let's add a very clear and prominent warning into the README that each dataset comes with its own original license, which the users should study before using the dataset. We are not the copyright holders, and are not responsible for any potential license breaches by users.
What's done: