piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Fill-up release notes #14

Closed menshikh-iv closed 6 years ago

menshikh-iv commented 6 years ago

We need to fill in the release notes for all data (and fix the current descriptions). Motivation: each dataset should have its own distinct description (the table in the README isn't readable), and we need a good link to the data for promotion reasons.

Good examples:

@chaitaliSaini can you do this?

chaitaliSaini commented 6 years ago

Yes, I'll do it.

chaitaliSaini commented 6 years ago

Incomplete datasets/models:

  1. 20-newsgroups (no example)
    import gensim.downloader as api
    import json
    newsgroups_dataset = api.load("20-newsgroups")
    # Print only the first document, then stop iterating.
    for doc in newsgroups_dataset:
        print(json.dumps(doc, indent=4))
        break
    """
    Output:
    {
    "set": "train",
    "data": "From: db7n+@andrew.cmu.edu (D. Andrew Byler)\nSubject: Re: Serbian genocide Work of God?\nOrganization: Freshman, Civil Engineering, Carnegie Mellon, Pittsburgh, PA\nLines: 61\n\nVera Shanti Noyes writes;\n\n>this is what indicates to me that you may believe in predestination.\n>am i correct?  i do not believe in predestination -- i believe we all\n>choose whether or not we will accept God's gift of salvation to us.\n>again, fundamental difference which can't really be resolved.\n\nOf course I believe in Predestination.  It's a very biblical doctrine as\nRomans 8.28-30 shows (among other passages).  Furthermore, the Church\nhas always taught predestination, from the very beginning.  But to say\nthat I believe in Predestination does not mean I do not believe in free\nwill.  Men freely choose the course of their life, which is also\naffected by the grace of God.  However, unlike the Calvinists and\nJansenists, I hold that grace is resistable, otherwise you end up with\nthe idiocy of denying the universal saving will of God (1 Timothy 2.4). \nFor God must give enough grace to all to be saved.  But only the elect,\nwho he foreknew, are predestined and receive the grace of final\nperserverance, which guarantees heaven.  This does not mean that those\nwithout that grace can't be saved, it just means that god foreknew their\nobstinacy and chose not to give it to them, knowing they would not need\nit, as they had freely chosen hell.\n\t\t\t\t\t\t\t  ^^^^^^^^^^^\nPeople who are saved are saved by the grace of God, and not by their own\neffort, for it was God who disposed them to Himself, and predestined\nthem to become saints.  But those who perish in everlasting fire perish\nbecause they hardened their heart and chose to perish.  Thus, they were\ndeserving of God;s punishment, as they had rejected their Creator, and\nsinned against the working of the Holy Spirit.\n\n>yes, it is up to God to judge.  
but he will only mete out that\n>punishment at the last judgement. \n\nWell, I would hold that as God most certainly gives everybody some\nblessing for what good they have done (even if it was only a little),\nfor those He can't bless in the next life, He blesses in this one.  And\nthose He will not punish in the next life, will be chastised in this one\nor in Purgatory for their sins.  Every sin incurs some temporal\npunishment, thus, God will punish it unless satisfaction is made for it\n(cf. 2 Samuel 12.13-14, David's sin of Adultery and Murder were\nforgiven, but he was still punished with the death of his child.)  And I\nneed not point out the idea of punishment because of God's judgement is\nquite prevelant in the Bible.  Sodom and Gommorrah, Moses barred from\nthe Holy Land, the slaughter of the Cannanites, Annias and Saphira,\nJerusalem in 70 AD, etc.\n\n> if jesus stopped the stoning of an adulterous woman (perhaps this is\nnot a >good parallel, but i'm going to go with it anyway), why should we\nnot >stop the murder and violation of people who may (or may not) be more\n>innocent?\n\nWe should stop the slaughter of the innocent (cf Proverbs 24.11-12), but\ndoes that mean that Christians should support a war in Bosnia with the\nU.S. or even the U.N. involved?  I do not think so, but I am an\nisolationist, and disagree with foreign adventures in general.  But in\nthe case of Bosnia, I frankly see no excuse for us getting militarily\ninvolved, it would not be a \"just war.\"  \"Blessed\" after all, \"are the\npeacemakers\" was what Our Lord said, not the interventionists.  Our\nactions in Bosnia must be for peace, and not for a war which is\nunrelated to anything to justify it for us.\n\nAndy Byler\n",
    "id": "21408",
    "topic": "soc.religion.christian"
    }
    """
  2. fake-news (no example)
    import gensim.downloader as api
    import json
    fake_news = api.load("fake-news")
    # Print only the first document, then stop iterating.
    for doc in fake_news:
        print(json.dumps(doc, indent=4))
        break

    """
    Output:
    {
        "comments": "0",
        "title": "Muslims BUSTED: They Stole Millions In Gov\u2019t Benefits",
        "published": "2016-10-26T21:41:00.000+03:00",
        "site_url": "100percentfedup.com",
        "language": "english",
        "text": "Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again \u2026another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe\u2019ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system\u2026It\u2019s way out of control! More Related",
        "domain_rank": "25689",
        "crawled": "2016-10-27T01:49:27.168+03:00",
        "type": "bias",
        "likes": "0",
        "shares": "0",
        "spam_score": "0",
        "country": "US",
        "author": "Barracuda Brigade",
        "participants_count": "1",
        "ord_in_thread": "0",
        "thread_title": "Muslims BUSTED: They Stole Millions In Gov\u2019t Benefits",
        "uuid": "6a175f46bcd24d39b3e962ad0f29936721db70db",
        "main_img_url": "http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10262016-83501-AM.bmp.jpg",
        "replies_count": "0"
    }
    """

  3. quora-duplicate-questions (no example + desc)

Feature | Description
------------ | -------------
Dataset | Quora's question pair dataset
Dataset description | The dataset contains 400,000 question pairs. Each pair has an id for each of its two questions, the full text of both questions, and a binary value indicating whether the pair is a duplicate.
Link to dataset page | [First Quora Dataset: Question Pairs](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs)
Link to dataset | [Download dataset](http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv)

```python
import gensim.downloader as api
import json
quora_duplicate_ques_dataset = api.load("quora-duplicate-questions")
for question_pair in quora_duplicate_ques_dataset:
    print(json.dumps(question_pair, indent=4))
    break
"""
Output:
{
    "qid1": "1",
    "question2": "What is the step by step guide to invest in share market?",
    "qid2": "2",
    "is_duplicate": "0",
    "question1": "What is the step by step guide to invest in share market in india?",
    "id": "0"
}
"""
```

  4. wiki-english-20171001 (no example) (can't download with my internet connection)
  5. word2vec-google-news-300 (no example) (I don't have enough RAM to process it)

chaitaliSaini commented 6 years ago

@menshikh-iv Is there anything else that I need to add?

menshikh-iv commented 6 years ago

Thanks @chaitaliSaini, I updated all the release notes, made the missing additions (for example, for the glove* vectors), and did a small refactoring of all data.
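
For reference, the dataset examples in this thread all follow the same load-and-print-the-first-record pattern, which could be wrapped in a small helper. This is only a sketch: `preview_records` and `preview_dataset` are hypothetical names that are not part of gensim, and the only gensim call assumed is `api.load` from `gensim.downloader`, used the same way as in the snippets above. It is meant for corpora that yield dict-like records, not for word-vector models.

```python
import json
from itertools import islice


def preview_records(records, n=1):
    """Return the first `n` records of an iterable as pretty-printed JSON strings.

    Hypothetical helper, not part of gensim.
    """
    # islice stops after n items, so the rest of the corpus is never iterated.
    return [json.dumps(record, indent=4) for record in islice(records, n)]


def preview_dataset(name, n=1):
    """Load a gensim-data corpus by name and preview its first `n` records.

    Hypothetical helper; assumes the corpus yields JSON-serializable dicts.
    """
    # Imported lazily so preview_records stays usable without gensim installed.
    import gensim.downloader as api

    return preview_records(api.load(name), n)
```

With gensim installed and the data reachable, something like `print(preview_dataset("fake-news")[0])` should reproduce the kind of output shown earlier in the thread.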