piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

SNLI (Stanford Natural Language Inference) corpus #32

Open aneesh-joshi opened 5 years ago

aneesh-joshi commented 5 years ago

Dataset download link : https://nlp.stanford.edu/projects/snli/snli_1.0.zip

Dataset website : https://nlp.stanford.edu/projects/snli/

Paper : https://nlp.stanford.edu/pubs/snli_paper.pdf
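
A minimal sketch for downloading and extracting the archive with only the standard library (the `download_snli` helper and its `dest_dir` default are illustrative, not part of any existing API):

import os
import urllib.request
import zipfile

SNLI_URL = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'

def download_snli(dest_dir='.'):
    """Download the SNLI 1.0 zip (if not already present) and extract it under dest_dir."""
    zip_path = os.path.join(dest_dir, 'snli_1.0.zip')
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(SNLI_URL, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)  # the archive contains a top-level snli_1.0/ folder
    return os.path.join(dest_dir, 'snli_1.0')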

Brief Description (from website):

The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, as well as a resource for developing NLP models of any kind.

Example Datapoints:

| Text | Judgments | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction (C, C, C, C, C) | The man is sleeping |
| An older and younger man smiling. | neutral (N, N, E, N, N) | Two men are smiling and laughing at the cats playing on the floor. |
| A black race car starts up in front of a crowd of people. | contradiction (C, C, C, C, C) | A man is driving down a lonely road. |
| A soccer game with multiple males playing. | entailment (E, E, E, E, E) | Some men are playing a sport. |
| A smiling costumed woman is holding an umbrella. | neutral (N, N, E, C, N) | A happy woman in a fairy costume holds an umbrella. |

My script for reading the data:

import json
import os
import re

class SnliReader:
    """Reader for the SNLI dataset
    More details can be found here : https://nlp.stanford.edu/projects/snli/

    Each data point contains two sentences and their gold label ('contradiction', 'entailment' or 'neutral').
    Additionally, it provides the labels given by the individual annotators; we will mostly ignore these.

    Example datapoint:
    gold_label  sentence1_binary_parse  sentence2_binary_parse  sentence1_parse sentence2_parse sentence1   sentence2   captionID   pairID  label1  label2  label3  label4  label5
    neutral ( ( Two women ) ( ( are ( embracing ( while ( holding ( to ( go packages ) ) ) ) ) ) . ) )  ( ( The sisters ) ( ( are ( ( hugging goodbye ) ( while ( holding ( to ( ( go packages ) ( after ( just ( eating lunch ) ) ) ) ) ) ) ) ) . ) )  (ROOT (S (NP (CD Two) (NNS women)) (VP (VBP are) (VP (VBG embracing) (SBAR (IN while) (S (NP (VBG holding)) (VP (TO to) (VP (VB go) (NP (NNS packages)))))))) (. .)))   (ROOT (S (NP (DT The) (NNS sisters)) (VP (VBP are) (VP (VBG hugging) (NP (UH goodbye)) (PP (IN while) (S (VP (VBG holding) (S (VP (TO to) (VP (VB go) (NP (NNS packages)) (PP (IN after) (S (ADVP (RB just)) (VP (VBG eating) (NP (NN lunch))))))))))))) (. .)))    Two women are embracing while holding to go packages.   The sisters are hugging goodbye while holding to go packages after just eating lunch.   4705552913.jpg#2    4705552913.jpg#2r1n neutral entailment  neutral neutral neutral

    Parameters
    ----------
    filepath : str
        Path to the folder containing the extracted SNLI data files.

    """

    def __init__(self, filepath):
        self.filepath = filepath
        self.filename = {}
        self.filename['train'] = 'snli_1.0_train.jsonl'
        self.filename['dev'] = 'snli_1.0_dev.jsonl'
        self.filename['test'] = 'snli_1.0_test.jsonl'
        self.label2index = {'contradiction': 0, 'entailment': 1, 'neutral': 2}

    def get_data(self, split):
        """Returnd the data for the given split

        Parameters
        ----------
        split : {'train', 'test', 'dev'}
            The split of the data

        Returns
        -------
        x1 : list of list of str
            Tokenised first sentences.
        x2 : list of list of str
            Tokenised second sentences.
        labels : list of int
            Gold label indices (see ``label2index``).
        annotator_labels : list of list of str
            The labels given by the individual annotators.
        """
        x1, x2, labels, annotator_labels = [], [], [], []
        with open(os.path.join(self.filepath, self.filename[split]), 'r') as f:
            for line in f:
                line = json.loads(line)
                if line['gold_label'] == '-':
                    # A gold label of '-' means the annotators did not reach a
                    # majority consensus; skip such pairs entirely.
                    continue
                x1.append(self._preprocess(line['sentence1']))
                x2.append(self._preprocess(line['sentence2']))
                labels.append(self.label2index[line['gold_label']])

                annotator_labels.append(line['annotator_labels'])
        return x1, x2, labels, annotator_labels

    def _preprocess(self, sent):
        """lower, strip and split the string and remove unnecessaey characters

        Parameters
        ----------
        sent : str
            The sentence to be preprocessed

        Returns
        -------
        list of str
            The lowercased tokens.
        """
        return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

    def get_label2index(self):
        """Returns the label2index dict"""
        return self.label2index
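
A minimal usage sketch, assuming the archive has been extracted into the working directory (the `snli_1.0` path is illustrative; point it at whatever folder holds the .jsonl files). The `to_categorical` call just illustrates one-hot encoding the integer labels for a 3-way classifier:

reader = SnliReader('snli_1.0')
x1, x2, labels, annotator_labels = reader.get_data('dev')

# tokenised sentence pair and its gold label index
print(x1[0], x2[0], labels[0])

# optionally one-hot encode the integer labels for training
from keras.utils.np_utils import to_categorical
y = to_categorical(labels, num_classes=len(reader.get_label2index()))
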
adikolsur commented 5 years ago

@aneesh-joshi Could you please explain a bit about how to use this script and how it works?

aneesh-joshi commented 5 years ago

Hi @adikolsur, could you mention which parts are unclear? Have you taken a look at the comments and the links (to the paper and website)?

FarhatAbdullah commented 4 years ago

Hi, can someone guide me on how to find an Urdu corpus in this repository?