thundergolfer / reasoning-about-entailment-tensorflow

:school: Tensorflow implementation of "Reasoning About Entailment with Neural Attention"
MIT License

Get the old SICK dataset incorporated #6

Open thundergolfer opened 7 years ago

thundergolfer commented 7 years ago

The Sentences Involving Compositional Knowledge (SICK) dataset is an older dataset for Natural Language Inference tasks.

From the webpage:

The SICK data set consists of about 10,000 English sentence pairs, generated starting from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description data set. We randomly selected a subset of sentence pairs from each of these sources and we applied a 3-step generation process: first, the original sentences were normalized to remove unwanted linguistic phenomena; the normalized sentences were then expanded to obtain up to three new sentences with specific characteristics suitable to CDSM evaluation; as a last step, all the sentences generated in the expansion phase were paired with the normalized sentences in order to obtain the final data set.

Though it is inferior to, and has been superseded by, the SNLI dataset, it might be interesting to train on it and see how much worse the results are.
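As a rough starting point, here is a minimal sketch of how SICK pairs could be read into SNLI-style (premise, hypothesis, label) triples. It assumes the tab-separated layout with the columns `pair_ID`, `sentence_A`, `sentence_B`, `relatedness_score` and `entailment_judgment`; the `read_sick` helper and the label mapping are hypothetical names, not code from this repo, and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed mapping from SICK's upper-case labels to SNLI-style lower-case labels.
SICK_TO_SNLI_LABEL = {
    "ENTAILMENT": "entailment",
    "CONTRADICTION": "contradiction",
    "NEUTRAL": "neutral",
}


def read_sick(handle):
    """Yield (premise, hypothesis, label) triples from a tab-separated SICK-style file.

    Hypothetical helper: the column names are an assumption about the file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        raw_label = (row.get("entailment_judgment") or "").strip().upper()
        label = SICK_TO_SNLI_LABEL.get(raw_label)
        if label is None:
            continue  # skip rows with a missing or unknown label
        yield row["sentence_A"], row["sentence_B"], label


# Made-up rows for illustration only; they are not taken from the real dataset.
SAMPLE = (
    "pair_ID\tsentence_A\tsentence_B\trelatedness_score\tentailment_judgment\n"
    "1\tA dog is running\tAn animal is moving\t4.2\tENTAILMENT\n"
    "2\tA man is singing\tNobody is singing\t3.5\tCONTRADICTION\n"
    "3\tA child is playing\tA woman is cooking\t1.3\tNEUTRAL\n"
)

examples = list(read_sick(io.StringIO(SAMPLE)))
```

With a real file, the same function should work on `open(path, encoding="utf-8", newline="")` in place of the in-memory sample. Mapping the labels to lower case keeps them in the same form as SNLI's, so the two datasets could share one label vocabulary.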

thundergolfer commented 7 years ago

#7 has started this off.