piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

SemEval 2016/2017 Task 3, English Subtask A unannotated datasets and English Subtask B datasets #18

Closed by Witiko 6 years ago

Witiko commented 6 years ago

Introduction

I converted the SemEval 2016 and 2017 question answering datasets into JSON for ease of use. The original datasets are in XML and scattered across several ZIP archives. The JSON files will be used right away in the Gensim documentation for the Soft Cosine Measure (see the respective pull request).

Description

Community Question Answering (CQA) forums are gaining popularity online. They are seldom moderated, rather open, and thus they have few restrictions, if any, on who can post and who can answer a question. On the positive side, this means that one can freely ask any question and expect some good, honest answers. On the negative side, it takes effort to go through all possible answers and to make sense of them. For example, it is not unusual for a question to have hundreds of answers, which makes it very time consuming to the user to inspect and to winnow. The challenge we propose may help automate the process of finding good answers to new questions in a community-created discussion forum (e.g., by retrieving similar questions in the forum and identifying the posts in the answer threads of those questions that answer the question well).

We build on the success of the previous editions of our SemEval tasks on CQA, SemEval-2015 Task 3 and SemEval-2016 Task 3, and present an extended edition for SemEval-2017, which incorporates several novel facets.

Datasets

Papers

Code

License

These are the licensing notices found in the individual ZIP files with the original XML datasets:

piskvorky commented 6 years ago

Nice! :)

Witiko commented 6 years ago

@menshikh-iv I pushed an updated semeval-2016_2017-task3-subtaskB-english.json.gz, which now contains the RELQ_RANKING_ORDER field as an integer rather than a string. It is a minor but convenient change.

AMR-KELEG commented 5 years ago

Can this dataset be used directly, as follows?

import gensim
import gensim.downloader as api

corpus = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
word2vec = gensim.models.Word2Vec(corpus)

I get strange output when I inspect the vocabulary, word2vec.wv.vocab:

{'RelComments': <gensim.models.keyedvectors.Vocab at 0x7f1740e26a90>,
 'RelQuestion': <gensim.models.keyedvectors.Vocab at 0x7f16fad64cf8>,
 'THREAD_SEQUENCE': <gensim.models.keyedvectors.Vocab at 0x7f1740e26a90>}

Witiko commented 5 years ago

@AMR-KELEG The dataset is not a corpus. You will need to extract the text data you are interested in:

import gensim
import gensim.downloader as api

questions = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
corpus = [question["RelQuestion"]["RelQBody"] for question in questions]
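The odd vocabulary above is probably explained by how Python iterates over dictionaries: Word2Vec expects each item to be a list of tokens, and iterating over a dict yields its keys, so the keys end up as the "words". The list comprehension above produces raw strings, which would still need to be split into tokens before training. Below is a minimal sketch of that step, not a definitive recipe. The helper name question_bodies_to_sentences, the regex tokenizer, and the sample records are hypothetical; only the RelQuestion, RelQBody, RelComments and THREAD_SEQUENCE keys come from this thread.

```python
import re


def question_bodies_to_sentences(threads):
    """Return one list of lowercase tokens per non-empty question body.

    Assumes each record is a dict with a "RelQuestion" dict that holds
    the question text under "RelQBody" (field names taken from this thread).
    """
    sentences = []
    for thread in threads:
        question = thread.get("RelQuestion") or {}
        body = question.get("RelQBody") or ""  # tolerate missing or null bodies
        tokens = re.findall(r"\w+", body.lower())  # naive word tokenizer
        if tokens:
            sentences.append(tokens)
    return sentences


# Hypothetical records that only mimic the shape of the real dataset.
sample_threads = [
    {
        "THREAD_SEQUENCE": "example-1",
        "RelQuestion": {"RelQBody": "Where can I buy a bike in Doha?"},
        "RelComments": [],
    },
    {
        "THREAD_SEQUENCE": "example-2",
        "RelQuestion": {"RelQBody": None},
        "RelComments": [],
    },
]

# Iterating over a record yields its keys, which is what Word2Vec saw.
keys_seen_as_words = list(sample_threads[0])

sentences = question_bodies_to_sentences(sample_threads)
print(keys_seen_as_words)
print(sentences)
```

With the real data, the records would come from api.load('semeval-2016-2017-task3-subtaskA-unannotated') and the resulting token lists could be passed to gensim.models.Word2Vec. The regex tokenizer is only a stand-in; gensim.utils.simple_preprocess is one alternative.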