Closed Witiko closed 6 years ago
Nice! :)
@menshikh-iv I pushed an updated semeval-2016_2017-task3-subtaskB-english.json.gz, which now contains the RELQ_RANKING_ORDER
field as an integer rather than a string. It is a minor but convenient change.
Can this dataset be used directly, as follows?
import gensim
import gensim.downloader as api
corpus = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
word2vec = gensim.models.Word2Vec(corpus)
I am getting strange output when I check the vocabulary with word2vec.wv.vocab:
{'RelComments': <gensim.models.keyedvectors.Vocab at 0x7f1740e26a90>,
'RelQuestion': <gensim.models.keyedvectors.Vocab at 0x7f16fad64cf8>,
'THREAD_SEQUENCE': <gensim.models.keyedvectors.Vocab at 0x7f173ee5f128>}
@AMR-KELEG The dataset is not a corpus. You will need to extract the text data you are interested in:
import gensim
import gensim.downloader as api
questions = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
corpus = [question["RelQuestion"]["RelQBody"] for question in questions]
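A note on the output above: iterating over a Python dict yields its keys, which is why the vocabulary consisted of the record keys RelComments, RelQuestion and THREAD_SEQUENCE. Word2Vec also expects each sentence to be a list of tokens rather than a raw string, so the extracted question bodies still need tokenizing. Below is a minimal sketch of that step, not a definitive recipe. The helper name `iter_question_sentences`, the simple regex tokenizer, and the in-memory sample record are assumptions made for illustration; only the `RelQuestion` and `RelQBody` keys come from the thread.

```python
import re

# Assumption: a plain word-character tokenizer is good enough for illustration.
TOKEN_RE = re.compile(r"\w+")


def iter_question_sentences(questions):
    """Yield one list of lowercase tokens per non-empty question body."""
    for question in questions:
        # Records are dicts; tolerate a missing or null RelQuestion / RelQBody.
        body = (question.get("RelQuestion") or {}).get("RelQBody")
        if not body:
            continue
        tokens = TOKEN_RE.findall(body.lower())
        if tokens:
            yield tokens


# Hypothetical sample standing in for the downloaded dataset.
sample = [
    {"RelQuestion": {"RelQBody": "Where can I buy a good bicycle in Doha?"}},
    {"RelQuestion": {"RelQBody": None}},
]
sentences = list(iter_question_sentences(sample))
```

With the real data, the same helper could be applied to the object returned by api.load('semeval-2016-2017-task3-subtaskA-unannotated'), and the resulting list passed to gensim.models.Word2Vec. Materializing a list (rather than passing the generator) matters because training makes more than one pass over the sentences.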
Introduction
I converted the SemEval 2016 and 2017 question-answering datasets into JSON for ease of use. The original datasets are in XML and scattered across several ZIP archives. The JSON files will be used right away in the Gensim documentation for the Soft Cosine Measure (see the respective pull request).
Description
Datasets
semeval-2016_2017-task3-subtaskA-unannotated-english.json.gz (231M) – Example:
Papers
Code
License
These are the licensing notices found in the individual ZIP files with the original XML datasets: