Closed menshikh-iv closed 6 years ago
Table with main metrics
According to Section 5 of the 2016 task paper linked in section “Papers” of #18, the main evaluation metric is MAP (Mean Average Precision). Supplementary evaluation metrics include Mean Reciprocal Rank (MRR), Average Recall (AvgRec), Precision, Recall, F1, and Accuracy.
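To make the two ranking metrics concrete, here is a minimal sketch of MAP and MRR on toy rankings. The function names and the toy relevance lists are assumptions made for illustration; this is not the official SemEval scorer, which may differ in details such as rank cut-offs.

```python
from statistics import mean


def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k that hold a relevant item."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0


def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


# Toy data (hypothetical): one list per query, best-ranked candidate first.
rankings = [
    [True, False, True, False],
    [False, True, False, False],
]
map_score = mean(average_precision(r) for r in rankings)
mrr_score = mean(reciprocal_rank(r) for r in rankings)
```

MAP rewards every relevant item that is ranked high, while MRR only looks at the first relevant item, which is why the two can disagree on the same ranking.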
Along with the updated datatype of the `RELQ_RANKING_ORDER` field, which I proposed in #18 and which you may or may not include, since the impact of the update is minor, I also have the following name changes to propose:
- drop the `-eng` suffix; despite my original belief, the rest of the name should be sufficient to identify the language of the datasets,
- change the `subtaskB` suffix to `subtaskBC`; it appears that the dataset can also be used for Subtask C.

I apologize for these late changes.
"description" for both datasets
SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling.
SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the 2016 task paper linked in section “Papers” of #18.
"fields" for taskB
The main data field for Subtask B is `RELQ_RELEVANCE2ORGQ`, and the main data field for Subtask C is `RELC_RELEVANCE2ORGQ`. The purpose of the numerous supplementary fields is described in Section 4.1 of the 2016 task paper linked in section “Papers” of #18.
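As an illustration of how these two fields can be turned into binary relevance labels, here is a small sketch. It treats "PerfectMatch" and "Relevant" as relevant for Subtask B and "Good" as relevant for Subtask C, which is the same convention the evaluation code in this thread uses. The helper name and the hand-written record are hypothetical; the record mimics only the fields read here.

```python
def relevance_labels(orgquestion):
    """Return (Subtask B labels, Subtask C labels) for one original question."""
    subtask_b = [
        thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"] in ("PerfectMatch", "Relevant")
        for thread in orgquestion["Threads"]]
    subtask_c = [
        relcomment["RELC_RELEVANCE2ORGQ"] == "Good"
        for thread in orgquestion["Threads"]
        for relcomment in thread["RelComments"]]
    return subtask_b, subtask_c


# Hypothetical record with only the fields read above; real records have more.
toy_orgquestion = {
    "Threads": [
        {"RelQuestion": {"RELQ_RELEVANCE2ORGQ": "PerfectMatch"},
         "RelComments": [{"RELC_RELEVANCE2ORGQ": "Good"},
                         {"RELC_RELEVANCE2ORGQ": "Bad"}]},
        {"RelQuestion": {"RELQ_RELEVANCE2ORGQ": "Irrelevant"},
         "RelComments": [{"RELC_RELEVANCE2ORGQ": "PotentiallyUseful"}]},
    ],
}
labels_b, labels_c = relevance_labels(toy_orgquestion)
```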
Full code example
Using the Subtask A unannotated dataset, we build a corpus:
```python
import gensim.downloader as api
from gensim.utils import simple_preprocess

# Tokenize every question subject, question body, and comment into one corpus.
corpus = []
for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQSubject"]))
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQBody"]))
    for relcomment in thread["RelComments"]:
        corpus.append(simple_preprocess(relcomment["RelCText"]))
```
The code example below covers Subtasks B and C and uses the corpus we have just built. For each original thread, we extract the original question and compare it against the questions in the related threads (Subtask B) and the comments in the related threads (Subtask C) using cosine similarity. This produces rankings that we evaluate using the Mean Average Precision (MAP) evaluation metric.
```python
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
import numpy as np

# Build the dictionary from the unannotated Subtask A corpus.
corpus = []
for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQSubject"]))
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQBody"]))
    for relcomment in thread["RelComments"]:
        corpus.append(simple_preprocess(relcomment["RelCText"]))
dictionary = Dictionary(corpus)
datasets = api.load("semeval-2016-2017-task3-subtaskBC")


def produce_test_data(dataset):
    """Yield (original question, candidates per subtask) as bag-of-words vectors."""
    for orgquestion in datasets[dataset]:
        # Subtask B candidates: related questions paired with a relevance flag.
        relquestions = [
            (
                dictionary.doc2bow(
                    simple_preprocess(thread["RelQuestion"]["RelQSubject"])
                    + simple_preprocess(thread["RelQuestion"]["RelQBody"])),
                thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"]
                in ("PerfectMatch", "Relevant"))
            for thread in orgquestion["Threads"]]
        # Subtask C candidates: comments of all related threads with a relevance flag.
        relcomments = [
            (
                dictionary.doc2bow(simple_preprocess(relcomment["RelCText"])),
                relcomment["RELC_RELEVANCE2ORGQ"] == "Good")
            for thread in orgquestion["Threads"]
            for relcomment in thread["RelComments"]]
        orgquestion_bow = dictionary.doc2bow(
            simple_preprocess(orgquestion["OrgQSubject"])
            + simple_preprocess(orgquestion["OrgQBody"]))
        yield (orgquestion_bow, dict(subtaskB=relquestions, subtaskC=relcomments))


def average_precision(similarities, relevance):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    precision = [
        (num_correct + 1) / (num_total + 1)
        for num_correct, num_total in enumerate(
            num_total for num_total, (_, relevant) in enumerate(
                sorted(zip(similarities, relevance), reverse=True))
            if relevant)]
    return np.mean(precision) if precision else 0.0


def evaluate(dataset, subtask):
    """Return the MAP score (in percent) for one dataset split and subtask."""
    results = []
    for orgquestion, subtasks in produce_test_data(dataset):
        documents, relevance = zip(*subtasks[subtask])
        # Cosine similarity between the original question and every candidate.
        index = MatrixSimilarity(documents, num_features=len(dictionary))
        similarities = index[orgquestion]
        assert len(similarities) == len(documents)
        results.append(average_precision(similarities, relevance))
    return np.mean(results) * 100.0


for dataset in ("2016-dev", "2016-test", "2017-test"):
    print("MAP score on the %s dataset:\t%.02f (Subtask B)\t%.02f (Subtask C)" % (
        dataset, evaluate(dataset, "subtaskB"), evaluate(dataset, "subtaskC")))
```
The above code produces the following output for me:
MAP score on the 2016-dev dataset: 66.87 (Subtask B) 16.65 (Subtask C)
MAP score on the 2016-test dataset: 69.51 (Subtask B) 21.94 (Subtask C)
MAP score on the 2017-test dataset: 41.06 (Subtask B) 6.42 (Subtask C)
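For readers without the datasets at hand, the ranking step can be sketched on toy data with plain Python. This is a stand-in for the `Dictionary` / `MatrixSimilarity` pipeline above, not a reimplementation of it: the whitespace tokenizer, the function names, and the three candidate sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words counts from a lowercased whitespace split (toy tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical query and candidates, standing in for an original question
# and the related questions it is compared against.
query = bow("where can i buy a used car")
candidates = [
    "best place to buy a used car",
    "visa renewal for family members",
    "used car dealers near the city centre",
]
scores = [cosine(query, bow(text)) for text in candidates]
# Candidate indices ordered from most to least similar to the query.
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

Pairing such a ranking with relevance labels is all that the MAP computation needs as input.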
re-generate table for `README`
Can I help with this?
Last needed changes
taskB
`taskB` with evaluation, we can combine it with `taskA`
README
CC: @Witiko