s4zong / extract_COVID19_events_from_Twitter

Annotated corpus and code for "Extracting COVID-19 Events from Twitter".
GNU General Public License v3.0

Will you release the testing dataset? #3

Closed Chacha-Chen closed 4 years ago

Chacha-Chen commented 4 years ago

Hi Shi,

In the evaluation phase, will you provide the testing dataset for teams to make predictions on? Also, is the submission format a model file or a file containing the predictions?

Thanks. Chacha

s4zong commented 4 years ago

Hi,

Yes, I think we will be releasing the testing dataset.

We specify the submission format at https://github.com/viczong/extract_COVID19_events_from_Twitter/tree/master/shared_task. It is a file containing predictions.

Thanks,

Chacha-Chen commented 4 years ago

Sorry in advance for the long question; I hope I can articulate it clearly.

For the testing set, are the part1.responses for all the tweets "yes"?

My concern is: if there are tweets in the testing set with part1.response = "no", then all of their part2.responses would be empty. Should we design our model to first identify whether a tweet is valid and then handle the slot-filling task, or should it work on the slot-filling task only? If our model handles only slot filling and it produces predictions for the part2.responses of a tweet whose part1.response is "no" (so the ground truth is empty), how would that be evaluated?

Thanks so much.

s4zong commented 4 years ago

The testing set will be a mix of tweets with part1.response marked as "yes" and "no". The data will be the same as the current training data release, just sampled from a different period of time: the current training tweets are sampled up to 4/26, while the testing tweets are sampled from 4/27 to 6/27. The keywords used to gather tweets are the same. We do not tune the ratio of tweets marked "yes" or "no" for the first question; it simply reflects the actual distribution of the sampled data.

When designing the model, one could either use a cascade approach that first identifies whether a tweet is valid and then makes predictions for the slot-filling questions, or jointly train a single model that combines these two parts. In our experiments, we take the second approach.
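
As a rough illustration of the joint option (not the exact model in this repo), one could share a single BERT encoder between a sentence-level head for the part1 yes/no question and a token-level tagging head per slot for the part2 questions. The class name, head design, and hyperparameters below are illustrative assumptions only:

```python
# Sketch of a joint model: a shared BERT encoder, a sentence-level head for the
# part1 "is this a relevant tweet" question, and a token-level tagging head per
# slot for the part2 questions. Names and head design are illustrative.
import torch.nn as nn
from transformers import BertModel

class JointEventModel(nn.Module):
    def __init__(self, slot_names, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.relevance_head = nn.Linear(hidden, 2)   # part1: yes / no
        self.slot_heads = nn.ModuleDict(             # part2: per-slot span tagging
            {slot: nn.Linear(hidden, 2) for slot in slot_names})

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]        # [CLS] representation
        relevance_logits = self.relevance_head(cls_vec)
        slot_logits = {name: head(out.last_hidden_state)
                       for name, head in self.slot_heads.items()}
        return relevance_logits, slot_logits
```

Training such a model would sum the part1 cross-entropy loss and the per-slot tagging losses, so both questions are learned from the shared encoder; a cascade model would instead train the relevance classifier separately and only run the slot predictions on tweets it marks as valid.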

For evaluation, we will compare the predictions with the gold labels in the following way. Take every span predicted by the model and every gold span in the data, then count true positives, false positives, and false negatives: true positives are predicted spans that appear in the gold labels, false positives are predicted spans that do not appear in the gold labels, and false negatives are gold spans that the model did not predict. So if the model gives a prediction for the part2.responses when part1.response is "no", that prediction will be counted as a false positive.
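
To make the counting concrete, here is a minimal sketch of that span-level scoring, assuming predictions and gold annotations are represented as sets of (tweet_id, slot, span) tuples; the actual scoring script under the `shared_task` folder may represent things differently.

```python
# Minimal sketch of span-level precision / recall / F1, assuming predictions and
# gold annotations are sets of (tweet_id, slot, span) tuples.
def span_prf1(predicted_spans, gold_spans):
    predicted, gold = set(predicted_spans), set(gold_spans)

    true_positives = len(predicted & gold)    # predicted spans that appear in the gold labels
    false_positives = len(predicted - gold)   # predicted spans not in the gold labels
    false_negatives = len(gold - predicted)   # gold spans the model failed to predict

    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    recall = true_positives / (true_positives + false_negatives) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example: the model predicts a span for a tweet whose part1.response is "no".
# The gold set has no spans for that tweet, so the prediction is a false positive.
pred = {("tweet_1", "where", "New York"), ("tweet_2", "who", "my dad")}
gold = {("tweet_2", "who", "my dad")}
print(span_prf1(pred, gold))  # precision 0.5, recall 1.0, F1 ~0.67
```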

Chacha-Chen commented 4 years ago

This has been super helpful. Thanks a lot!