s4zong / extract_COVID19_events_from_Twitter

Annotated corpus and code for "Extracting COVID-19 Events from Twitter".
GNU General Public License v3.0

Micro F1 evaluation #14

Closed Chacha-Chen closed 4 years ago

Chacha-Chen commented 4 years ago

Hi Shi,

I noticed that in your paper, the calculation of micro-F1 excludes several tasks, listed below:

tested_negative: age, close_contact, when
can_not_test: when
death: symptoms

These appear to be the tasks for which the BERT F1 is zero.

In the final evaluation, will you account for all tasks, or eliminate those and compute micro-F1 over the rest, as in your paper?

Thanks.

s4zong commented 4 years ago

We eliminate these not because the BERT F1 is 0, but because they have too few annotations in the test set (see footnote 7).

I am still working on the test set. I think we will probably exclude some tasks with too few annotations in the test set. This will be announced before the evaluation period.
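
For reference, micro-F1 with some subtasks left out can be sketched as below. This is only a minimal illustration, not the official evaluation script: the function name `micro_f1`, the `(TP, FP, FN)` count format, the `min_gold` threshold, and the example numbers are all assumptions made for the sketch.

```python
from typing import Dict, Tuple

# Hypothetical per-subtask counts: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def micro_f1(counts: Dict[str, Counts], min_gold: int = 1) -> float:
    """Pool TP/FP/FN over subtasks that have at least `min_gold` gold annotations."""
    tp = fp = fn = 0
    for task, (t, p, n) in counts.items():
        gold = t + n  # number of gold annotations for this subtask
        if gold < min_gold:
            continue  # subtask has too few annotations, so it is left out
        tp += t
        fp += p
        fn += n
    if tp == 0:
        return 0.0  # also avoids division by zero when nothing is pooled
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts, for illustration only.
example = {
    "positive.age": (8, 2, 2),
    "negative.age": (0, 0, 1),  # a single gold annotation
}
```

Because the counts are pooled before precision and recall are computed, a subtask with very few gold annotations contributes little to the score but can still shift it, so dropping it changes the result only slightly; with the made-up counts above, `micro_f1(example, min_gold=5)` pools only `positive.age`.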

Chacha-Chen commented 4 years ago

Thanks for the timely response.

A follow-up question: will you split the test dataset into 5 categories, as with the training dataset (positive, negative, cannot test, cure, and death)? In that case, we would need to submit 5 prediction JSON files.

I am trying to figure out whether we should build separate models for each category, or a single shared model that predicts a subtask such as age and works for both positive.age and negative.age.

The prediction JSON files submitted for evaluation should be different for each category. Am I understanding this correctly?

Sorry that I keep bugging you with all these questions and thanks for the clarification in advance. 😺

s4zong commented 4 years ago

Thanks for pointing this out. We will provide separate files for each category (just as with the provided training data). So I guess submitting different files would be easier (I will add this).

No problem. Please feel free to ask any questions.