Closed · Chacha-Chen closed this issue 4 years ago
We eliminate these not because the BERT output F1 is 0, but because they have too few annotations in the test set (see footnote 7).
I am still working on the test set. We will probably exclude some tasks that have too few annotations in the test set; this will be announced before the evaluation period.
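For concreteness, here is a minimal sketch of how micro-F1 can be pooled across subtasks while skipping those with too few test-set annotations. The threshold value and the counting interface are placeholders for illustration, not the official evaluation script:

```python
from collections import Counter

def micro_f1(per_task_counts, min_gold=10):
    """Pool TP/FP/FN across subtasks into a single micro-F1.

    per_task_counts: dict mapping task name -> (tp, fp, fn).
    min_gold: hypothetical threshold; subtasks with fewer gold
    annotations (tp + fn) than this are excluded from the pooled score.
    """
    pooled = Counter()
    for task, (tp, fp, fn) in per_task_counts.items():
        if tp + fn < min_gold:  # too few annotations in the test set -> skip
            continue
        pooled.update(tp=tp, fp=fp, fn=fn)
    precision = pooled["tp"] / (pooled["tp"] + pooled["fp"]) if pooled["tp"] + pooled["fp"] else 0.0
    recall = pooled["tp"] / (pooled["tp"] + pooled["fn"]) if pooled["tp"] + pooled["fn"] else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```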
Thanks for the timely response.
A follow-up question: will you split the test dataset into the same 5 categories as the training dataset (positive, negative, cannot test, cure, and death)? In that case, we would need to submit 5 prediction JSON files.
I am trying to figure out whether we should build separate models for each category or a single shared model for a subtask such as age that would work for both positive.age and negative.age.
The submitted prediction JSON files for evaluation should be separate for each category. Am I understanding that right?
Sorry that I keep bugging you with all these questions and thanks for the clarification in advance. 😺
Thanks for pointing this out. We will provide separate files for each category (just like the provided training data), so I guess submitting separate files would be easier (I will add this).
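For anyone setting up their pipeline, a minimal sketch of writing one prediction JSON file per category might look like the following; the category identifiers are taken from the discussion above, and the file names, directory layout, and record structure are assumptions for illustration, not the official submission format:

```python
import json
from pathlib import Path

# Category names as mentioned in this thread; the exact identifiers
# expected by the organizers may differ.
CATEGORIES = ["positive", "negative", "cannot_test", "cure", "death"]

def write_submission(predictions, out_dir="submission"):
    """Write one JSON file per category.

    predictions: dict mapping category name -> list of per-tweet
    prediction dicts (record structure assumed for illustration).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for category in CATEGORIES:
        with (out / f"{category}.json").open("w", encoding="utf-8") as f:
            json.dump(predictions.get(category, []), f, indent=2)
```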
No problem. Please feel free to ask any questions.
Hi Shi,
I noticed that in your paper, the calculation of micro-F1 eliminates several tasks, as below,
Those tasks appear to be the ones where the BERT output F1 is equal to zero.
In the final evaluation, will you account for all tasks, or eliminate those and calculate over the rest as shown in your paper?
Thanks.