Shimorina and Belz (2022), "The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP", ask authors to report the statistical power of the data sample used in human evaluation. This is particularly important when the sample is small, and our cloze test with 100 strings clearly falls into this category. It may also be worth measuring the statistical power of the test sets used in automatic evaluation (parsing and MWE tagging in our paper).
For dependency relation prediction and sequence tagging, applying standard formulae may not be straightforward, because the predictions for the individual items in a sequence are not independent.
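As a rough illustration of what such a power estimate could look like, here is a minimal Python sketch. It rests on several assumptions that are not from our paper: that each cloze string is scored as simply correct or incorrect, that the result is compared against a fixed baseline rate `p0` with a one-sided exact binomial test, and that the true rate `p1` is a guess we supply. For the sequence case it uses the Kish design effect as one possible, crude correction for within-sentence dependence; the intra-cluster correlation `icc` would have to be estimated from the data. All function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def critical_value(n, p0, alpha=0.05):
    """Smallest k with P(X >= k | p0) <= alpha (one-sided exact test).

    Returns n + 1 if no rejection region exists at this alpha.
    Intended for small n (e.g. 100 items); very large n would
    overflow the float conversion in binom_pmf.
    """
    tail = 0.0
    k_crit = n + 1
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, p0)
        if tail > alpha:
            break
        k_crit = k
    return k_crit


def exact_power(n, p0, p1, alpha=0.05):
    """Probability of rejecting H0: p = p0 when the true rate is p1."""
    k_crit = critical_value(n, p0, alpha)
    return sum(binom_pmf(k, n, p1) for k in range(k_crit, n + 1))


def effective_sample_size(n_items, mean_cluster_size, icc):
    """Kish design effect: n_eff = n / (1 + (m - 1) * icc).

    n_items: total number of tokens or arcs in the test set.
    mean_cluster_size: average number of items per sentence (m).
    icc: assumed intra-sentence correlation of correctness (0..1).
    """
    return n_items / (1 + (mean_cluster_size - 1) * icc)


if __name__ == "__main__":
    # Hypothetical values: 100 cloze strings, 50% baseline, 65% true rate.
    print(exact_power(100, 0.5, 0.65))
    # Hypothetical test set: 20,000 tokens, 20 tokens/sentence, icc = 0.1.
    print(effective_sample_size(20000, 20, 0.1))
```

The effective sample size could then be used in place of the raw token count in a standard power formula, though a sentence-level bootstrap or permutation test may be a better fit when the dependence structure is unclear.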
More reading: