Closed AChen-qaq closed 2 years ago
Hi, thank you very much for bringing up this issue! I was also a little puzzled to see that the benchmark results differ from what we used. As for the one case where 1-shot performance was higher than 5-shot, I remember being a little surprised by that as well.
After a quick investigation, I found that both the Few-NERD paper and the website results have changed since we wrote our paper. All our results are taken directly from the version: Few-NERD. I remember checking the benchmark website close to the end of last year and seeing these same results. It looks like both the benchmark results and the paper have changed since then. Interestingly, these updated results may make CONTaiNER come out on top in both the 10-way 1-2 shot and 5-way 1-2 shot settings of the Few-NERD INTER task as well. But it needs more investigation to confirm whether that is the case.
I noticed a gap between the official Few-NERD baselines and the baseline performance reported in your paper. Specifically, in the INTER 5-way 1-2 shot setting, ProtoBERT gets an F1 score of 44.44 in your paper, while in the official Few-NERD benchmarks it gets just 38.83. In fact, this affects not only ProtoBERT but all models in the INTER 5-way 1-2 shot setting. We can see that in the INTER 5-way 1-2 shot setting, StructShot has an F1 score of 57.33, which even exceeds the performance reported in the INTER 5-way 5-10 shot setting. How is this possible? Is there some misunderstanding, or is there a problem with the benchmarks?