Closed MinionAttack closed 3 years ago
Hi Iago,
The values look pretty normal, except for Darmstadt with the Graph Parsing approach. The OpeNER/Multibooked results are right where I would expect; Norec is always a bit lower because it's a more diverse dataset, and MPQA is quite hard because of the ambiguity of many of the polar expressions and the size of the holders and targets. But Darmstadt using the graph parser should be higher than MPQA.
Thanks for the clarification, the issue could be because of this change? https://github.com/jerbarnes/semeval22_structured_sentiment/issues/9
I doubt it. These issues aren't common enough to cause a large drop in performance, and the original paper we take the baseline from (https://aclanthology.org/2021.acl-long.263/) used the same data. Perhaps it's the effect of a particularly poor random seed, since there is a bit of variance (±2.0 in the paper)?
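To rule out seed variance, one common approach is to train with several seeds and report the mean and standard deviation of the dev F1. A minimal sketch, assuming a hypothetical `train_and_eval(seed)` function standing in for a full training + evaluation run (the placeholder below just simulates scores):

```python
import random
import statistics

def train_and_eval(seed):
    # Placeholder: stands in for a full baseline training run followed by
    # Sentiment Tuple F1 evaluation on dev.json. Here it only simulates
    # a score near 0.55 with a small seed-dependent perturbation.
    random.seed(seed)
    return 0.55 + random.uniform(-0.02, 0.02)

seeds = [0, 1, 2, 3, 4]
scores = [train_and_eval(s) for s in seeds]
print(f"F1 mean={statistics.mean(scores):.3f} std={statistics.stdev(scores):.3f}")
```

If the standard deviation across seeds is on the order of the reported ±2.0 points, a single low run is not surprising.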
Ah, ok, I'll try to train it again. Thank you.
Hi,
I am using the latest code available in the repo and have trained both baselines (graph parser and sequence labeling). Then I measured the Sentiment Tuple F1 on the dev.json file and I am getting these values: for Graph Parsing, the values are around 0.5xx except for Darmstadt Unis, MPQA and Norec, which are lower, especially Darmstadt Unis. For Sequence Labeling, the values are around 0.3xx except for Darmstadt Unis, MPQA and Norec, which are lower, especially MPQA.
Are those values correct to take as a reference, or am I doing something wrong when training the models or when running inference to get the scores?
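For context on what a tuple-level F1 measures, here is a minimal sketch of an exact-match version over (holder, target, expression, polarity) tuples. Note this is a simplified assumption for illustration only: the official Sentiment Graph F1 weights partial span overlap rather than requiring exact matches.

```python
def tuple_f1(gold, pred):
    # Exact-match precision/recall/F1 over sentiment tuples.
    # Each tuple is assumed to be (holder, target, expression, polarity).
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one of two gold tuples is predicted correctly.
gold = [("holder1", "target1", "expr1", "Positive"),
        ("holder2", "target2", "expr2", "Negative")]
pred = [("holder1", "target1", "expr1", "Positive"),
        ("holder3", "target3", "expr3", "Neutral")]
print(tuple_f1(gold, pred))  # 0.5
```

Because every element of the tuple must match, scores drop quickly on datasets with long or ambiguous holder/target spans, which is consistent with MPQA and Darmstadt being the hardest here.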
Regards.