kevinscaria / InstructABSA

Instructional learning for Aspect Based Sentiment Analysis [NAACL-2024]
https://aclanthology.org/2024.naacl-short.63/
MIT License
147 stars · 24 forks

Mismatched separator in `get_metrics` function #7

Closed · dannhh closed this 1 year ago

dannhh commented 1 year ago

I think there is a small separator mismatch between the `create_data_in_joint_task_format` function and the `get_metrics` function.
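For illustration, here is a hypothetical sketch of the kind of mismatch I mean; the function bodies and separator strings below are my assumptions, not the actual repository code:

```python
# Hypothetical sketch of a separator mismatch (not the actual repo code).

def create_data_in_joint_task_format(aspects):
    # Suppose the data-creation side joins aspect:polarity pairs with ", "...
    return ", ".join(f"{term}:{polarity}" for term, polarity in aspects)

def get_metrics(pred_text):
    # ...while the metrics side splits on "," without the space, leaving a
    # stray leading space on every term after the first.
    return [chunk.split(":") for chunk in pred_text.split(",")]

labels = create_data_in_joint_task_format([("wine", "positive"), ("service", "negative")])
print(get_metrics(labels))
# [['wine', 'positive'], [' service', 'negative']]  <- note the stray space
```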

Can you help me check if there is a typo?

Many thanks, Dan

Lurkhunter commented 1 year ago

Hi, I would like to discuss this issue with you, and also ask about some doubts I have regarding the code.

Thanks~

kevinscaria commented 1 year ago

@Lurkhunter, since this is a generative model, it may produce some additional tokens that it deems correct.

Upon further analysis of such text inputs, we believe the generated text should be classified as completely wrong only for those samples whose outputs are entirely wrong. In the example you provided, `glass of wine` / `wine` is not wrong: the model has extracted the key aspect required.

If the model generates `wine` / `glass of wine` and the aspect polarity is also correct, the model is doing its job. The correct metric for comparing generative models is the ROUGE score. Other approaches that use generative models follow a token-classification approach, so using precision, recall, and F1 is intuitive there. But since this approach is purely generative, we believe the penalization steps used to evaluate our model are sound.
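As a point of comparison, here is a minimal sketch of scoring a partial extraction with Google's `rouge_score` package; the example strings are assumptions for illustration:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# A partial extraction such as "wine" against the gold "glass of wine"
# earns partial credit under ROUGE instead of a hard zero under exact match.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score("glass of wine", "wine")  # score(reference, prediction)
print(scores["rougeL"])  # precision=1.0, recall~0.33, fmeasure=0.5
```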

However, for your use case, if you believe it should be an equality check instead of the `in` operator, you can modify it in your fork of the repository. Hope this helps.
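For concreteness, a minimal sketch of the two checks being contrasted; the variable names and the exact direction of the containment test are my assumptions, not the repository's code:

```python
gold = "glass of wine"
pred = "wine"

# Containment check (the behaviour under discussion): a prediction that
# contains, or is contained in, the gold span counts as a hit.
partial_hit = pred in gold or gold in pred   # True

# Exact-match check (the stricter alternative):
exact_hit = pred == gold                     # False
```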

Best, KJS

kevinscaria commented 1 year ago

@dannhh I will correct that. Thank you for bringing it to my notice. I added that separator to adjust for a small data sample when I was testing. But it's not generalized, so I will push the updated version of the code by Sunday.

Best, KJS

Lurkhunter commented 1 year ago

@kevinscaria Thank you for your reply. The above example is only one scenario that arises during prediction. If the predicted result is `glass of`, `of`, or even the whole sentence `They would not even let me finish my glass of wine before offering another.`, the judgment condition you set is still met. How can tokens generated this way by the model be considered valid?
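To make the concern concrete, a small demonstration, assuming the bidirectional containment check sketched above and the gold aspect `glass of wine`:

```python
gold = "glass of wine"
sentence = "They would not even let me finish my glass of wine before offering another."

for pred in ["glass of", "of", sentence]:
    # Every one of these "predictions" passes the containment check against
    # the gold aspect, even though none is a clean extraction.
    print(repr(pred), gold in pred or pred in gold)
# All three lines print True.
```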

Best~