Evaluate ATE and ASC separately

lixin4ever / BERT-E2E-ABSA

[EMNLP 2019 Workshop] Exploiting BERT for End-to-End Aspect-based Sentiment Analysis

https://arxiv.org/abs/1910.00883

Apache License 2.0

396 stars, 89 forks

Evaluate ATE and ASC separately #35

Open yassmine-lam opened 3 years ago

yassmine-lam commented 3 years ago

Hi,

Thank you for sharing your code with us. I have a question about predictions. As I read in your paper, you reported a single result covering both tasks. Is it possible to return evaluation scores for each task separately, as you did in https://github.com/lixin4ever/E2E-TBSA, so that we can compare this model against single-task models?

Thank you

lixin4ever commented 3 years ago

In this repo, we propose to handle the E2E-ABSA problem using a sequence tagging model. Since ATE can be formulated as a sequence tagging task, you can evaluate the ATE performance by simply degrading the predicted tags and the gold standard tags of the E2E-ABSA task, i.e., keeping only the boundary part of each tag. However, as ASC is a typical classification task, evaluating its performance in our sequence tagging model is not that straightforward.

yassmine-lam commented 3 years ago

Thank you for your reply. Yes, you are right, but I was wondering how you reported two results, one for ATE and the other for ASC, in this repo (https://github.com/lixin4ever/E2E-TBSA). As I understood it, you separated the tags of each task and evaluated them separately. Is that true? If so, it should be possible to do the same here, shouldn't it?

Sorry, I am confused, so I would be grateful if you could explain this to me.

Thank you

lixin4ever commented 3 years ago

First of all, I want to clarify that ASC, a typical classification task, is different from E2E-ABSA (or "targeted sentiment analysis", "E2E-TBSA"), which is formulated as a sequence tagging task in our paper. In this issue, I mistakenly told you that "targeted sentiment analysis" is equivalent to "aspect sentiment classification" and I think that's the point leading to part of your confusion (sorry for the misinformation).

Returning to your question: in another work of ours, namely https://github.com/lixin4ever/E2E-TBSA, the reason we can report the results of both ATE and E2E-ABSA is that it is a multi-task learning framework, so ATE predictions are explicitly provided. To report the ATE performance in this repo, you may need to degrade the predicted and gold tags of E2E-ABSA, i.e., preserve only the boundary tag and ignore the sentiment tag, and then run the evaluation.
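The tag degradation described above could be sketched roughly as follows. This is only an illustration, not code from this repo: it assumes tags of the form `<boundary>-<sentiment>` (e.g. `B-POS`, `S-NEG`) under a BIEOS scheme, and the helper names `degrade`, `extract_spans`, and `ate_scores` are hypothetical. An aspect term counts as correct only when its span matches a gold span exactly.

```python
def degrade(tags):
    """Keep only the boundary part of each tag: 'B-POS' -> 'B', 'O' -> 'O'."""
    return [t.split("-", 1)[0] for t in tags]


def extract_spans(boundary_tags):
    """Collect (start, end) aspect spans from BIEOS boundary tags (assumed scheme)."""
    spans, start = set(), None
    for i, tag in enumerate(boundary_tags):
        if tag == "S":            # single-token aspect
            spans.add((i, i))
            start = None
        elif tag == "B":          # aspect begins
            start = i
        elif tag == "E" and start is not None:  # aspect ends
            spans.add((start, i))
            start = None
        elif tag == "O":          # outside any aspect, drop an unfinished span
            start = None
        # "I" simply continues the current span
    return spans


def ate_scores(gold_seqs, pred_seqs):
    """Span-level precision, recall and F1 for ATE after dropping sentiment."""
    n_gold = n_pred = n_hit = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g = extract_spans(degrade(gold))
        p = extract_spans(degrade(pred))
        n_gold += len(g)
        n_pred += len(p)
        n_hit += len(g & p)
    precision = n_hit / n_pred if n_pred else 0.0
    recall = n_hit / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the first aspect has the right boundary but the wrong sentiment,
# and the second aspect is missed entirely.
gold = [["O", "B-POS", "E-POS", "O", "S-NEG"]]
pred = [["O", "B-NEG", "E-NEG", "O", "O"]]
precision, recall, f1 = ate_scores(gold, pred)
```

In this toy case the first aspect still counts as a hit for ATE even though its sentiment is wrong, because the sentiment part is discarded before the spans are compared.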

yassmine-lam commented 3 years ago

Thank you very much for the detailed answer; it is clearer to me now. I really appreciate the effort and time you put into answering our questions. One more question, please. For sequence labeling models, people generally use seqeval for evaluation, but in your code you used sklearn metrics. What is the difference between these two frameworks, and which one is more suitable for this task (E2E-ABSA)?

Thank you in advance