Closed dannhh closed 1 year ago
Hi, I would like to discuss this issue with you and also ask about some doubts I have regarding the code. Could the `get_metrics` function be more accurate in situations like the following, which currently yield inflated results? For example:

input: They wouldnt even let me finish my glass of wine before offering another.
output: glass of wine:neutral
predict: wine:neutral

In the `get_metrics` function, the comparison uses `in` instead of `==`, which I think ignores the prefix, so the ATE-task results appear better than they should:
```python
for gt_val in gt_list:
    for pred_val in pred_list:
        # if pred_val.lower() == gt_val.lower() or gt_val.lower() == pred_val.lower():
        if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower():
            tp += 1
            break
```
Could you help clarify these doubts? @dannhh @kevinscaria
Thanks~
@Lurkhunter, since this is a generative model, it may produce additional tokens that it considers correct.
Upon further analysis of such inputs, we believe the generated text should be classified as completely wrong only for those samples whose outputs are entirely wrong. In the example you provided, glass of wine / wine is not wrong: the model has extracted the key aspect required.
If the model generates wine / glass of wine and the aspect polarity is also correct, the model is doing its job. A suitable metric for comparing generative models is the ROUGE score. Other approaches that use generative models follow a token-classification approach, so using P, R, and F1 is intuitive there; but since this approach is purely generative, we believe the penalization steps we use to evaluate our model are sound.
However, for your use case, if you believe it should be `==` instead of the `in` operator, you can modify it in your fork of the repository. Hope this helps.
Best, KJS
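For readers unfamiliar with the ROUGE score mentioned above, here is a minimal, hypothetical sketch of the ROUGE-L idea (token-level longest common subsequence); this is an illustration, not the repository's evaluation code:

```python
def lcs_len(a, b):
    # dynamic-programming longest common subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(pred, ref):
    # F1 over the LCS of whitespace tokens (ROUGE-L style, simplified)
    p_toks, r_toks = pred.split(), ref.split()
    lcs = lcs_len(p_toks, r_toks)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p_toks), lcs / len(r_toks)
    return 2 * prec * rec / (prec + rec)

print(rouge_l_f1("wine", "glass of wine"))  # 0.5: partial credit, not all-or-nothing
```

Unlike a binary exact-match rule, this gives the prediction "wine" partial credit against the reference "glass of wine", which is the behaviour the reply is arguing for.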
@dannhh I will correct that, thank you for bringing it to my notice. I added that separator to adjust for a small data sample while I was testing, but it is not generalized, so I will push the updated version of the code by Sunday.
Best, KJS
@kevinscaria Thank you for your reply. The above example is only one scenario from the prediction process. If the predicted result is `glass of`, or just `of`, or even the entire sentence `They wouldnt even let me finish my glass of wine before offering another.`, the judgment condition you set is still satisfied. How can the tokens generated by the generative model be considered valid in those cases?
Best~
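The degenerate cases listed above can be checked directly; with bidirectional substring matching, every one of these candidates counts as a hit against the gold aspect:

```python
gt = "glass of wine"  # the gold aspect term from the example
candidates = [
    "wine",       # reasonable partial extraction
    "glass of",   # truncated extraction
    "of",         # a bare stopword
    "they wouldnt even let me finish my glass of wine before offering another.",  # the whole sentence
]
for pred in candidates:
    # every candidate is a substring of gt, or contains gt as a substring
    assert pred in gt or gt in pred
```

Even a stopword like "of" or a copy of the full input sentence is scored as a correct extraction, which is the weakness this follow-up question points at.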
I think there is a small separator mismatch between the `create_data_in_joint_task_format` function and the `get_metrics` function: in `create_data_in_joint_task_format`, the data is joined by `','`, while in `get_metrics` it is split by `', '`. Can you help me check whether this is a typo?
Many thanks, Dan
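A quick way to see why the mismatch matters, using plain `str.join`/`str.split` on hypothetical label strings:

```python
labels = ["wine:neutral", "service:positive"]

joined = ",".join(labels)    # joined with ',' (no space), as in the data-creation step
parts = joined.split(", ")   # but split with ', ' (with space), as in the metrics step

print(parts)  # ['wine:neutral,service:positive'] — one merged element, not two labels
```

Because the split separator `', '` never occurs in the joined string, the metrics code receives a single merged string instead of the two individual aspect labels, so every multi-label sample would be evaluated incorrectly until the separators agree.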