Closed dannhh closed 1 year ago
Hi, I would like to discuss this issue with you and also ask about some doubts I have regarding the code. Could the `get_metrics` function be more accurate in situations like the following, which currently yield inflated results? For example:

input: They wouldnt even let me finish my glass of wine before offering another.
output: glass of wine:neutral
predict: wine:neutral

In the `get_metrics` function, the comparison uses `in` instead of `==`, which I think ignores the prefix, so the ATE-task results appear better than they should:
```python
for gt_val in gt_list:
    for pred_val in pred_list:
        # if pred_val.lower() == gt_val.lower() or gt_val.lower() == pred_val.lower():
        if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower():
            tp += 1
            break
```
Could you help clarify these doubts? @dannhh @kevinscaria
Thanks~
@Lurkhunter, since this is a generative model, it may produce additional tokens that it considers correct.
Upon further analysis of such inputs, we believe the generated text should be classified as completely wrong only for those samples whose outputs are entirely wrong. In the example you provided, glass of wine / wine is not wrong: the model has extracted the key aspect required.
If the model generates wine / glass of wine and the aspect polarity is also correct, the model is doing its job. A suitable metric for comparing generative models is the ROUGE score. Other approaches that use generative models follow a token-classification approach, so using P, R, and F1 is intuitive there; but since this approach is purely generative, we believe the penalization steps we use to evaluate our model are sound.
However, for your use case, if you believe it should be `==` instead of the `in` operator, you can modify it in your fork of the repository. Hope this helps.
Best, KJS
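For readers unfamiliar with the ROUGE score mentioned above, here is a minimal, hypothetical sketch of the ROUGE-L idea (token-level longest common subsequence); this is an illustration, not the repository's evaluation code:

```python
def lcs_len(a, b):
    # dynamic-programming longest common subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(pred, ref):
    # F1 over the LCS of whitespace tokens (ROUGE-L style, simplified)
    p_toks, r_toks = pred.split(), ref.split()
    lcs = lcs_len(p_toks, r_toks)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p_toks), lcs / len(r_toks)
    return 2 * prec * rec / (prec + rec)

print(rouge_l_f1("wine", "glass of wine"))  # 0.5: partial credit, not all-or-nothing
```

Unlike a binary exact-match rule, this gives the prediction "wine" partial credit against the reference "glass of wine", which is the behaviour the reply is arguing for.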
@dannhh I will correct that, thank you for bringing it to my notice. I added that separator to adjust for a small data sample while I was testing, but it is not generalized, so I will push the updated version of the code by Sunday.
Best, KJS
@kevinscaria Thank you for your reply. The above example is only one scenario from the prediction process. If the predicted result is `glass of`, or just `of`, or even the entire sentence `They wouldnt even let me finish my glass of wine before offering another.`, the judgment condition you set is still satisfied. How can the tokens generated by the generative model be considered valid in those cases?
Best~
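The degenerate cases listed above can be checked directly; with bidirectional substring matching, every one of these candidates counts as a hit against the gold aspect:

```python
gt = "glass of wine"  # the gold aspect term from the example
candidates = [
    "wine",       # reasonable partial extraction
    "glass of",   # truncated extraction
    "of",         # a bare stopword
    "they wouldnt even let me finish my glass of wine before offering another.",  # the whole sentence
]
for pred in candidates:
    # every candidate is a substring of gt, or contains gt as a substring
    assert pred in gt or gt in pred
```

Even a stopword like "of" or a copy of the full input sentence is scored as a correct extraction, which is the weakness this follow-up question points at.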
I think there is a small separator mismatch between the `create_data_in_joint_task_format` function and the `get_metrics` function: in `create_data_in_joint_task_format`, the data is joined by `','`, while in `get_metrics` it is split by `', '`. Can you help me check whether this is a typo?
Many thanks, Dan
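A quick way to see why the mismatch matters, using plain `str.join`/`str.split` on hypothetical label strings:

```python
labels = ["wine:neutral", "service:positive"]

joined = ",".join(labels)    # joined with ',' (no space), as in the data-creation step
parts = joined.split(", ")   # but split with ', ' (with space), as in the metrics step

print(parts)  # ['wine:neutral,service:positive'] — one merged element, not two labels
```

Because the split separator `', '` never occurs in the joined string, the metrics code receives a single merged string instead of the two individual aspect labels, so every multi-label sample would be evaluated incorrectly until the separators agree.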