kevinscaria / InstructABSA

Instructional learning for Aspect Based Sentiment Analysis [NAACL-2024]
https://aclanthology.org/2024.naacl-short.63/
MIT License

Experimental reproduction results #8

Closed Lurkhunter closed 1 year ago

Lurkhunter commented 1 year ago

I built the datasets for Laptop and Rest14 and then ran aspect term extraction experiments on Laptop. I used two different conditions to decide what counts as a true positive in the get_metrics function, and the results are as follows.

Complete:

if pred_val.lower() == gt_val.lower():  

Incomplete:

if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower():  

code: utils.py

for gt_val in gt_list:
    for pred_val in pred_list:
        # if pred_val.lower() == gt_val.lower() or gt_val.lower() == pred_val.lower():  
        if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower():        
            tp+=1
            break
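To make the difference between the two conditions concrete, here is a minimal sketch. It is not the repository's actual get_metrics: the helper names count_tp and prf, the toy gold and pred lists, and the definitions precision = tp / predicted terms and recall = tp / gold terms are assumptions made for illustration.

```python
def count_tp(gt_list, pred_list, partial=False):
    """Count gold terms matched by at least one predicted term."""
    tp = 0
    for gt_val in gt_list:
        g = gt_val.lower()
        for pred_val in pred_list:
            p = pred_val.lower()
            # Complete match: strings are equal.
            # Incomplete match: either string contains the other.
            matched = (p in g or g in p) if partial else (p == g)
            if matched:
                tp += 1
                break
    return tp


def prf(gold, pred, partial=False):
    """Precision, recall and F1 over a list of sentences (assumed definitions)."""
    tp = sum(count_tp(g, p, partial) for g, p in zip(gold, pred))
    total_gt = sum(len(g) for g in gold)
    total_pred = sum(len(p) for p in pred)
    precision = tp / total_pred if total_pred else 0.0
    recall = tp / total_gt if total_gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): one list of aspect terms per sentence.
gold = [["4G of RAM", "hard drive"], ["overall build"]]
pred = [["RAM", "hard drive"], ["build", "price"]]

exact = prf(gold, pred)                 # complete match
loose = prf(gold, pred, partial=True)   # incomplete (substring) match
print("complete:  ", exact)
print("incomplete:", loose)
```

On this toy input the complete condition finds one true positive and the incomplete condition finds three, so the gap between the two scores depends heavily on how often predictions differ from the gold terms by a few tokens.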

image

kevinscaria commented 1 year ago

@Lurkhunter,

The numbers that you report are too low. It could be because the dataset you are using is not processed in exactly the same way as mine. The difference I get between complete and incomplete matching is around 4%. I will be sharing the datasets in a while and you can try running it again.

Screen Shot 2023-04-21 at 13 20 02 PM

Also, as I mentioned before, there is no reliable way to test the performance of generated text. But since the model is fine-tuned on a given dataset, the possibility of erroneous text being generated is constrained. During the deployment phase, the generated text would have to be further grounded by rules suited to the domain.

We believe the metric that we have defined is a bit unconstrained. However, it was finalized after empirically checking the outputs from different runs. I am attaching the output of {predicted --> ground truth} cases that fall in this gray area. I have manually tagged the

{'ADSl cable --> configure for ADSl cable or wifi',
 'CPU --> fourth-generation Haswell CPU', #Probably Incorrect
 'CPU --> third-generation CPU ("Ivy Bridge")', #Probably Incorrect
 'Customization on mac --> Customization',
 'Intel 4000 graphics --> integrated Intel 4000 graphics',
 'Mac OS improvement --> Mac OS',
 'Mountain Lion --> install Mountain Lion', #Incorrect
 'Premium price for the OS --> OS',
 'Premium price for the OS --> price',
 'RAM --> 4G of RAM',
 'RAM --> 8G of RAM',
 'User upgradeable RAM --> RAM',
 'Windows 7, --> Windows 7',
 'Windows 8 and touchscreen functions --> Windows 8',
 'Windows 8 and touchscreen functions --> touchscreen functions',
 'bluetooth devices --> integrate bluetooth devices',
 'bookmarks --> create your own bookmarks',
 'brushed aluminum --> aluminum',
 'build --> overall build',
 'connectivity --> flexibility for connectivity',
 'durability --> durability of the battery', #Incorrect
 'extender cable --> cable',
 'finger clicking --> two finger clicking',
 'functions --> functions provided by the trackpad', #Incorrect
 'games --> support for games',
 'hard drive --> regular hard drive',
 'hardware --> hardware (keyboard)',
 'installation disk --> installation disk (DVD)',
 'keys --> lit up keys',
 'look --> looks',
 'nail slot --> nail slot on the card',
 'nail slot --> slot',
 'performance --> performance and feature set of the hardware',
 'performance,.20 inch thicker --> performance', #Incorrect
 'plastic case --> slim plastic case',
 'product quality,aesthetics,craftmanship --> aesthetics',
 'product quality,aesthetics,craftmanship --> craftmanship',
 'product quality,aesthetics,craftmanship --> product quality',
 'programs --> Legacy programs',
 'ram --> upgrade the ram',
 'setting --> customize setting',
 'slim profile --> profile',
 'software --> install software', #Incorrect
 'system --> log into the system',
 'voice recording for my vlog --> voice recording',
 'wireless Apple Keyboard --> wireless Apple Keyboard and Mouse'} #Incorrect

It is unfair to penalize all of these samples because of a one-token mismatch or 2-3 additional generated tokens. I have explicitly labeled #Incorrect for 3-4 samples that are clearly erroneous.

However, this approach is solely meant to show that instruction tuning a model improves performance. At this scale, it is not feasible to manually verify every sample. Thus we believe this generalized evaluation script handles these gray areas fairly without penalizing the model to a greater extent.
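As a rough illustration of the gray area described above, the following sketch takes a few of the {predicted --> ground truth} pairs from the list and reports, for each, whether the complete (==) and incomplete (in) conditions hold and how many whitespace tokens the two sides differ by. The helper name compare_pair and the token-count heuristic are assumptions for illustration, not part of the evaluation script.

```python
# A few pairs copied from the list above, in "predicted --> ground truth" form.
pairs = [
    "RAM --> 4G of RAM",
    "look --> looks",
    "Windows 7, --> Windows 7",
    "durability --> durability of the battery",
]


def compare_pair(pair):
    """Return (complete_match, incomplete_match, token_count_difference)."""
    pred, gt = (side.strip().lower() for side in pair.split(" --> "))
    complete = pred == gt
    # Substring test on raw characters, so "look" matches "looks".
    incomplete = pred in gt or gt in pred
    token_diff = abs(len(pred.split()) - len(gt.split()))
    return complete, incomplete, token_diff


report = {pair: compare_pair(pair) for pair in pairs}
for pair, (complete, incomplete, token_diff) in report.items():
    print(f"{pair!r}: complete={complete}, incomplete={incomplete}, "
          f"token_diff={token_diff}")
```

None of these pairs passes the complete condition, while all of them pass the incomplete one. Because the check is a character-level substring test, it accepts the trailing comma in 'Windows 7,' as well as a three-token difference such as 'durability of the battery', which is why the looser condition still needs the manual tagging discussed here.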

Lurkhunter commented 1 year ago

@kevinscaria Thank you very much for your patient explanation. I ran the experiment again and fixed some bugs. The result was 0.898, which is basically similar to yours. In particular, I have read articles from recent years regarding the issue of ==, and their standards are completely consistent, such as GRACE and GPT_emb mentioned in the paper.

kevinscaria commented 1 year ago

Happy to help. Cheers. Best, KJS

twotwoiscute commented 1 year ago

@Lurkhunter, Hi, I am trying to run inference for the ATE task using the model on Hugging Face provided by the author.

Issue:

I cannot get the same F1 score as the paper claims. In the paper, the ATE performance is:

               Lapt14   Rest14
InstructABSA2  92.30    92.10

However, I cannot reproduce the result for either Lapt14 or Rest14.

What I have done

For Rest14, I tried two models:

For Laptop14, I tried two models as well:

The data I processed:

dataset_test = load_dataset("Yaxin/SemEval2014Task4Raw", cache_dir="./Dataset", split="test")
id_te_df = huggingface2df(dataset_test, config.category)

It is worth noting that I extract the laptop and restaurant data using the key domain in dataset_test, and huggingface2df is the function I use to convert the dataset to pandas format.

Lurkhunter commented 1 year ago

Your results make sense because of his partial mathcing standard @twotwoiscute

twotwoiscute commented 1 year ago

@Lurkhunter Sorry, I do not quite understand what "partial mathcing" means. Could you please explain it? And what should I do to reproduce the results reported in the paper?

Lurkhunter commented 1 year ago

sorry, "partial matching"

Please read the other comments.

for gt_val in gt_list:
    for pred_val in pred_list:
        # if pred_val.lower() == gt_val.lower() or gt_val.lower() == pred_val.lower():
        if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower():
            tp+=1
            break

twotwoiscute commented 1 year ago

@Lurkhunter My version of this project only shows

for gt_val in gt_list:
    for pred_val in pred_list:
        if pred_val.lower() == gt_val.lower() or gt_val.lower() == pred_val.lower():
            tp+=1
            break

which does not use in. So does the result in the paper use if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower(): to count tp?

The table below is copied from the paper.

Model          Lapt14  Rest14  Rest15  Rest16
GPT2med        82.04   75.94   -       -
GRACE          87.93   85.45   -       -
BARTABSA       83.52   87.07   75.48   -
IT-MTL         76.93   -       74.03   79.41
InstructABSA1  91.40   92.76   75.23   81.48
InstructABSA2  92.30   92.10   76.64   80.32

Do GPT2med, GRACE, BARTABSA, and IT-MTL count tp in this way?

twotwoiscute commented 1 year ago

@Lurkhunter

I have read articles in recent years regarding the issue of ==, and their standards are completely consistent

Could you please tell me which papers you were referring to in this comment? Thanks!