@Lurkhunter,
The numbers you report are too low. It could be because the dataset you are using is not processed exactly the same way as mine. The difference I get between the complete and incomplete match is around 4%. I will be sharing the datasets in a while and you can try running it again.
Also, as I mentioned before, there is no reliable way to test the quality of the generated text. But since the model is fine-tuned on a given dataset, we constrain the possibility of erroneous text being generated. During the deployment phase, the generated text would have to be further grounded by rules suited to the domain.
We acknowledge that the metric we have defined is a bit unconstrained. However, it was finalized after empirically checking the outputs from different runs. I am attaching the {predicted --> ground truth} cases that fall in this gray area, with the clearly erroneous ones manually tagged #Incorrect:
```python
{'ADSl cable --> configure for ADSl cable or wifi',
 'CPU --> fourth-generation Haswell CPU',             # Probably Incorrect
 'CPU --> third-generation CPU ("Ivy Bridge")',       # Probably Incorrect
 'Customization on mac --> Customization',
 'Intel 4000 graphics --> integrated Intel 4000 graphics',
 'Mac OS improvement --> Mac OS',
 'Mountain Lion --> install Mountain Lion',           # Incorrect
 'Premium price for the OS --> OS',
 'Premium price for the OS --> price',
 'RAM --> 4G of RAM',
 'RAM --> 8G of RAM',
 'User upgradeable RAM --> RAM',
 'Windows 7, --> Windows 7',
 'Windows 8 and touchscreen functions --> Windows 8',
 'Windows 8 and touchscreen functions --> touchscreen functions',
 'bluetooth devices --> integrate bluetooth devices',
 'bookmarks --> create your own bookmarks',
 'brushed aluminum --> aluminum',
 'build --> overall build',
 'connectivity --> flexibility for connectivity',
 'durability --> durability of the battery',          # Incorrect
 'extender cable --> cable',
 'finger clicking --> two finger clicking',
 'functions --> functions provided by the trackpad',  # Incorrect
 'games --> support for games',
 'hard drive --> regular hard drive',
 'hardware --> hardware (keyboard)',
 'installation disk --> installation disk (DVD)',
 'keys --> lit up keys',
 'look --> looks',
 'nail slot --> nail slot on the card',
 'nail slot --> slot',
 'performance --> performance and feature set of the hardware',
 'performance,.20 inch thicker --> performance',      # Incorrect
 'plastic case --> slim plastic case',
 'product quality,aesthetics,craftmanship --> aesthetics',
 'product quality,aesthetics,craftmanship --> craftmanship',
 'product quality,aesthetics,craftmanship --> product quality',
 'programs --> Legacy programs',
 'ram --> upgrade the ram',
 'setting --> customize setting',
 'slim profile --> profile',
 'software --> install software',                     # Incorrect
 'system --> log into the system',
 'voice recording for my vlog --> voice recording',
 'wireless Apple Keyboard --> wireless Apple Keyboard and Mouse'}  # Incorrect
```
It would be unfair to penalize all of these samples for a one-token mismatch, or for 2-3 tokens being generated additionally. I have explicitly labeled #Incorrect on the 3-4 samples that are clearly erroneous.
However, this approach is solely meant to show that instruction tuning a model improves performance. At this scale, it is not feasible to manually verify every sample, so we believe this generalized evaluation script handles these gray areas fairly without penalizing the model excessively.
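For concreteness, the gray-area pairs above pass a containment check of the kind quoted later in this thread, while an exact comparison rejects them. A minimal sketch (the pair comes from the list above; the function name is illustrative, not from the repository):

```python
def is_partial_match(pred_val: str, gt_val: str) -> bool:
    # A prediction counts if either string contains the other (case-insensitive).
    p, g = pred_val.lower(), gt_val.lower()
    return p in g or g in p

# One of the gray-area pairs from the list above.
pred, gt = "Intel 4000 graphics", "integrated Intel 4000 graphics"
print(is_partial_match(pred, gt))  # True  -> counted as a true positive
print(pred.lower() == gt.lower())  # False -> rejected under exact matching
```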
@kevinscaria Thank you very much for your patient explanation. I conducted the experiment again and fixed some bugs; the result came out at 0.898, which is basically in line with yours. In particular, I have read articles in recent years regarding the issue of ==, and their standards are completely consistent, such as GRACE and GPT_emb mentioned in the paper.
Happy to help. Cheers. Best, KJS
@Lurkhunter, Hi, I am trying to run inference for the ATE task using the models from Hugging Face provided by the author.
I can't get the same F1 score the paper claims. In the paper, the performance of ATE is:

| Model | Lapt14 | Rest14 |
| --- | --- | --- |
| InstructABSA2 | 92.30 | 92.10 |

However, I cannot reproduce the results for either Lapt14 or Rest14.

For Rest14, I tried two models: with one, the F1 score I get is 0.83, which is far from 92.10; with the other, the F1 score I get is 0.85, which is far from 92.10 as well.

For Laptop14, I tried two models as well. One is ate_tk-instruct-base-def-pos-neg-neut-combined, for which I got an F1 score of 0.897, far from 92.30. The other is ate_tk-instruct-base-def-pos-neg-neut-laptops, for which I got an F1 score of 0.88, far from 92.30 as well.
The data I process:

```python
dataset_test = load_dataset("Yaxin/SemEval2014Task4Raw", cache_dir="./Dataset", split="test")
id_te_df = huggingface2df(dataset_test, config.category)
```

It's worth noting that I extract the laptop and restaurant data using the key `domain` in `dataset_test`; `huggingface2df` is the function I use to convert the dataset to pandas format.
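For reference, a minimal sketch of what such a conversion could look like; this is not the exact helper used above, and everything beyond the `domain` column mentioned in the comment is an assumption:

```python
def huggingface2df(dataset, category):
    # Hypothetical sketch of the helper mentioned above: convert the HF
    # dataset to a pandas DataFrame and keep only rows whose `domain`
    # column matches the requested category (e.g. "laptop" or "restaurant").
    df = dataset.to_pandas()
    return df[df["domain"] == category].reset_index(drop=True)
```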
Your results make sense because of his partial matching standard @twotwoiscute
@Lurkhunter Sorry, I do not quite understand what "partial matching" means. Can you please explain it? And what should I do to reproduce the results reported in the paper?
sorry, "partial matching"
please read other comments
```python
for gt_val in gt_list:
    for pred_val in pred_list:
        if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower():
            tp += 1
            break
```
@Lurkhunter My version of this project only shows
```python
for gt_val in gt_list:
    for pred_val in pred_list:
        if pred_val.lower() == gt_val.lower() or gt_val.lower() == pred_val.lower():
            tp += 1
            break
```
which does not use `in`. So do the results in the paper use `if pred_val.lower() in gt_val.lower() or gt_val.lower() in pred_val.lower():` to count `tp`?
The table shown below is copied from the paper.
| Model | Lapt14 | Rest14 | Rest15 | Rest16 |
| --- | --- | --- | --- | --- |
| GPT2med | 82.04 | 75.94 | - | - |
| GRACE | 87.93 | 85.45 | - | - |
| BARTABSA | 83.52 | 87.07 | 75.48 | - |
| IT-MTL | 76.93 | - | 74.03 | 79.41 |
| InstructABSA1 | 91.40 | 92.76 | 75.23 | 81.48 |
| InstructABSA2 | 92.30 | 92.10 | 76.64 | 80.32 |
Do GPT2med, GRACE, BARTABSA, and IT-MTL count the `tp` in this way?
@Lurkhunter

> I have read articles in recent years regarding the issue of ==, and their standards are completely consistent

Could you please tell me which papers you read, as mentioned in this comment? Thanks!
I tried to build datasets for Laptop and Rest14, and then conducted aspect term extraction experiments on Laptop. Different conditions were used to determine a `true positive` in the `get_metrics` function, and the results are as follows:

Complete:
Incomplete:

code: utils.py
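For anyone comparing the two conditions, here is a minimal, self-contained sketch of such a toggle. This is not the repository's `get_metrics`; the function and flag names are illustrative, and the matching logic follows the snippets quoted earlier in this thread:

```python
def f1_sketch(gt_aspects, pred_aspects, complete=True):
    # gt_aspects / pred_aspects: lists of aspect-term lists, one per sentence.
    # complete=True  -> exact match (==), the stricter condition.
    # complete=False -> partial match (substring containment), as quoted above.
    tp, total_pred, total_gt = 0, 0, 0
    for gt_list, pred_list in zip(gt_aspects, pred_aspects):
        total_gt += len(gt_list)
        total_pred += len(pred_list)
        for gt_val in gt_list:
            for pred_val in pred_list:
                g, p = gt_val.lower(), pred_val.lower()
                if (p == g) if complete else (p in g or g in p):
                    tp += 1
                    break
    precision = tp / total_pred if total_pred else 0.0
    recall = tp / total_gt if total_gt else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One sentence where one prediction is a substring of the ground truth.
gt = [["integrated Intel 4000 graphics", "RAM"]]
pred = [["Intel 4000 graphics", "RAM"]]
print(f1_sketch(gt, pred, complete=True))   # 0.5 (only "RAM" matches exactly)
print(f1_sketch(gt, pred, complete=False))  # 1.0 (both match partially)
```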