Hi Song, thanks for your excellent work.
I notice that some sentences in the original SemEval XML files (Laptops_Test_Gold.xml, Restaurants_Test_Gold.) are missing from the processed txt files (e.g., Laptops_Test_Gold.xml.seg), so the statistics do not match.
Laptops_Test_Gold.xml has 654 aspect terms, while Laptops_Test_Gold.xml.seg has only 638. Similarly, Restaurant_Test_Gold.xml and Restaurant_Test_Gold.xml.seg have 1134 and 1120 terms, respectively. That is a difference of 16 aspect terms for laptops and 14 for restaurants.
For example, the sentence with id 463:26, "So noise is reduced at least 50% and the heat is much better, now it doesn't feel hot but warm", in Laptops_Test_Gold.xml does not appear in the .seg txt file.
If you know how the processed txt files were generated, could you please explain why they differ from the XML files in the original SemEval dataset?
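For reference, here is a minimal sketch of how the two counts can be compared. It rests on assumptions that may not match your preprocessing: that aspect terms are stored as `aspectTerm` elements with a `polarity` attribute in the XML, and that each example in a .seg file takes three lines (sentence, aspect term, label). The function names and the sample data are made up for illustration and are not taken from the repository. Breaking the XML count down by polarity might help show which terms are missing.

```python
import xml.etree.ElementTree as ET


def count_xml_terms(xml_text):
    """Count <aspectTerm> elements per polarity in SemEval-style XML text."""
    counts = {}
    root = ET.fromstring(xml_text)
    for term in root.iter("aspectTerm"):
        polarity = term.get("polarity", "unknown")
        counts[polarity] = counts.get(polarity, 0) + 1
    return counts


def count_seg_terms(seg_text, lines_per_example=3):
    """Count examples in .seg-style text.

    Assumption: every example occupies `lines_per_example` non-empty lines
    (sentence, aspect term, label). Adjust if the real layout differs.
    """
    lines = [ln for ln in seg_text.splitlines() if ln.strip()]
    return len(lines) // lines_per_example


# Made-up sample data, only to show the expected input shapes.
sample_xml = """<sentences>
  <sentence id="1">
    <text>The screen is great but the battery is poor.</text>
    <aspectTerms>
      <aspectTerm term="screen" polarity="positive" from="4" to="10"/>
      <aspectTerm term="battery" polarity="negative" from="28" to="35"/>
    </aspectTerms>
  </sentence>
</sentences>"""
sample_seg = "The $T$ is great but the battery is poor .\nscreen\n1\n"

xml_counts = count_xml_terms(sample_xml)
seg_count = count_seg_terms(sample_seg)
print("XML terms by polarity:", xml_counts)
print("XML total:", sum(xml_counts.values()), "| .seg total:", seg_count)
```

To run this on the real data, read each file with `open(path, encoding="utf-8").read()` and pass the text to the matching function.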
Thank you.