tshu-w closed 2 years ago
Hi!
I apologize for the late reply.
Yes, we have processed the datasets to remove conflicting information. In the case of Clean-Clean ER, this means that we have removed entities that match with more than one entity from the other dataset. For example, in the Amazon-GP dataset, you can easily check via Excel that the original ground-truth file contains 187 duplicate ids in the left column and 9 duplicates in the right one. In our version of the dataset, we have removed these cases.
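For anyone who would rather script the check than use Excel, here is a minimal sketch of the idea. It assumes the ground truth is already loaded as a list of (left id, right id) pairs. The function names and the toy ids are hypothetical, and dropping every pair that touches a repeated id is only one reading of the description above, not necessarily the exact procedure used for this repository.

```python
from collections import Counter


def count_repeated_ids(pairs):
    """Count ids that occur in more than one ground-truth pair, per column."""
    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    repeated_left = sum(1 for c in left.values() if c > 1)
    repeated_right = sum(1 for c in right.values() if c > 1)
    return repeated_left, repeated_right


def drop_conflicting_pairs(pairs):
    """Keep only pairs whose left id and right id each occur exactly once.

    Assumption: a conflicting entity is removed together with all of its pairs.
    """
    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    return [(l, r) for l, r in pairs if left[l] == 1 and right[r] == 1]


# Toy ground truth (made-up ids): a1 matches two right entities,
# and g3 is matched by two left entities.
pairs = [("a1", "g1"), ("a1", "g2"), ("a2", "g3"), ("a3", "g3"), ("a4", "g4")]

print(count_repeated_ids(pairs))
print(drop_conflicting_pairs(pairs))
```

Running the same two functions on the original and the processed ground-truth files should show whether the repeated ids are gone in the processed version.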
Regarding the Dirty ER cases, they are indeed a combination of the Clean-Clean ER datasets, i.e., we merged the entity profiles into a single dataset and the ground-truth contains the same duplicates as before.
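As a rough illustration of that merge, the sketch below concatenates two profile lists and rewrites the ground truth to point into the merged list. The function name, the use of list positions as ids, and the offset applied to the right-hand ids are all assumptions made for the example; the actual files may identify entities differently.

```python
def merge_to_dirty(left_profiles, right_profiles, matches):
    """Merge two clean datasets into one and remap the ground truth.

    left_profiles, right_profiles: lists of entity profiles (here, dicts).
    matches: list of (left_index, right_index) pairs.
    Assumption: entities are identified by their position in the list, so
    right-hand indices are shifted by the size of the left dataset.
    """
    merged = list(left_profiles) + list(right_profiles)
    offset = len(left_profiles)
    dirty_matches = [(l, r + offset) for l, r in matches]
    return merged, dirty_matches


# Toy data (made-up profiles)
left = [{"title": "a"}, {"title": "b"}]
right = [{"title": "c"}]
merged, dirty_matches = merge_to_dirty(left, right, [(1, 0)])

print(len(merged))
print(dirty_matches)
```

The set of duplicates is unchanged by this step; only the ids they refer to are rewritten.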
Kind regards, George
Hi, thank you for your clear reply!
Hi, I found that the number of records in this repository does not seem to match the original data, and I would like to know whether the data has been processed. For example, the original Amazon-Google dataset has 1363 and 3226 entities and 1300 matches, but the numbers are lower in this project.
Also, many of the Dirty ER datasets seem to just mix the two tables together. Is there any other processing?