Question about Data - Githubissues

tshu-w commented 2 years ago

Hi, I found that the dataset sizes in this repository do not seem to match the original ones, and I would like to know whether the data has been processed. For example, the original Amazon-Google dataset has 1363 and 3226 entities in its two tables and 1300 matches, but the numbers in this project are smaller.

Also, I see a lot of dirty data that seems to just mix the two tables together. Is there any other processing?

gpapadis commented 2 years ago

Hi!

I apologize for the late reply.

Yes, we have processed the datasets to remove conflicting information. In the case of Clean-Clean ER, this means that we have removed entities that match with more than one entity from the other dataset. For example, in the Amazon-GP dataset, you can easily check via Excel that the original ground-truth file contains 187 duplicate ids in the left column and 9 duplicates in the right one. In our version of the dataset, we have removed these cases.
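As an illustration of the filtering described above (not the authors' actual code), here is a minimal Python sketch. It assumes the ground truth is a list of `(left_id, right_id)` pairs; the function names and the toy ids are hypothetical. It only filters the ground-truth pairs, whereas the processed datasets also drop the affected entities themselves.

```python
from collections import Counter


def count_duplicate_ids(pairs, column):
    """Count distinct ids that occur more than once in one column (0 = left, 1 = right)."""
    counts = Counter(pair[column] for pair in pairs)
    return sum(1 for n in counts.values() if n > 1)


def drop_conflicting_matches(pairs):
    """Keep only pairs whose left id and right id each occur in exactly one pair."""
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    return [
        (left, right)
        for left, right in pairs
        if left_counts[left] == 1 and right_counts[right] == 1
    ]


# Toy ground truth (made-up ids): "a2" matches two right entities,
# and "g4" is matched by two left entities.
ground_truth = [
    ("a1", "g1"),
    ("a2", "g2"),
    ("a2", "g3"),
    ("a3", "g4"),
    ("a4", "g4"),
]

cleaned = drop_conflicting_matches(ground_truth)
```

Running `count_duplicate_ids` on each column of the original ground-truth file is one way to reproduce the kind of check mentioned above without Excel.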

Regarding the Dirty ER cases, they are indeed a combination of the Clean-Clean ER datasets, i.e., we merged the entity profiles into a single dataset and the ground-truth contains the same duplicates as before.
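The merge described above could look roughly like the following Python sketch. It is an assumption-laden illustration, not the toolkit's implementation: entities are identified by their position in each list, and the right-hand ids are shifted by the size of the left dataset so that they stay unique. The function name and the id-offset scheme are hypothetical.

```python
def merge_for_dirty_er(left_profiles, right_profiles, matches):
    """Concatenate two clean datasets and remap the ground truth to the merged ids.

    `matches` holds (left_index, right_index) pairs; in the merged dataset the
    right entity at index j is assumed to sit at position len(left_profiles) + j.
    """
    merged = list(left_profiles) + list(right_profiles)
    offset = len(left_profiles)
    merged_matches = [(i, offset + j) for i, j in matches]
    return merged, merged_matches


# Toy profiles (made-up values) and a ground truth with two matches.
left = [{"title": "photo editor 2"}, {"title": "tax suite"}]
right = [{"title": "tax suite deluxe"}, {"title": "photo editor ii"}, {"title": "antivirus"}]
matches = [(0, 1), (1, 0)]

merged, merged_matches = merge_for_dirty_er(left, right, matches)
```

The number of duplicate pairs is unchanged by the merge; only the ids they refer to are remapped.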

Kind regards, George

tshu-w commented 2 years ago

Hi, thank you for your clear reply!

scify / JedAIToolkit

Question about Data #59