scify / JedAIToolkit

An open source, high scalability toolkit in Java for Entity Resolution.
http://jedai.scify.org
Apache License 2.0

GtCSVReader problems with jgrapht ConnectivityInspector #44

Closed: florisheijmans closed this issue 3 years ago

florisheijmans commented 3 years ago

This issue arose when I attempted to reproduce the workflow in: org.scify.jedai.demoworkflows.CsvDblpAcm.java.

While reading the ground truth in DBLP-ACM_perfectMapping.csv (specifically in the GtCSVReader.getDuplicatePairs method), the connected-component detection performed by the jgrapht package does not seem to work.

For some reason I obtain a single cluster of size 2225 and then 5375 more clusters of size 1, which is obviously incorrect since the csv contains about 2225 unique pairs (which should in turn produce 2225 clusters of size 2).
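To make the expected outcome concrete, the connected components of the pairs can be computed directly, independent of jgrapht. The sketch below is only an illustration: the class name `PairComponents`, its method names, and the sample ids are hypothetical and not part of JedAI, and it uses a plain union-find instead of jgrapht's ConnectivityInspector.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PairComponents {

    // Union-find "find" with path compression over string ids.
    static String find(Map<String, String> parent, String x) {
        parent.putIfAbsent(x, x);
        String root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Point every node on the path directly at the root.
        while (!parent.get(x).equals(root)) {
            String next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    // Maps each component size to the number of components of that size.
    static Map<Integer, Integer> componentSizeHistogram(List<String[]> pairs) {
        Map<String, String> parent = new HashMap<>();
        for (String[] pair : pairs) {
            String a = find(parent, pair[0]);
            String b = find(parent, pair[1]);
            if (!a.equals(b)) {
                parent.put(a, b); // merge the two components
            }
        }
        Map<String, Integer> sizeByRoot = new HashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            sizeByRoot.merge(find(parent, id), 1, Integer::sum);
        }
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int size : sizeByRoot.values()) {
            histogram.merge(size, 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        // Hypothetical ids: three disjoint duplicate pairs.
        List<String[]> pairs = List.of(
                new String[] {"d1", "a1"},
                new String[] {"d2", "a2"},
                new String[] {"d3", "a3"});
        System.out.println(componentSizeHistogram(pairs)); // prints {2=3}
    }
}
```

With disjoint pairs, every component has size 2, so a histogram dominated by size-1 components suggests that the ids used as edge endpoints do not match the ids registered as vertices.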

Have you seen this problem before? Maybe the jgrapht package expects a different format than it did previously?

mthanos commented 3 years ago

Hi,

The problem does indeed occur because of the jgrapht update. The double-quote (") characters are trimmed by default by the updated ConnectivityInspector, so the ids are not recognized as existing keys when the ground-truth reader processes them. We will take a more detailed look at this. To resolve the issue for now, you can remove those characters from the DBLP input file or, more generally, use a modified input format.
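One way to apply this workaround is to write a quote-free copy of the CSV file and point the workflow at that copy. This is a minimal sketch, not part of JedAI: the class `StripCsvQuotes` and its methods are hypothetical, and it assumes the file is UTF-8 and that no value needs the quotes (for example, no value contains the separator character).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class StripCsvQuotes {

    // Removes every double-quote character from a single CSV line.
    static String stripQuotes(String line) {
        return line.replace("\"", "");
    }

    // Writes a quote-free copy of 'in' to 'out'; the original file is left untouched.
    // Assumption: UTF-8 encoding and no quoted values that contain the separator.
    static void rewriteWithoutQuotes(Path in, Path out) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            cleaned.add(stripQuotes(line));
        }
        Files.write(out, cleaned, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StripCsvQuotes <input.csv> <output.csv>
        rewriteWithoutQuotes(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

If the entity profile files use the same quoted ids, they would presumably need the same treatment so that the ids remain consistent across all inputs.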

Best regards, Manos


Sent by email in reply to the notification from florisheijmans of Thursday, January 14, 2021 5:11 PM.


florisheijmans commented 3 years ago

Thank you! That fixes the problem.