larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

I can't get the match of records in csv file #236

Open rahafshareef opened 7 years ago

rahafshareef commented 7 years ago

Dear all, please help me solve the case below: I could not get a match between two identical records in a CSV file.

The CSV file contains two records:

    id,country,capital,area
    4202,"Malta","Valletta","320"
    4202,"Malta","Valletta","320"

Note: I have configured an XML file named "countries.xml":

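    <schema>
      <threshold>0.7</threshold>
      <property type="id">
        <name>ID</name>
      </property>
      <property>
        <name>NAME</name>
        <comparator>no.priv.garshol.duke.comparators.QGramComparator</comparator>
        <low>0.09</low>
        <high>0.93</high>
      </property>
      <property>
        <name>AREA</name>
        <comparator>AreaComparator</comparator>
        <low>0.04</low>
        <high>0.73</high>
      </property>
      <property>
        <name>CAPITAL</name>
        <comparator>no.priv.garshol.duke.comparators.QGramComparator</comparator>
        <low>0.12</low>
        <high>0.61</high>
      </property>
    </schema>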

When I call it from Java code:

    public static void main(String[] args) throws Exception {
        // Load the Duke configuration from the XML file
        Configuration config = ConfigLoader.load("countries.xml");

        Processor proc = new Processor(config);
        proc.addMatchListener(new PrintMatchListener(true, true, true, false,
                config.getProperties(),
                true));

        proc.deduplicate();
        proc.close();
    }

the result is:

    Total records: 2
    Total matches: 0
    Total non-matches: 2

larsga commented 7 years ago

The problem is that the two IDs are the same, so when Duke compares the two records against one another, it thinks it's comparing a record with itself, and suppresses the match. If it didn't do this, Duke would report every record as a duplicate of itself.
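
The behaviour described above can be illustrated with a small, self-contained sketch. This is not Duke's actual code: the class `SelfMatchDemo`, the `Rec` type, and the exact-equality comparison are all hypothetical simplifications, assumed here only to show why two rows that share an ID never produce a match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelfMatchDemo {
    // Hypothetical minimal record: an ID plus one compared field.
    public static class Rec {
        final String id;
        final String name;

        public Rec(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Compares every pair of records, but skips pairs whose IDs are equal,
    // mimicking the self-match suppression described above.
    // (Illustration only; Duke's real matching uses comparators and probabilities.)
    public static List<String> findMatches(List<Rec> records) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                Rec a = records.get(i);
                Rec b = records.get(j);
                if (a.id.equals(b.id)) {
                    continue; // same ID: treated as the same record, no match reported
                }
                if (a.name.equals(b.name)) {
                    matches.add(a.id + " = " + b.id);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Both rows share ID 4202, as in the CSV from the question.
        List<Rec> sameId = Arrays.asList(new Rec("4202", "Malta"), new Rec("4202", "Malta"));
        System.out.println(findMatches(sameId).size()); // prints 0

        // Hypothetical fix: give the second row its own ID.
        List<Rec> distinctIds = Arrays.asList(new Rec("4202", "Malta"), new Rec("4203", "Malta"));
        System.out.println(findMatches(distinctIds).size()); // prints 1
    }
}
```

In the sketch, giving each row a distinct ID (4203 is an invented value) is what lets the pair be compared at all, so the same change to the CSV should let Duke compare the two records.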

rahafshareef commented 7 years ago

Thank you so much for your kind support.

May I also ask which languages Duke supports? In other words, can Duke process Arabic text?