larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Compare to itself during deduplication mode #251

Closed by bino013 6 years ago

bino013 commented 6 years ago

Hi,

Could you help me figure out why, in deduplication mode, my program seems to compare each record only to itself?

<duke>
    <schema>
        <threshold>0.82</threshold>
        <maybe-threshold>0.80</maybe-threshold>

        <property>
            <name>NAME</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.09</low>
            <high>0.93</high>
        </property>
        <property>
            <name>COUNTRY</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.34</low>
            <high>0.87</high>
        </property>
        <property>
            <name>EMAIL</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.34</low>
            <high>0.87</high>
        </property>
        <property>
            <name>COMPANY</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.34</low>
            <high>0.87</high>
        </property>
    </schema>

    <jdbc>
        <param name="driver-class" value="org.postgresql.Driver"/>
        <param name="connection-string" value="jdbc:postgresql://localhost:5432/postgres"/>
        <param name="user-name" value="arcaleon"/>
        <param name="password" value=""/>
        <param name="query" value="SELECT * from PERSON"/>

        <column name="name" property="NAME"
                cleaner="no.priv.garshol.duke.cleaners.PersonNameCleaner" />
        <column name="country" property="COUNTRY" />
        <column name="email" property="EMAIL" />
        <column name="company" property="COMPANY" />
    </jdbc>

</duke>

Result:

Records: 0

MATCH 0.9997489395270268
NAME
  'alma b. tyson', 
  'alma b. tyson', 
COUNTRY
  'Equatorial Guinea', 
  'Equatorial Guinea', 
EMAIL
  'sollicitudin.a.malesuada@tristiquesenectuset.net', 
  'sollicitudin.a.malesuada@tristiquesenectuset.net', 
COMPANY
  'Quis Associates', 
  'Quis Associates', 

MATCH 0.9997489395270268
NAME
  'amery h. luna', 
  'amery h. luna', 
COUNTRY
  'Tonga', 
  'Tonga', 
EMAIL
  'quis.pede.Praesent@aliquamadipiscing.ca', 
  'quis.pede.Praesent@aliquamadipiscing.ca', 
COMPANY
  'Pharetra Nam Ac Corporation', 
  'Pharetra Nam Ac Corporation', 

MATCH 0.9997489395270268
NAME
  'kirk b. morton', 
  'kirk b. morton', 
COUNTRY
  'Bolivia', 
  'Bolivia', 
EMAIL
  'montes.nascetur@tortornibh.co.uk', 
  'montes.nascetur@tortornibh.co.uk', 
COMPANY
  'Nulla Industries', 
  'Nulla Industries', 

MATCH 0.9997489395270268
NAME
  'oprah o. adams', 
  'oprah o. adams', 
COUNTRY
  'United States Minor Outlying Islands', 
  'United States Minor Outlying Islands', 
EMAIL
  'vel@acfeugiatnon.ca', 
  'vel@acfeugiatnon.ca', 
COMPANY
  'Mollis Nec Institute', 
  'Mollis Nec Institute', 

MATCH 0.9997489395270268
NAME
  'brenda o. rogers', 
  'brenda o. rogers', 
COUNTRY
  'Azerbaijan', 
  'Azerbaijan', 
EMAIL
  'amet.consectetuer.adipiscing@Duis.ca', 
  'amet.consectetuer.adipiscing@Duis.ca', 
COMPANY
  'Sem Vitae Aliquam Ltd', 
  'Sem Vitae Aliquam Ltd', 

Total records: 5
Total matches: 5
Total non-matches: 0

Note: I have 5 entries in the database, and the values in each entry are unique across all entries.

larsga commented 6 years ago

It looks like you have two problems:

1) Your threshold is too high, so only records that are 100% equal match each other.
2) Your records have no ID fields, so Duke doesn't know the records are the same.
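
For the second point, here is a minimal sketch of how an ID property might be added to the configuration above. It assumes the PERSON table has a unique key column called `id`; that column name is hypothetical and should be replaced with whatever the table actually uses. The property name `ID` is likewise just a label chosen for illustration.

```xml
<!-- Inside <schema>: declare a property of type "id" so Duke can tell
     when two records are really the same record. -->
<property type="id">
    <name>ID</name>
</property>

<!-- Inside <jdbc>: map the table's key column to that property.
     The column name "id" is an assumption about the PERSON table. -->
<column name="id" property="ID" />
```

With an ID property declared, Duke should be able to recognise that a record retrieved as a candidate is the one being processed and skip comparing it to itself.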

bino013 commented 6 years ago

Thanks for the help! I'm closing this now.