larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

This permits having multiple columns #206

Closed programaths closed 8 years ago

programaths commented 9 years ago

As you can see below, the input field TE_NOM is indexed as both TE_NOM and NomSoundex in the database. With the original code, there is a column index mismatch and the program fails. The updated code takes a one-to-many relation into account (one source column binds to many indexed columns).

This is an important feature, since it allows more granular rules and advanced blocking.
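To illustrate the one-to-many binding described above, here is a minimal, self-contained Java sketch. It is not Duke's actual code: the `ColumnFanOut` class, its nested `Column` type, and the `toRecord` helper are hypothetical names made up for this example, and the trim-and-lowercase step is only a simplified stand-in for a real cleaner. It assumes that a column without a `property` attribute binds to a property of the same name, as the configuration in this comment suggests.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ColumnFanOut {
    // Hypothetical stand-in for a <column> element: a source column name
    // plus the property it feeds. The property defaults to the column name.
    static final class Column {
        final String name;
        final String property;

        Column(String name, String property) {
            this.name = name;
            this.property = (property == null) ? name : property;
        }
    }

    // Group the column definitions by source column name, so that one
    // source column can feed any number of properties.
    static Map<String, List<Column>> bySource(List<Column> columns) {
        Map<String, List<Column>> index = new LinkedHashMap<>();
        for (Column c : columns) {
            index.computeIfAbsent(c.name, k -> new ArrayList<>()).add(c);
        }
        return index;
    }

    // Build a property -> value record from one input row, copying each
    // source value to every property bound to that source column.
    static Map<String, String> toRecord(Map<String, String> row, List<Column> columns) {
        Map<String, String> record = new LinkedHashMap<>();
        for (Map.Entry<String, List<Column>> e : bySource(columns).entrySet()) {
            String raw = row.get(e.getKey());
            if (raw == null) {
                continue; // source column absent from this row
            }
            // Simplified stand-in for a cleaner (assumption, not Duke's cleaner).
            String cleaned = raw.trim().toLowerCase(Locale.ROOT);
            for (Column c : e.getValue()) {
                record.put(c.property, cleaned);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        List<Column> columns = List.of(
            new Column("TE_NOM", null),          // binds to property TE_NOM
            new Column("TE_NOM", "NomSoundex")); // same source, second property
        Map<String, String> row = Map.of("TE_NOM", "Dupont");
        System.out.println(toRecord(row, columns)); // prints {TE_NOM=dupont, NomSoundex=dupont}
    }
}
```

Looking columns up by source name, rather than assuming one column per position, is what avoids the kind of index mismatch mentioned above when two definitions share the same source column.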

[...]
        <property>
            <name>TE_NOM</name>
            <comparator>be.etnic.comparator.PersonNameComparator</comparator>
            <low>0.1</low>
            <high>0.50</high>
        </property>

        <property>
            <name>TE_PRENOM</name>
            <comparator>be.etnic.comparator.PersonNameComparator</comparator>
            <low>0.1</low>
            <high>0.60</high>
        </property>

        <property>
            <name>NomSoundex</name>
            <comparator>no.priv.garshol.duke.comparators.MetaphoneComparator</comparator>
            <low>0.1</low>
            <high>0.7</high>
        </property>

        <property>
            <name>PrenomSoundex</name>
            <comparator>no.priv.garshol.duke.comparators.MetaphoneComparator</comparator>
            <low>0.1</low>
            <high>0.8</high>
        </property>
        [...]

        <column name="TE_NOM" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
        <column property="NomSoundex" name="TE_NOM" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
        <column name="TE_PRENOM" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
        <column property="PrenomSoundex" name="TE_PRENOM" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>

larsga commented 8 years ago

Looking at this more closely I see commit bb145a365443640f03148b993f339f945ed451b2 already added this feature. Thank you for the pull request! This was definitely an issue, but it's solved now, so closing the PR.

programaths commented 8 years ago

Yep. Two months later ;-)

larsga commented 8 years ago

Yeah, sorry. :)