larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Does anyone know how to configure for JSON dataSource #258

Closed by Anishx 5 years ago

Anishx commented 5 years ago

How do I write config.xml for a JSON data source?

<data-source class="no.priv.garshol.duke.datasources.JsonDataSource">
  <param name="input-file" value="whatever.json"/>

  ...

The lines above, from the documentation, are ambiguous. An example would help a lot, but there are none that describe this. Could someone please help?

uderline commented 5 years ago

Hi! Have you tried the most common way?

<data-source ...>
    <column name="col1" property="name_property"/>
</data-source>

The entire xml file would look like:

<duke>
    <schema>
        <threshold></threshold>
        <property></property>
    </schema>
    <data-source ...>
        <param ... />
        <column .... />
    </data-source>
</duke>

Anishx commented 5 years ago

@uderline I somehow got it working, but a JSON document can contain several keys with the same name at different levels. How do I specify where to find a value in the JSON? My data looks like the attached file JSON.txt below. How do I select "id", "name" and the extension's "url", for example? There is no example here that demonstrates that.

JSON.txt

I also tried another thing, linking two JSON files, and got only blank output at the command prompt. I used the XML file below: XML.txt

The sample JSON I used: JSONSAMPLE.txt

uderline commented 5 years ago

It seems you cannot specify a key nested inside another key, such as extension.url, like with the MongoDB source. If you absolutely need both url and the url inside extension, I would rename one of the keys.

For the config, define the properties (e.g. id, name and extension) in the <schema>. Then map the JSON keys to those properties with <column> elements in the <data-source>. Both are shown in the XML config below.

<duke>
    <schema>
        <threshold>0.8</threshold>

        <property type="id">
            <name>ID</name>
        </property>

        <property>
            <name>NAME</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.09</low>
            <high>0.93</high>
        </property>

        <property>
            <name>URL</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.04</low>
            <high>0.73</high>
        </property>
    </schema>

    <database class="no.priv.garshol.duke.databases.InMemoryDatabase">
    </database>

    <data-source class="no.priv.garshol.duke.datasources.JsonDataSource">
        <param name="input-file" value="JSON.json" />
        <column name="id" property="ID" />
        <column name="name" property="NAME" />
        <column name="url" property="URL" />
    </data-source>
</duke>

Hope that helps

Anishx commented 5 years ago

@uderline But this JSON structure is used by another, much larger application, so I suppose I may have to change the JSON file specifically for this purpose.
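
One way to avoid touching the file the other application depends on is to generate a flattened copy just for Duke. The sketch below is only an illustration, not part of Duke: it assumes the input is a JSON array of objects (or a single object) and that nested keys such as extension.url should become top-level keys such as extension_url. The function names, the "_" separator and the file names JSON.json and JSON_flat.json are hypothetical.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Return a new dict in which nested dicts are collapsed into top-level keys.

    Example: {"extension": {"url": "x"}} becomes {"extension_url": "x"}.
    Lists are copied through unchanged (an assumption; adapt as needed).
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested object, prefixing its keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def flatten_file(src_path, dst_path):
    """Write a flattened copy of src_path to dst_path; src_path is not modified."""
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)
    # Accept either a single object or an array of objects.
    records = data if isinstance(data, list) else [data]
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump([flatten(r) for r in records], dst, indent=2)


if __name__ == "__main__":
    # Hypothetical record shaped like the case discussed in this thread.
    sample = {
        "id": "1",
        "name": "Acme",
        "url": "http://a.example",
        "extension": {"url": "http://b.example"},
    }
    print(flatten(sample))
    # To convert a real file (names are placeholders):
    # flatten_file("JSON.json", "JSON_flat.json")
```

With the flattened copy as the input-file, the nested URL could then be mapped with a column named extension_url, assuming JsonDataSource matches columns against top-level keys as this thread suggests.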

Anishx commented 5 years ago

But I guess this issue can be closed, since it kind of works now. Thank you, @uderline.

ashubitm commented 5 years ago

Hi, can this be used for a continuous stream of JSON as well? How do we configure it in that case? Regards, ashutosh

uderline commented 5 years ago

Hi @ashubitm, I guess not, because the dataset is held in memory or indexed in a Lucene database before the deduplication/linkage process starts. That's why I wrote a plugin for Elasticsearch that links the data as it is ingested. Combine that with Logstash (for the stream) and you'll have the perfect combo ;) But that's another issue/project.