larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Script Cleaners / Comparators #147

Open fabriziofortino opened 10 years ago

fabriziofortino commented 10 years ago

Starting with Java 6, it is possible to use the JVM as a framework for embedding scripts written in different languages (JavaScript, Groovy, etc.).

A slightly modified version of Rhino 1.6R2 (a JavaScript engine) comes bundled with Java SE 6.

I think it would be really useful to implement a Cleaner (and even a Comparator) to support these languages.

This feature would make it possible to implement new cleaners / comparators without recompiling the code. Moreover, some of the supported languages, like Python, are extremely powerful for string manipulation, and a lot of code for matching algorithms is available online.

A possible configuration for the new cleaner will look like this:

<object class="no.priv.garshol.duke.cleaner.ScriptCleaner" name="JavascriptCleaner">
    <param name="engine" value="javascript"/>
    <param name="script" value="/path/to/my/script.js"/>
</object>
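
To make the proposal concrete, here is a minimal sketch of what such a cleaner might look like on top of the standard `javax.script` API. It is not existing Duke code. The nested `Cleaner` interface is a stand-in for Duke's own interface so the example compiles by itself, the convention that the script defines a function named `clean` is an assumption, and the script is passed as source text rather than loaded from the `script` path shown in the configuration. Note also that recent JDKs no longer bundle a JavaScript engine, so the lookup can return nothing unless an engine is on the classpath.

```java
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptCleanerSketch {

    /** Stand-in for Duke's Cleaner interface (assumed shape: one String-to-String method). */
    interface Cleaner {
        String clean(String value);
    }

    /** Hypothetical cleaner that delegates to a script function named "clean". */
    static class ScriptCleaner implements Cleaner {
        private final Invocable invocable;

        ScriptCleaner(String engineName, String scriptSource) throws ScriptException {
            ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
            if (engine == null) {
                // Happens when no engine with this name is installed in the JVM.
                throw new IllegalArgumentException("No script engine named '" + engineName + "'");
            }
            if (!(engine instanceof Invocable)) {
                throw new IllegalArgumentException("Engine '" + engineName + "' cannot invoke functions");
            }
            engine.eval(scriptSource);          // defines the script's functions once, up front
            this.invocable = (Invocable) engine;
        }

        @Override
        public String clean(String value) {
            try {
                Object result = invocable.invokeFunction("clean", value);
                return result == null ? null : result.toString();
            } catch (ScriptException | NoSuchMethodException e) {
                throw new IllegalStateException("Script cleaner failed", e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Collapse whitespace runs, trim the ends, and lowercase the value.
        String script =
            "function clean(v) {"
          + "  return ('' + v).replace(/\\s+/g, ' ').replace(/^ | $/g, '').toLowerCase();"
          + "}";
        try {
            Cleaner cleaner = new ScriptCleaner("javascript", script);
            System.out.println(cleaner.clean("  J. Random   HACKER "));
        } catch (IllegalArgumentException e) {
            // No JavaScript engine available in this JVM.
            System.out.println(e.getMessage());
        }
    }
}
```

A comparator could follow the same pattern, invoking a two-argument script function instead.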

WDYT?

larsga commented 10 years ago

The original Duke prototype was actually written in Python, and it was striking how much easier it was to do cleaning in that prototype than it is in the Java version. So you do definitely have a point.

One problem is that everyone will want to use their favourite language. I'd want to use Jython, some people will want JRuby, others Groovy, and some JavaScript.

Another is that we may have to add dependencies. Right now, you can actually run Duke without any dependencies at all. Not even Lucene.

But the feature does make a lot of sense...

Actually, what you describe would be easy to do without adding dependencies, provided we don't include it in the main distribution. Quite a few things fall into this category, actually. Logging, for one thing. An XML data source, perhaps. The server component. Etc.

Maybe we should change the project layout so that we can have more modules sitting next to the core? That way, the script cleaner could go there, without adding dependencies to the core?

What do you think?

fabriziofortino commented 10 years ago

Your idea to reshape the project structure sounds good. Perhaps we should open a separate issue for this?

Looking at the sources I see a potential problem: the Lucene dependency is mandatory since Lucene is the default database (see ConfigurationImpl line 82). How can we handle this?

I would wait for the project layout changes before starting work on the script cleaners / comparators. Moreover, I have just implemented an ElasticSearch database (see #132) that could be included in a separate module (maybe named duke-es?).

larsga commented 10 years ago

Apologies for leaving this so long. :-(

Yes, a separate issue for new project structure is a good idea. I think we should also raise the question briefly on the mailing list to see if there are reactions. Actually doing it should be pretty quick.

It's possible to change over to a different default database, but we could just keep the Lucene dependency and say that the core depends on Lucene and that's that.
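
For reference, one way the first option could work is to look the default database up by class name at runtime, so the core carries no compile-time reference to Lucene. The sketch below only illustrates that idea and is not how Duke does it: the helper name is hypothetical, and the two Duke class names in `main` are assumptions about where the implementations live.

```java
import java.util.Arrays;
import java.util.List;

public class DefaultDatabaseSketch {

    /**
     * Returns an instance of the first candidate class that is on the classpath
     * and has a no-argument constructor, or null if none of them can be found.
     */
    static Object instantiateFirstAvailable(List<String> classNames) {
        for (String name : classNames) {
            try {
                return Class.forName(name).getDeclaredConstructor().newInstance();
            } catch (ClassNotFoundException | NoClassDefFoundError e) {
                // This candidate (or one of its dependencies) is missing; try the next one.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + name, e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed class names: prefer Lucene when present, otherwise fall back to an in-memory store.
        Object database = instantiateFirstAvailable(Arrays.asList(
            "no.priv.garshol.duke.databases.LuceneDatabase",
            "no.priv.garshol.duke.databases.InMemoryDatabase"));
        System.out.println(database == null
            ? "no database implementation on the classpath"
            : "using " + database.getClass().getName());
    }
}
```

The trade-off is that a missing jar only shows up at runtime, which is part of why simply keeping the Lucene dependency is the easier route.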

A separate duke-es module sounds great! That's another reason to do a new structure. I'll see if I can get that done before Easter.