larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0
615 stars 193 forks source link

Null value returned from TrimCleaner #211

Closed iDmple closed 9 years ago

iDmple commented 9 years ago

Hi Lars,

I'm using TrimCleaner from Yann's plugin for entity resolution.

public class TrimCleaner implements Cleaner {

public String clean(String value) { value = value.trim(); if (value.equals("")) return null; return value; }

}

Some of my input strings are emtpy "", thus TrimCleaner returns null which provokes a NullPointerException. Why not return "" when an empty string is trimmed?

How can I avoid this problem?

Best regards

larsga commented 9 years ago

The cleaner shouldn't be called at all if the value is null. Can you post a traceback?

I guess the TrimCleaner should be ported to the main Duke codebase as well. I can do that.

iDmple commented 9 years ago

OK, in the meantime I have found the problem. I think it would be great to throw an exception or something to avoid that the cleaner is called with an invalid input.

Here is the code that fails:

private void readFields(List<Map<String, Object>> fields, Map<String, Collection> props) { for (Map<String, Object> value : fields) { String field = (String) value.get("field"); String fieldValue = (String) value.get("value"); for (Cleaner cl : (ArrayList) entityParams.get(field).get(CLEANERS)) { fieldValue = cl.clean(fieldValue); } props.put(field, Collections.singleton(fieldValue)); } }

There the fields come directly from the request and could be empty or null.

I replaced it with:

private void readFields(List<Map<String, Object>> fields, Map<String, Collection<String>> props) {
    for (Map<String, Object> value : fields) {
        String field = (String) value.get("field");
        String fieldValue = value.get("value").toString();
        for (Cleaner cl : (ArrayList<Cleaner>) entityParams.get(field).get(CLEANERS)) {
            if(fieldValue!=null && !fieldValue.equals("")) {
                fieldValue = cl.clean(fieldValue);
            }
        }
        props.put(field, Collections.singleton(fieldValue));
    }
}

I made lots of other changes in the plugin because it fails from everywhere with NullPointerException and cast problems. Will do a pull request when my project is ready and I find no more errors.

I think that duke API should be used and not call internal functions like this. It would be better to have elasticsearch as a datasource in fact.

larsga commented 9 years ago

As far as I can tell, this is all issues with the plugin, and not with the Duke code, so I am going to close this.

I agree about the data source.