olehmberg / winter

WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation.
Apache License 2.0

Median Conflict Resolution Utility throws IndexOutOfBoundsException #35

Closed: Steffen911 closed this issue 5 years ago

Steffen911 commented 5 years ago

What happened: I use the Median conflict resolution function in one of my AttributeValueFusers and get an IndexOutOfBoundsException for some of my RecordGroups.

What I expect to happen: The library should return a fused value without throwing an exception.

Root cause: The linked list that is used internally does not throw when it is empty or has more than one element, but a list of size one triggers the exception: the odd-length branch computes list.size() / 2 = 0 and then reads index middle - 1 = -1. https://github.com/olehmberg/winter/blob/master/winter-framework/src/main/java/de/uni_mannheim/informatik/dws/winter/datafusion/conflictresolution/numeric/Median.java#L48
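
To make the failing case concrete, here is a minimal reproduction sketch. It assumes the odd-length branch reads the element at index list.size() / 2 - 1; the class and method names (OddBranchRepro, oddBranchIndex, throwsForSize) are hypothetical and not part of WInte.r.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for the odd-length branch; not the library's actual code.
class OddBranchRepro {

    // Assumed index arithmetic: integer division, then minus one.
    static int oddBranchIndex(int size) {
        return size / 2 - 1;
    }

    // Returns true if reading the computed index from a list of the given size throws.
    static boolean throwsForSize(int size) {
        List<Double> list = new LinkedList<>();
        for (int i = 0; i < size; i++) {
            list.add((double) i);
        }
        try {
            list.get(oddBranchIndex(size));
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        for (int size : new int[] {1, 3, 5}) {
            System.out.println("size=" + size
                    + " index=" + oddBranchIndex(size)
                    + " throws=" + throwsForSize(size));
        }
    }
}
```

Under this assumption only size 1 throws. Sizes 3 and 5 read indices 0 and 1, which are valid positions but lie one below the middle of the list.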

A possible fix would include an update to the if statement like this:

boolean isEven = list.size() % 2 == 0;
if (list.size() == 0) {
    return new FusedValue<>((Double) null);
} else if (list.size() == 1) { // Return the only element in the list as median if length == 1
    return new FusedValue<>(list.get(0));
} else if (isEven) {
    double middle = ((double) list.size() + 1.0) / 2.0;
    double median1 = list.get((int) Math.floor(middle) - 1);
    double median2 = list.get((int) Math.ceil(middle) - 1);

    return new FusedValue<>((median1 + median2) / 2.0);
} else {
    int middle = list.size() / 2;

    return new FusedValue<>(list.get(middle)); // Zero-based middle index; list.get(middle - 1) throws IndexOutOfBoundsException when middle == 0
}

Another possibility would be to round list.size() / 2 correctly. In the current implementation, integer division simply truncates the fractional part, so a list of size one yields 0 instead of 1. See https://stackoverflow.com/a/2654897/6059889 on how to get a correctly rounded integer value.
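
A rough sketch of the rounding approach, assuming the list is already sorted and non-empty. The helper names (MedianRoundingSketch, middleIndexRoundedUp, median) are made up for illustration and do not come from the library; rounding up via Math.ceil is one way to do it, not necessarily the one the linked answer uses.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper, not part of WInte.r.
class MedianRoundingSketch {

    // Zero-based index of the middle element for an odd size:
    // round size / 2 up instead of truncating, then subtract one.
    static int middleIndexRoundedUp(int size) {
        return (int) Math.ceil(size / 2.0) - 1;
    }

    // Median of an already sorted, non-empty list.
    static double median(List<Double> sorted) {
        int size = sorted.size();
        if (size % 2 == 0) {
            // Even size: average the two elements around the middle.
            return (sorted.get(size / 2 - 1) + sorted.get(size / 2)) / 2.0;
        }
        return sorted.get(middleIndexRoundedUp(size));
    }

    public static void main(String[] args) {
        System.out.println(median(Arrays.asList(7.0)));                // prints 7.0
        System.out.println(median(Arrays.asList(1.0, 2.0, 9.0)));      // prints 2.0
        System.out.println(median(Arrays.asList(1.0, 2.0, 3.0, 4.0))); // prints 2.5
    }
}
```

For odd sizes the rounded-up index equals list.size() / 2 with plain integer division, so the single-element case needs no special branch.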

olehmberg commented 5 years ago

Hi @Steffen911, thanks for pointing that out. I fixed the index calculation for lists of odd length, which should solve your problem. Please try the current version in the development branch.

Steffen911 commented 5 years ago

I'm not able to reproduce this issue on the development branch. Thank you @olehmberg!