Implement shortcut for 0.0 and 1.0 percentile calculations

jOOQ / jOOL

jOOλ - The Missing Parts in Java 8 jOOλ improves the JDK libraries in areas where the Expert Group's focus was elsewhere. It adds tuple support, function support, and a lot of additional functionality around sequential Streams. The JDK 8's main efforts (default methods, lambdas, and the Stream API) were focused around maintaining backwards compatibility and implementing a functional API for parallelism.

http://www.jooq.org/products

Apache License 2.0

2.08k stars 167 forks source link

Implement shortcut for 0.0 and 1.0 percentile calculations #376

Closed lukaseder closed 3 years ago

lukaseder commented 4 years ago

Th 0.0 percentile corresponds to min() and the 1.0 percentile corresponds to max(). We could implement this shortcut to improve the performance of these calculations, which can be performed in O(N) (without a sort), rather than O(N log N)

lukaseder commented 4 years ago

This improvement breaks the following test:

assertEquals(Optional.of("b"), Stream.of("a", "b").collect(percentileBy(1.0, String::length)));

It now returns "a" instead of "b". While this may be an unspecified implementation detail, users might have come to rely on the status quo. Let's be careful here.

JingHuaMan commented 3 years ago

It now returns "a" instead of "b".

Modifying the comparator can solve this problem: If two items are equal in value, force the output of the comparator to be minus. Then items that appear earlier in the input are more likely to appear earlier in the seq.

lukaseder commented 3 years ago

Modifying the comparator can solve this problem: If two items are equal in value, force the output of the comparator to be minus. Then items that appear earlier in the input are more likely to appear earlier in the seq.

As commented in the PR, this is an invalid technique. A Comparator contract mandates that if a < b, then b > a. Your implementation violates this for a == b cases, where a random operand is smaller than the other based on some observed fact (notably that this seems to have worked for our tests, but depending on the sort algorithm, it might as well not work).