imagej / imagej-ops

ImageJ Ops: "Write once, run anywhere" image processing
https://imagej.net/libs/imagej-ops
BSD 2-Clause "Simplified" License
88 stars 42 forks source link

Median not returning mean of two middle values for even number of values #620

Closed frauzufall closed 4 years ago

frauzufall commented 4 years ago

I wanted to migrate code from Python to ImageJ2 and noticed the difference in calculating the median. In Ops, it does not return the mean of the two middle values for an even number of values.

ImageJ ij = new ImageJ();
List<DoubleType> values  = new ArrayList<>();
values.add(new DoubleType(0.));
values.add(new DoubleType(1.));
System.out.println(ij.op().stats().median(values));

.. prints 1.0 whereas python would return 0.5.

The test for median would probably not catch this since the unsigned byte values in the middle of a 100x100px image could both be 128.

ctrueden commented 4 years ago

Wow, that's an embarrassing bug!

ctrueden commented 4 years ago

@gselzer Please look into the best way of addressing this, in both the current imagej-ops and the new scijava-ops port. Thank you!

rimadoma commented 4 years ago

I'm on it for imagej-ops. DefaultMedian calls DefaultQuantile with quantile 0.5, which (as I understand it) accomplishes a slightly different task: it returns the number in the array with the nearest rank. I'm thinking of rewriting DefaultMedian with this algorithm that guarantees O(n) time.

imagejan commented 4 years ago

This bug might have been introduced by the "merger" of the Median, Percentile and Quantile ops in #318 (https://github.com/imagej/imagej-ops/commit/cd382ca303132d10af1bf97105b86b3795318a10).

@rimadoma wrote:

accomplishes a slightly different task: it returns the number in the array with the nearest rank.

According to the Quantile definition, it still should return the average of two values for even-sized distributions. Is this a different case for Percentile?

I'm thinking of rewriting DefaultMedian with this algorithm that guarantees O(n) time.

It seems that's how it was before #318. Do you think quantile could benefit from the same fast algorithm, selecting the k-th smallest element for any quantile?

rimadoma commented 4 years ago

According to the Quantile definition, it still should return the average of two values for even-sized distributions.

My interpretation of the current implementation is that it returns the nearest rank. It's always return array.get(k).

This bug might have been introduced by the "merger" of the Median, Percentile and Quantile ops in #318

I see return array.get(k) instead of return (array.get(k) + return array.get(k - 1)) / 2.0 for even n arrays in the old implementation too.

Do you think quantile could benefit from the same fast algorithm, selecting the k-th smallest element for any quantile?

I had a quick look at the paper the algorithm I linked is based on, and it seems to generalisable for quantiles. However, I'll need help there. I could get a PR started for the median implementation, and then ask for help?

Update: I've created a branch with the median implementation

rimadoma commented 4 years ago

So I figured that DefaultQuantile and the code I linked both use quick select to find the median/quantile. The only difference is in how they pick the pivot value. The one I linked has O(n), worst case, but since the O(n^2) worst case for a random pivot is rare, the former doesn't offer much real benefit. Hence, I'm just gonna add a little bug fix instead of reimplementing.

ctrueden commented 4 years ago

Thanks a lot @rimadoma! And I'm glad you didn't rewrite—I should hope the time complexity of rank select is good enough.

frauzufall commented 4 years ago

Seems fixed, thanks @rimadoma!