imglib / imglib2

A generic next-generation Java library for image processing
http://imglib2.net/

Add stream methods to IterableRealInterval #336

Closed by tpietzsch 8 months ago

tpietzsch commented 10 months ago

This PR adds IterableRealInterval.stream() and .parallelStream() default methods to access the pixel values in an image as a Stream<T>.

The stream methods rely on a default implementation of IterableRealInterval.spliterator() backed by RealCursor.

Encounter order of the streams matches that of cursors, i.e. Views.flatIterable(img).stream() yields elements in flat iteration order.
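
The mechanism described above (default stream methods built on a cursor-backed spliterator) can be sketched with plain JDK classes. This is only an illustration, not the code from this PR: `CursorSource`, `of` and `inEncounterOrder` are hypothetical names, and a `java.util.Iterator` stands in for `RealCursor`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class StreamDefaultsSketch {

    /** Hypothetical stand-in for IterableRealInterval: anything that hands out a cursor. */
    interface CursorSource<T> {
        Iterator<T> cursor();   // stand-in for RealCursor
        long size();

        // Default spliterator backed by the cursor; ORDERED preserves the cursor's encounter order.
        default Spliterator<T> spliterator() {
            return Spliterators.spliterator(cursor(), size(), Spliterator.ORDERED);
        }

        default Stream<T> stream() {
            return StreamSupport.stream(spliterator(), false);
        }

        default Stream<T> parallelStream() {
            return StreamSupport.stream(spliterator(), true);
        }
    }

    /** Builds a list-backed source (immutable elements, so no proxy reuse is involved here). */
    static CursorSource<Integer> of(Integer... values) {
        List<Integer> list = Arrays.asList(values);
        return new CursorSource<Integer>() {
            @Override public Iterator<Integer> cursor() { return list.iterator(); }
            @Override public long size() { return list.size(); }
        };
    }

    static List<Integer> inEncounterOrder(CursorSource<Integer> source) {
        return source.stream().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        CursorSource<Integer> source = of(3, 1, 2);
        System.out.println(inEncounterOrder(source));                                    // [3, 1, 2]
        System.out.println(source.parallelStream().mapToInt(Integer::intValue).sum());  // 6
    }
}
```

Implementors only have to provide the cursor and the size; both stream methods then come for free, and the sequential stream visits elements in cursor order.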

Usage examples:

Pitfalls

Note that the T elements of the stream are proxies and are reused (as usual). The RealCursorSpliterator implementation ensures that a new proxy is used for each split-off prefix, so parallelStream() works as expected. However, explicit copy operations must be added if stream elements are to be retained (by stateful intermediate or terminal operations).

For example, to collect all DoubleType values between 0 and 1 into a list:

List< DoubleType > values = img.stream()
    .filter( t -> t.get() >= 0.0 && t.get() <= 1.0 )
    .map( DoubleType::copy ) // <-- this is important!
    .collect( Collectors.toList() );

The .map(DoubleType::copy) operation is necessary; otherwise the values list will contain many references to the same DoubleType object (which may not even have a value between 0 and 1). The copy could also be done before the .filter(...) operation, but it is better to do it as late as possible to avoid creating objects unnecessarily.
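
The effect of leaving out the copy can be reproduced without imglib2. The following is a minimal sketch using only the JDK: `Value` is a hypothetical mutable type standing in for DoubleType, and `proxyStream` is a made-up helper that streams through a single reused instance, the way a cursor reuses its proxy.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class ProxyReuseSketch {

    /** Hypothetical mutable value type, standing in for a proxy such as DoubleType. */
    static final class Value {
        private double v;
        Value(double v) { this.v = v; }
        double get() { return v; }
        void set(double v) { this.v = v; }
        Value copy() { return new Value(v); }
    }

    /** Sequential stream over data that hands out ONE reused Value instance. */
    static Stream<Value> proxyStream(double[] data) {
        Iterator<Value> cursor = new Iterator<Value>() {
            private final Value proxy = new Value(0);
            private int i = 0;
            @Override public boolean hasNext() { return i < data.length; }
            @Override public Value next() {
                if (!hasNext()) throw new NoSuchElementException();
                proxy.set(data[i++]);   // overwrite the proxy instead of allocating
                return proxy;
            }
        };
        return StreamSupport.stream(
                Spliterators.spliterator(cursor, data.length, Spliterator.ORDERED), false);
    }

    /** Collects the elements in [0, 1], then reads their values after the stream has finished. */
    static List<Double> collectUnitInterval(double[] data, boolean copy) {
        List<Value> retained = proxyStream(data)
                .filter(t -> t.get() >= 0.0 && t.get() <= 1.0)
                .map(t -> copy ? t.copy() : t)
                .collect(Collectors.toList());
        return retained.stream().map(Value::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] data = {0.25, 5.0, 0.75, 9.0};
        System.out.println(collectUnitInterval(data, true));   // [0.25, 0.75]
        System.out.println(collectUnitInterval(data, false));  // [9.0, 9.0]
    }
}
```

Without the copy, both list entries are the same object, and it holds whatever value the cursor wrote last (here 9.0, which is outside the filtered range).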

Performance

Initial benchmarks show that using streams (even without copying) is a lot slower than explicit for loops, roughly 7 times slower in the case below.

Running the following benchmark

// Img<IntType> img = ArrayImgs.ints(2000, 2000);

@Benchmark
public long benchmarkForLoop() {
    long count = 0;
    for (IntType t : img) {
        if (t.get() > 127)
            ++count;
    }
    return count;
}

@Benchmark
public long benchmarkStream() {
    return img.stream().filter(t -> t.get() > 127).count();
}

results in

Benchmark                                     Mode  Cnt   Score   Error  Units
StreamBenchmark.benchmarkForLoop              avgt   20   4,537 ± 0,095  ms/op
StreamBenchmark.benchmarkStream               avgt   20  33,331 ± 0,193  ms/op

Not ideal... It may be possible to improve performance, but so far I haven't found anything that works.

However, I think this is more of a quality-of-life feature anyway, like the RandomAccessible.getAt(...) convenience methods (https://github.com/imglib/imglib2/pull/246), which I find myself using more often than I expected, despite the performance overhead.

Ideas

There is more to explore in this direction.

ctrueden commented 10 months ago

@tpietzsch This is awesome! Too bad about the performance, but still good to have. :grinning:

About the ImgLibStream: I think this idea is almost necessary to do, because otherwise people will definitely bump into bugs caused by the reuse of proxy type objects. I'm less convinced that you need a public class wrapper, though: it could instead be only an internal Stream subclass that overrides methods as appropriate while adding no new API. If we take care to override most/all of the potential pain points, there is less need for a method like materialize(). Are there other new API methods that occurred to you besides those you mentioned above?

For the localizable stream elements: I like this idea. The method could just be .localizingStream() for symmetry with localizingCursor(), eh? Although I guess we probably also want .localizingParallelStream() :roll_eyes: ... But then as you say, the generics get tough. Maybe instead of baking it into the IterableRealInterval interface, some static utility methods would be easier? Like:

public static < T > Stream< RealCursor< T >> localizingStream( IterableRealInterval< T > iri ) { ... }
public static < T > Stream< Cursor< T >> localizingStream( IterableInterval< T > ii ) { ... }

This avoids the hairiness of an incompatible return type for an overridden method in IterableInterval: Stream< Cursor< T > > is not a subtype of Stream< RealCursor< T > >, because generics are not covariant.

And the code could read almost as nicely:

Img< DoubleType > myImg = ...;
List< Double > valuesPast123 = Streams.localizing( myImg )
    .filter( c -> c.getDoublePosition( 0 ) >  123.0 )
    .map( c -> c.get().getRealDouble() )
    .collect( Collectors.toList() );
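
The non-covariance point can be checked with a small self-contained sketch. The interfaces below are hypothetical miniatures of RealCursor/Cursor and IterableRealInterval/IterableInterval, not the real imglib2 types, and each stream yields just one cursor to keep the example short.

```java
import java.util.stream.Stream;

class LocalizingOverloadSketch {

    // Hypothetical stand-ins for RealCursor and Cursor.
    interface RealCur<T> { T get(); }
    interface Cur<T> extends RealCur<T> { }

    // Hypothetical stand-ins for IterableRealInterval and IterableInterval.
    interface RealSource<T> { RealCur<T> cursor(); }
    interface IntSource<T> extends RealSource<T> {
        @Override Cur<T> cursor();   // allowed: Cur<T> is a subtype of RealCur<T>
        // An instance method returning Stream<RealCur<T>> could NOT be narrowed to
        // Stream<Cur<T>> here, because Stream<Cur<T>> is not a subtype of Stream<RealCur<T>>.
    }

    // Static overloads sidestep the override problem: each has its own return type.
    static <T> Stream<RealCur<T>> localizingStream(RealSource<T> source) {
        return Stream.of(source.cursor());
    }

    static <T> Stream<Cur<T>> localizingStream(IntSource<T> source) {
        return Stream.of(source.cursor());
    }

    static String demo() {
        IntSource<String> ints = () -> () -> "pixel";
        Stream<Cur<String>> narrow = localizingStream(ints);     // IntSource overload
        RealSource<String> reals = ints;
        Stream<RealCur<String>> wide = localizingStream(reals);  // RealSource overload
        return narrow.map(c -> c.get()).findFirst().get()
                + "/" + wide.map(c -> c.get()).findFirst().get();
    }

    public static void main(String[] args) {
        System.out.println(demo());   // pixel/pixel
    }
}
```

One design consequence: overloads are selected by the static type of the argument, so an object held in a variable of the wider type gets the wider stream type even if it implements the narrower interface.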

tpietzsch commented 10 months ago

@ctrueden I made separate issues for the wrapper classes https://github.com/imglib/imglib2/issues/339, and the localizing streams https://github.com/imglib/imglib2/issues/338, and replied there