imglib / imglib2-ij

Translation between ImgLib & ImageJ data structures (both 1.x and 2)
Other
4 stars 8 forks source link

io package #23

Closed acardona closed 5 years ago

acardona commented 5 years ago

An minimal interface for loading images from file paths or URLs as stacks, which critically enables lazy loading of large image series. The lazy functionality enables a researcher to load and reason about e.g. an entire multi-terabyte 4D data set, but then hyperslice and work with specific sections of it, without incurring in the cost of loading the whole 4D series (which might be impossible given RAM limitations).

The equivalent code would be:

final RandomAccessibleInterval< ? > img = Views.stack( Arrays.stream( paths ) .map( IJ::openImage ) .map( ImageJFunctions::wrap ) .collect( Collectors.toList() ) );

... except that the above is not lazy, but instead loads all paths.

Copying from the commit message:

Load: a class with static methods for loading arrays of file paths as Img or RandomAccessibleInterval.

Offers methods:

The Load.lazyStack methods solves the problems of:

The file path Loader interface is abstract, and a concrete IJLoader class is provided which uses ImageJ's IJ.openImage().

For Load.lazyStack:

ctrueden commented 5 years ago

Why not use the DatasetIOService#open method? It offers CellImg objects backed by imglib2-cache.

acardona commented 5 years ago

Because IJ.openImage has shown to have better performance than scifio for a number of image formats, including .zip and .tiff stacks. And because I haven't been able to find examples or usage documentation on scifio. And because scifio adds a layer of complexity over plain Img or RandomAccessibleInterval that is not always conducive to trying out image processing solutions.

The classes I created here are all agnostic to the image loading library and a Loader that uses scifio could be provided if someone wanted it. The point here is to provide a trivial way to lazy load data, getting an Img/RandomAccessibleInterval right away, not wrapped in abstractions that provide additional data like calibration etc. such as the Dataset, and to do so using the ImageJ IO library that I used to use until now and that I understand well.

acardona commented 5 years ago

To exemplify my point about scifio: the link you provided points to an interface. Which classes implement it is not easy to discover (I have in mind here non-programmers).

Also, the open methods are encumbered with a SCIFIOConfig parameter: why can't there be a simple, openLazy(String[] paths) and get that? In addition, the open method expects a single String, even the openAll method does--what does that even mean? That it will open all files in the directory? In what order? Instead of asking the user to learn to use the SCIFIOConfig which I guess offers way to filter files and sort them, it's far easier to ask the user to apply generic methods she already knows for getting lists of file paths, filtering them and sorting them, and then providing that to the IO library.

SCIFIO has fallen short of usability for casual users for years. A simplification layer over SCIFIO is very much needed if SCIFIO is to be used. Lacking such simplification, this proposed mini-library provides just what is needed to get the job done in one brief single command.

ctrueden commented 5 years ago

The classes I created here are all agnostic to the image loading library and a Loader that uses scifio could be provided if someone wanted it.

Then the Loader does not belong in this component (imglib2-ij), but rather somewhere more general. Only the IJLoader belongs here.

SCIFIO has fallen short of usability for casual users for years. A simplification layer over SCIFIO is very much needed if SCIFIO is to be used. I haven't been able to find examples or usage documentation on scifio.

Like this?

I really don't know how to make it any simpler than:

image = ij.io().open("/path/to/myDataset")
acardona commented 5 years ago

Thanks for these: they are trivial examples, not capturing e.g. lazy loading of multi-terabyte 4D time series. They are not representative of actual lab usage. And they are scifio: some file formats have horrendous performance, such as zip files. Even opening KLB files via the "ImageJ2" checkbox enabled in Fiji is much slower than directly using the KLB library. There's something fundamentally off about the scifio framework. In addition: I find the magical "#@" parameters confusing, and hard to look up documentation for, and I am not alone on this: in the field, users prefer importing classes whose documentation they can look up, and I am with them on this. My impression is that the "#@" approach fell short of the promise of simplification, and added obfuscation and complexity instead.

On the Loader not belonging to imglib2-ij: imglib2 core wants to remain simple, and has so far avoided tying itself up with any IO interfaces. Hence imglib2-ij seems like a good place for the io lib. Otherwise, I can create a new repository imglib2-io, but the mere thought of the POM hell and release hell puts me off from that. imgib2-ij is simple enough and given the use cases and that it's already a part of Fiji, it seems like a good home for the io package, particularly given the dependency of IJLoader on ImageJ.

acardona commented 5 years ago

To further add that .zip and gzip and bzip2 formats are more and more common, as labs try to cope with data storage costs of large data sets. SCIFIO being slow with them is a major problem.

What's more, this motivates me to add IJLoader subclasses that can unpack compressed formats not currently handled by ImageJ such as bz2. At least there'll be a quick solution.

tpietzsch commented 5 years ago

I dislike that this does not reuse imglib2-cache framework which was made exactly for stuff like this. You already have a loader for Cell, it should be quite easy to make this work. Maybe this example is good to get started. (not exactly what you need, but very close).

I don't want to push don't-reinvent-the-wheel for pure aesthetic reasons too much, but in this case imglib2-cache is just a superior wheel... For example, imglib2 caches load data (e.g. Cell) synchronized per element, not in the blanket one-load-blocks-all-other-loads way that your SoftCachingLoader does it. So you get multi-threaded loading. You get for free the option to make the images writable (modifications cached out to disk, and read-back of modifications coordinated with cell loading etc). CachedCellImg variants play nicely with BigDataViewer: You can make volatile views that will not block visualization for loading cells, instead asynchronously handling cell loading in a loader thread pool. (This should by the way also work for SCIFIOCellImg which is based on CachedCellImg). There are various cache types (weakref, softref, bounded strongref, bounded softref + weakref, ...) that can be trivially exchanged. Caches and thread pools can be shared between multiple images. Etc, etc.

What I do like is the AccessProxy stuff for wrapping arbitrary image types. This should be really useful. I would maybe split this out and put it into imglib2-cache or even core.

acardona commented 5 years ago

Hi @tpietzsch,

Thanks for pointing out imglib2-cache. Been meaning to have a good look at it. It is powerful, and also not trivial to grasp. The examples you pointed to will help a lot.

A point for which I need clarification: I was under the impression that only LazyCellImg was loaded lazily. I see that the CheckerboardLoader from your example implements a form of lazy loading, backed up by a cache.

Will reimplement Load using imglib2-cache following the example above.

On the proxies: glad you find them useful. They are, in some sense, a hack, or a bypass. My expectation going in was that at the level of Cell there would be support for holding an IterableInterval or a RandomAccessibleInterval, but to my surprise Cell only handles arrays. Hence the proxies for images of types other than ArrayImg.

The one performance drawback of these proxies is that they are based on RandomAccess even when linear read out with a Cursor would be desired. Presently I don't know of a way out of this.

Notice that I only implemented 3 proxies, for GenericByteType, GenericShortType and RealType, the latter working as a float[] array and being a sort of catch-all. More types could/should be added for e.g. GenerincIntType and GenericLongType, which shouldn't be handled as floats. They would be trivial to add.

The decision is yours to move the proxies to imglib2-core, or to imglib2-cache. I'd be happy that they exist anywhere.

In summary:

  1. I would like something like Load.lazyStack(List<String> paths) to exist.
  2. Will reimplement Load.lazyStack to use imglib2-cache. That means wherever this IO package ends up, it will depend on imglib2-cache.
  3. I would like something like IJLoader to exist, purely for performance reasons, particularly for image formats like ".zip" that are currently severely mishandled by SCIFIO (we are talking multiple orders of magnitude slow downs in loading). The imglib2-ij is a good package for IJLoader.
  4. Would like the proxies to exist in a repository, doesn't matter to me which one.

The question remains as to what to do about the abstract Loader and therefore Load itself, which is agnostic to any IO library. I'm fine with them staying here in imglib2-ij. If eventually they get moved elsewhere, that's fine with me.

Finally to note that the thought ocurred to me that what I wanted was a View of List<String> as a List<Img>. I understand the package partitioning that you all always want precludes this. But it would make so much sense to the end user.

acardona commented 5 years ago

I have just pushed a new method Load.lazyStackCache that uses the ReadOnlyCachedCellImgFactory.createWithCacheLoader. It works but has issues that seem related to the dimensions of the cell, but these are exactly the same as before.

Would appreciate if @tpietzsch could have a look. For testing, the 3 .zip test images are here: https://www.dropbox.com/sh/gd1lnpzkxdjsh1a/AABmTNtQ1NoKuha_Rw9HOzzsa?dl=0

tpietzsch commented 5 years ago

On 22. Oct 2018, at 15:47, Albert Cardona notifications@github.com wrote:

Hi @tpietzsch https://github.com/tpietzsch,

Thanks for pointing out imglib2-cache. Been meaning to have a good look at it. It is powerful, and also not trivial to grasp. The examples you pointed to will help a lot.

A point for which I need clarification: I was under the impression that only LazyCellImg was loaded lazily. I see that the CheckerboardLoader from [your example] (https://github.com/imglib/imglib2-cache-examples/blob/master/src/main/java/net/imglib2/cache/lowlevel/example03/Example03.java https://github.com/imglib/imglib2-cache-examples/blob/master/src/main/java/net/imglib2/cache/lowlevel/example03/Example03.java) implements a form of lazy loading, backed up by a cache.

It is lazily loaded in the sense that a Cell is only loaded when a pixel value inside it is accessed.

The volatile goes beyond that. It is useful for interactive rendering (and not much else). Lazily loaded as described above still means that a RandomAccess reading pixels, when entering an unloaded Cell, blocks until the loading is done. For rendering in BDV that means that rendering a frame blocks until all necessary data is loaded. The volatile view will have a different type (UnsignedShortType will become VolatileUnsignedShortType for example) that has a boolean isValid() method. When a RandomAccess from the BDV renderer enters a unloaded cell, that cell will be immediately there, albeit with invalid data. The actual loading is enqueued for being done asynchronously ASAP. When loading is eventually finished, the cell data will become valid. BDV renders invalid data as zero values and retries painting until all painted pixels come from valid data. (That volatile type and rendering thing was Saalfelds idea originally.)

Importantly, the volatile view shares the same cache and data as the original image. When you access an unloaded pixel in the original image, it blocks until loaded, as it should for any processing or purpose other than visualization. Data loaded in this way will be immediately there in the volatile view. And vice versa. And nothing is loaded twice, even when accessing unloaded data simultaneously by both ways.

(To make this work, accesses need a volatile flag, which I mentioned as being the next step in my previous email...)

Will reimplement Load using imglib2-cache following the example above.

On the proxies: glad you find them useful. They are, in some sense, a hack, or a bypass. My expectation going in was that at the level of Cell there would be support for holding an IterableInterval or a RandomAccessibleInterval, but to my surprise Cell only handles arrays. Hence the proxies for images of types other than ArrayImg.

The one performance drawback of these proxies is that they are based on RandomAccess even when linear read out with a Cursor would be desired. Presently I don't know of a way out of this.

Notice that I only implemented 3 proxies, for GenericByteType, GenericShortType and RealType, the latter working as a float[] array and being a sort of catch-all. More types could/should be added for e.g. GenerincIntType and GenericLongType, which shouldn't be handled as floats. They would be trivial to add.

Yes, the CellImg is a NativeImg and does everything on Accesses directly. For historical and performance reasons.

There is incomplete work about implementing something similar for general RandomAccessibleIntervals as cells https://github.com/imglib/imglib2/pull/179 https://github.com/imglib/imglib2/pull/179 But there is still problems with this (stuck somewhere in a german email discussion with @dietzc waiting for comments from @MarcelWiedenmann, I should really dig this out and put it into the github PR, ...) And there is not thought about lazy initialization, caching, etc in there yet.

So, this is an open problem, I know. I was thinking that the ProxyAccesses are a nice workaround for the time being.

Thinking about it, I just noticed one problem about the ProxyAccesses, that is that their getValue() and setValue() methods need to be synchronized, Otherwise multiple RandomAccesses in the same cell will interfere in multithreaded scenario. This in turn means that there maybe a performance hit. Argh...

The decision is yours to move the proxies to imglib2-core, or to imgbli-2cache. I'd be happy that they exist anywhere.

In summary:

I would like something like Load.lazyStack(List paths) to exist. Will reimplement Load.lazyStack to use imglib2-cache. That means wherever this IO package ends up, it will depend on imglib2-cache. I would like something like IJLoader to exist, purely for performance reasons, particularly for image formats like ".zip" that are currently severely mishandled by SCIFIO (we are talking multiple orders of magnitude slow downs in loading). The imglib2-ij is a good package for IJLoader. Would like the proxies to exist in a repository, doesn't matter to me which one. The question remains as to what to do about the abstract Loader and therefore Load itself, which is agnostic to any IO library. I'm fine with them staying here in imglib2-ij. If eventually they get moved elsewhere, that's fine with me.

I would leave them here until the PR settles, then move them to cache or core. Does anyone have opinions of where they should best live?

Finally to note that the thought ocurred to me that what I wanted was a View of List as a List. I understand the package partitioning that you all always want precludes this. But it would make so much sense to the end user.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/imglib/imglib2-ij/pull/23#issuecomment-431840590, or mute the thread https://github.com/notifications/unsubscribe-auth/AAl99jlc9U0UV4TS2rZ1Khxz_V1jliV6ks5uncxigaJpZM4XxJ_E.

tpietzsch commented 5 years ago

On 22. Oct 2018, at 15:47, Albert Cardona notifications@github.com wrote:

A point for which I need clarification: I was under the impression that only LazyCellImg was loaded lazily. I see that the CheckerboardLoader from [your example] (https://github.com/imglib/imglib2-cache-examples/blob/master/src/main/java/net/imglib2/cache/lowlevel/example03/Example03.java https://github.com/imglib/imglib2-cache-examples/blob/master/src/main/java/net/imglib2/cache/lowlevel/example03/Example03.java) implements a form of lazy loading, backed up by a cache.

To clarify: CachedCellImg extends LazyCellImg. LazyCellImg is the “hook” im imglib2 core for every CellImg variant that obtains its Cells through the Get interface (== Cell<A> get(long index)) As happens so often, naming is maybe not ideal...

CachedCellImg is a minimal extension in imglib2-cache and uses a Cache<Long, Cell<A>> as the LazyCellImg.Get. The variations (whether modified Cells are written to disk, whether new Cells are empty or loaded from somewhere, etc) are down to the specific Cache used (or more precisely the specific CacheLoader and CacheRemover).

tpietzsch commented 5 years ago

One possible solution for correct multi-threaded access to the proxies is to use, instead of synchronized getValue/setValue, a Map<Thread, RandomAccess>, and service each thread with its own. That adds an additional method call to get(Thread) the RandomAccess at each getValue/setValue call, which, while worse than no call, is better than having concurrent threads waiting on each other.

But the single-pixel granularity of getValue/setValue is perhaps the wrong place to setup the synchronization or the Map approach from above. Instead, wherever the proxy is requested, a new Proxy with its own RandomAccess should be returned, so that no thread collisions can occur. That though delegates the problem of concurrent reads and writes to the underlying RandomAccessibleInterval, which may or may not be an issue.

acardona commented 5 years ago

This was written by me, not by @tpietzsch : the email system screwed up. Not github's best day.

One possible solution for correct multi-threaded access to the proxies is to use, instead of synchronized getValue/setValue, a Map<Thread, RandomAccess>, and service each thread with its own. That adds an additional method call to get(Thread) the RandomAccess at each getValue/setValue call, which, while worse than no call, is better than having concurrent threads waiting on each other. But the single-pixel granularity of getValue/setValue is perhaps the wrong place to setup the synchronization or the Map approach from above. Instead, wherever the proxy is requested, a new Proxy with its own RandomAccess should be returned, so that no thread collisions can occur. That though delegates the problem of concurrent reads and writes to the underlying RandomAccessibleInterval, which may or may not be an issue.

tpietzsch commented 5 years ago

I have just pushed a new method Load.lazyStackCache that uses the ReadOnlyCachedCellImgFactory.createWithCacheLoader. It works but has issues that seem related to the dimensions of the cell, but these are exactly the same as before.

Would appreciate if @tpietzsch could have a look. For testing, the 3 .zip test images are here: https://www.dropbox.com/sh/gd1lnpzkxdjsh1a/AABmTNtQ1NoKuha_Rw9HOzzsa?dl=0

You need to give the cell dimensions as an option to the factory

ReadOnlyCachedCellImgOptions.options().cellDimensions(dimensions_cell)

And you forgot this line in the lazyStackCached method

dimensions_cell[ dimensions_cell.length - 1 ] = 1;
tpietzsch commented 5 years ago

@acardona Note that your lazyStackCached() will use the accessproxies instead of "unfolding" the PlanarImg into slices like the lazyStack() --> lazyPlanarImgStack() does. If you want to do the same, I would add in a second cache for the loaded Imgs. That would also solve the "Load the first cell twice" problem you mentioned in last commit message.

You can do for example:

        final UncheckedCache< Integer, Img< T > > imgs =
                new SoftRefLoaderCache< Integer, Img< T > >()
                        .withLoader( i -> loader.load( paths[ i ] ) )
                        .unchecked();
        final Img< T > first = imgs.get( 0 );

(and then

    return new Cell<>( dimensions_cell, min, extractDataAccess( imgs.get( index.intValue() ) ) );

in the CacheLoader to avoid that.

Then to handle PlanarImg, next step is to adjust cell dimensions to be only 2D slices, extract slice access in the CacheLoader, etc (similar to lazyPlanarImgStack()).

acardona commented 5 years ago

Just pushed a number of edits:

Unless there are any objections, the Load.lazyStack is ready to roll.

tpietzsch commented 5 years ago

@acardona Please leave merging to master to me (or one of the Stephans) in the future. There was still some things I would like to have added to this (there is some things required to make volatile accesses work). I know this can be slow process but I need/want to keep on top of things.