google / tensorstore

Library for reading and writing large multi-dimensional arrays.
https://google.github.io/tensorstore/

Adding driver for Tiff images #37

Open sameeul opened 2 years ago

sameeul commented 2 years ago

Hello there. I am trying to see if I can add support for reading (and eventually writing) Tiff images using tensorstore. The motivation is to be able to read/write large OMETiff images. Initially I am following the png driver example to create an interface with the libtiff library. But, longer term, I would prefer to do the IO in a chunked fashion. Since Tiff has the concept of tiles (X and Y axes) and IFDs (Z axis and additional dimensions, at least according to the OME spec), I think chunked IO will work similarly to the zarr driver.

I am looking for some guidance on getting started. For example, if I have something like the following, how does the information about which file needs to be opened get to the driver?

  tensorstore::Context context = Context::Default();
  TENSORSTORE_CHECK_OK_AND_ASSIGN(
      auto store,
      tensorstore::Open({{"driver", "tiff"},
                         {"kvstore", {{"driver", "file"},
                                      {"path", "p01_x01_y01_wx0_wy0_c1.ome.tif"}}}},
                        context,
                        tensorstore::OpenMode::open,
                        tensorstore::RecheckCached{false},
                        tensorstore::ReadWriteMode::read).result());
jbms commented 2 years ago

A tiff driver would certainly be a welcome contribution.

The driver is responsible for parsing its own json spec --- therefore is ultimately responsible for e.g. parsing the kvstore member as a kvstore::Spec. However, for the existing image drivers there is an additional intermediate layer that helps with that by providing an abstraction for TensorStore drivers backed by a single file: https://github.com/google/tensorstore/blob/master/tensorstore/driver/image/driver_impl.h

As a starting point, I'd suggest basing a tiff driver on one of the existing image drivers.

Supporting chunked I/O as in the other chunked formats supported by tensorstore is possible but would require some implementation changes. A possible middle ground that would work for the file kvstore driver would be to modify the kvstore driver to open large files using mmap, such that the returned Cord just references the memory-mapped file content. Then the operating system would take care of faulting in pages as required, without needing to do anything special in the tiff driver.
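The mmap middle ground described above can be sketched in a few lines. This is an illustrative Python sketch of the idea, not tensorstore code (the file kvstore driver is C++): map the file once, slice out only the byte range a read needs, and let the OS fault in pages on demand.

```python
import mmap
import os
import tempfile

def read_range_mmap(path, offset, length):
    """Return `length` bytes at `offset` without reading the whole file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the pages backing this slice are faulted in by the OS.
        return bytes(mm[offset:offset + length])

# Demo on a small temporary file standing in for a large TIFF.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 100)
chunk = read_range_mmap(tmp.name, 10, 5)
os.unlink(tmp.name)
assert chunk == b"01234"
```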

jbms commented 2 years ago

To clarify, in regards to chunked I/O, are you talking about reading just portions of a single tiff file, or are you talking about a volume that is represented by a grid of individual tiff files?

sameeul commented 2 years ago

For the chunked IO, it is a single tiff file, but it may contain multiple Image File Directories (IFDs). Each IFD can contain a 2D grid of tiles, and each tile's data can be accessed using the LoadTileFromFile(...) interface of the libtiff library.

Thanks for the explanation. One of the use cases is to read parts of large OMETiff images (without loading the whole image into memory).

jbms commented 2 years ago

In regards to libtiff specifically, I think one challenge is that for a given TIFF object, you can only read a single tile at a time, because the TIFF object is not thread safe. In order to allow reading multiple tiles in parallel, you would need to somehow create multiple TIFF objects.
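The "multiple TIFF objects" workaround amounts to a handle pool. Here is a minimal Python sketch of that pattern (the handles and `open_handle`/`read_fn` callables are stand-ins for `TIFFOpen` and a tile-read call; none of this is libtiff or tensorstore API): each concurrent reader checks a handle out, uses it exclusively, and returns it.

```python
import queue
import threading

class HandlePool:
    """Hand out one of N independently opened handles per tile read."""
    def __init__(self, open_handle, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(open_handle())

    def read_tile(self, read_fn, *tile_index):
        h = self._pool.get()          # blocks if all handles are in use
        try:
            return read_fn(h, *tile_index)
        finally:
            self._pool.put(h)         # handle becomes available again

# Demo with dummy handles: 4 threads safely share 2 "handles".
pool = HandlePool(open_handle=lambda: object(), size=2)
results = []
def worker(i):
    results.append(pool.read_tile(lambda h, idx: idx * idx, i))
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == [0, 1, 4, 9]
```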

sameeul commented 2 years ago

So, some updates here. I was able to mimic the other image drivers and read a whole tiff image. I found that the current image drivers only support rank 3 and uint8_t data. I did an ugly hack to get around that for test purposes. However, as I mentioned earlier, I am interested in reading the data in chunks, since in the actual use case OMETiff files can be huge and users will probably be interested in only a portion of the data at a given time. I understand that TIFF objects are not thread safe. Given that, will the chunk drivers allow the flexibility to have multiple TIFF objects open and use them on an as-needed basis?

jbms commented 2 years ago

Great that you at least got it to work --- the rank 3 and uint8 limitation was intended to be relaxed as soon as another image format, like tiff, would require something different.

TensorStore's architecture would definitely support using multiple TIFF objects --- you would want to build the driver on top of ChunkCache. The issue with multiple TIFF objects is just that you probably have to manage a pool of them, and find a way to initialize the additional TIFF objects as efficiently as possible (e.g. without having to actually re-read the header data).

sameeul commented 2 years ago

Some more updates: I am at a point where I mimicked n5 and zarr implementation and created ometiff driver. For this driver, I implemented MetadataCache, DataCache, OmeTiffDriver and OmeTiffDriverSpec class. I also created necessary scaffolding similar to zarr and n5.

The code compiles w/o any issue but I get a linker error while trying to build //tensorstore/driver/ometiff:driver. I get the following error message:

/usr/bin/ld.gold: error: bazel-out/k8-fastbuild/bin/tensorstore/driver/ometiff/_objs/driver/driver.pic.o: requires dynamic R_X86_64_PC32 reloc against '_ZTVN11tensorstore16internal_ometiff12_GLOBAL__N_117OmeTiffDriverSpecE' which may overflow at runtime; recompile with -fPIC

I do not see the other drivers using the -fPIC flag. This could be due to my unfamiliarity with the bazel build system. Am I missing something obvious? What should I look for to locate the source of the issue?

jbms commented 2 years ago

That's impressive progress, and it would be great to incorporate that upstream once you have it working, if you are open to contributing. Regarding the link error, I am not sure what the cause is. Normally flags like that should not need to be specified directly. If you post a link to your code and describe your build environment and bazel command I can look into it though.

sameeul commented 2 years ago

Hello again, I had to take a break from this work to take care of a few other priorities. Now I am restarting it. My attempt at getting an ometiff driver working is here: https://github.com/sameeul/tensorstore/tree/ometiff_driver . Under the tensorstore/driver/ometiff directory, I added the necessary scaffolding. I got the linker error resolved, but I am still struggling with a lot of implementation details and with understanding all of the design concepts behind the driver. I am fairly new to modern C++. I am not sure if any of you have some time to meet with me to give me some more concrete direction to get me going. Thanks :-)

laramiel commented 2 years ago

I have submitted a small "tiff" driver to read tiff images. It does not use OMETIFF metadata, so it is restricted to essentially a single frame of a tiff file, viewed as a (y, x, c) volume.

You can see the example in driver/stack/image_stack_test, for example, to construct a more complex image based volume. For now these are read-only.

Example simple spec:

spec={
     "driver":"tiff",
     "kvstore":"file:///home/data/myfile.tiff",
     "domain": { "labels": ["y", "x", "c"] }
   }

You can also look at examples/extract_slice.cc for ideas.

sameeul commented 2 years ago

Hi there, thanks for looking into it. I have also made some progress, and here is where I am currently stuck. I will try to explain my issue below; please feel free to ask follow-up questions. We are really interested in contributing this work to tensorstore.

For OMETIFF, the convention is to use tiled tiff data, and users typically request only a small portion of the data, consisting of a single tile (or part of a single tile) or multiple tiles. I am trying to use the concept of a tile as analogous to a chunk in tensorstore.

In the chunked version, I see that both zarr and n5 use GetChunkStorageKey to get the actual file under the root directory. Then ChunkCache::Read reads that file and puts it in a buffer. Later, DecodeChunk takes that buffer and does the necessary manipulation to get the actual n-dimensional data from that chunk.

In my case, since there is only a single file involved, I return the actual tiff file name from GetChunkStorageKey. Then, inside DecodeChunk, I am given the whole file in the buffer. There are two issues with that. First, for any chunk, the whole file is read (defeating the purpose of chunked reading, which will become an issue when reading really large tiff files). Second, I lose the chunk indices after GetChunkStorageKey. This prevents me from reading a particular tile from the file (the tile indices will be the same as the chunk indices in my case).

So, what I want is for the input buffer inside DecodeChunk to hold only the data for a single tile (the corresponding chunk). I understand that for zarr and n5 we read a single chunk file, so the input buffer contains only that chunk's data. How can we achieve something similar for a tiled tiff file, even though we only have a single file containing all the chunks?
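What makes per-tile reads possible in principle is that a tiled TIFF IFD carries TileOffsets and TileByteCounts tags, giving each tile's byte range in the file. A Python sketch of that addressing (the offsets, counts, and image/tile sizes below are made-up demo values, and real code would parse them out of the IFD):

```python
import io
import math

def tile_index(tile_y, tile_x, image_width, tile_width):
    """Row-major index of a tile within an IFD's tile grid."""
    tiles_across = math.ceil(image_width / tile_width)
    return tile_y * tiles_across + tile_x

def read_tile_bytes(f, tile_offsets, tile_byte_counts, idx):
    """Seek straight to one tile's (possibly compressed) bytes."""
    f.seek(tile_offsets[idx])
    return f.read(tile_byte_counts[idx])

# Fake "file": three tiles of 4 bytes each, laid out back to back.
data = b"AAAABBBBCCCC"
offsets, counts = [0, 4, 8], [4, 4, 4]
f = io.BytesIO(data)
idx = tile_index(tile_y=1, tile_x=0, image_width=512, tile_width=256)
assert idx == 2
assert read_tile_bytes(f, offsets, counts, idx) == b"CCCC"
```

With this, a chunk read only touches one tile's byte range rather than the whole file.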

Thanks!

jbms commented 2 years ago

The zarr, n5, and neuroglancer_precomputed drivers are based on an intermediate kvs_backed_chunk_driver. For this you would instead probably need to use ChunkCache directly, since kvs_backed_chunk_driver is not appropriate for this use case.

sameeul commented 2 years ago

Thanks for the comment. I have a solution that works for now (not sure if it is the optimal one). What I understood from the other drivers and key-value stores is that I need a new key-value store that works on a single file but can access only chunks of the file (tiles, in our case). I got that working. Now, I am trying to understand the caching mechanism. For example, if I access a chunk and extract the data for a particular Read request, and then another Read requires the same chunk, how do we reuse the already cached chunk?

I see that file_key_value_store.cc has some sort of staleness check around line 355. Do I need something similar?
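A single-file, tile-addressed kvstore needs some key scheme that carries the tile coordinates through to the read path. A minimal Python sketch of one possible encoding (the `ifd/y/x` key format here is purely an assumption for illustration, not anything tensorstore defines):

```python
def encode_tile_key(ifd, tile_y, tile_x):
    """Pack IFD index and tile grid position into a kvstore-style key."""
    return f"{ifd}/{tile_y}/{tile_x}"

def decode_tile_key(key):
    """Recover (ifd, tile_y, tile_x) so the store can seek to that tile."""
    ifd, ty, tx = (int(p) for p in key.split("/"))
    return ifd, ty, tx

key = encode_tile_key(ifd=3, tile_y=1, tile_x=7)
assert key == "3/1/7"
assert decode_tile_key(key) == (3, 1, 7)
```

The point is that the tile indices survive the round trip, which is exactly what was lost after GetChunkStorageKey in the earlier approach.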

laramiel commented 1 year ago

It might be instructive to look at the VirtualChunked driver. In fact, as you get things to work, that may provide a reasonable framework for experimentation.

A driver provides a schema--that is, the dimensions, data-type, data-source, etc. along with the data. Data acquisition is built around a cache, usually something like tensorstore::internal::ChunkCache. The Cache::Entry is created before the data is acquired, but it has some location/topology information. Concretely, ChunkCache::Entry has the cell_indices which the driver can then use to acquire the actual underlying data object. So for this to work with OMETIFF, you'd want something like that--some way to identify the TIFF tile that is being loaded (which might include the directory & tile index, for example).

Once a chunk is allocated in the cache, the Cache::Entry::DoRead method will be called. That method typically dispatches the read to the underlying kvstore to asynchronously acquire the data & version information. For example, coming back to the VirtualChunked driver, the DoRead method allocates an array and then calls into the user-provided function to populate the data.
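The entry/DoRead flow described above can be caricatured in a few lines of Python. This is only a toy model of the shape of the API (ChunkCache itself is C++, and the real DoRead is asynchronous): the entry knows its cell_indices before any data exists, and the read callback uses them to fetch and cache the underlying bytes.

```python
class Entry:
    """Toy cache entry: location info first, data fetched lazily."""
    def __init__(self, cell_indices, read_fn):
        self.cell_indices = cell_indices  # known before any I/O happens
        self._read_fn = read_fn           # stand-in for the kvstore read
        self.data = None

    def do_read(self):
        if self.data is None:             # fetch once, then serve from cache
            self.data = self._read_fn(self.cell_indices)
        return self.data

entry = Entry((2, 1), read_fn=lambda idx: f"tile@{idx}")
assert entry.do_read() == "tile@(2, 1)"
assert entry.do_read() == "tile@(2, 1)"   # second read is served from cache
```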

1kaiser commented 1 year ago

This would be important for processing satellite images, as they come in GeoTIFF format. An example using gdal_calc.py for two satellite image bands (band 4 and band 6):

from osgeo import gdal

Creating the NDSI file:

    !gdal_calc.py \
      --overwrite \
      --type=Float32 \
      -A {image_dir}{pathb4} \
      --A_band 1 \
      -B {image_dir}{pathb6} \
      --B_band 1 \
      --outfile={temp_dir}"NDSI_result.tif" \
      --calc="(A.astype(float) - B)/(A.astype(float) + B)"
sameeul commented 1 year ago

I opened a PR with OmeTiff read support. I also plan to support write functionality. I would like to get feedback on the PR. I mimicked the n5 driver unit tests, and there are some tests that I could not make pass. I am guessing there are some obvious basic things that I am missing, and the maintainers' experienced eyes can find them. The following Python code now works.

import tensorstore as ts

dataset_future = ts.open({ 
                    'driver'    : 'ometiff',
                    'kvstore'   : { 
                        'driver' : 'tiled_tiff',
                        'path' : '/some_file.ome.tif',
                    },

                    'context': {
                        'cache_pool': {
                            'total_bytes_limit': 100_000_000
                        }
                    },
                })

dataset = dataset_future.result()
print(dataset)
data = dataset[0,0,0,0:1024,0:1024].read()
joelyancey commented 1 year ago

> I opened a PR with OmeTiff read support. I have plan to support the write functionality also. I would like to get feedback on the PR.

This would be very useful. I will be following the progress, thanks guys.

mkitti commented 4 months ago

One can turn a tiled TIFF file into a Zarr v3 shard by extracting the tile offsets and sizes and appending them to the file. That would at least provide read access.
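Per the sharding-indexed codec spec linked below, the shard index is a sequence of little-endian uint64 (offset, nbytes) pairs, one per chunk. A sketch of building such an index from TIFF tile offsets/byte counts (demo values; the real codec's index chain also includes a crc32c checksum over the index, omitted here):

```python
import struct

def build_shard_index(tile_offsets, tile_byte_counts):
    """Pack per-tile (offset, nbytes) as little-endian uint64 pairs."""
    return b"".join(
        struct.pack("<QQ", off, n)
        for off, n in zip(tile_offsets, tile_byte_counts)
    )

index = build_shard_index([0, 4, 8], [4, 4, 4])
assert len(index) == 3 * 16              # 16 bytes per chunk entry

# A reader looks up chunk i at byte offset i * 16 in the index.
off, n = struct.unpack_from("<QQ", index, 1 * 16)
assert (off, n) == (4, 4)
```

Appending this index to the TIFF file would let a Zarr v3 sharding-aware reader locate each tile directly, as the comment above suggests.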

https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html