Support partial fetches for s3 backend

pontus commented 1 year ago

As a service consumer/end user I want to be able to fetch parts of an object (file) using a range To be able to do "random I/O" in an object To support fetching just the needed intermediate/baseline images for what I want to see in a DICOM WSI

A/C: [ ] Server reports bytes in Accept-Ranges header [ ] Server (as binary) only retrieves the data necessary to satisfy the request [ ] Server correctly retrieves the correct data for a given Range header (this may be one or several ranges requested).

This likely has bearing on at least Crypt4GH.

The current main branch seems to have support for coordinates that I guess have a similar purpose, but I'm a bit unsure about that functionality, but I don't see any indication it allows for better handling on the backend. It also has the byte range in the URL rather than through standard HTTP range requests.

blankdots commented 1 year ago

small comment to this crypt4gh ranges were designed to work with file formats like bam, cram as the header of those files can contain informed that can be used to identify e.g. Chr 2. Does DICOM has this kind of header information?

pontus commented 1 year ago

It's complicated :(

It's a format that is based on bag/tag, and there is no (mandatory) simple lookup table. I don't know all the details, but have noted that one of the listed rationales for WSI DICOMs tiled pyramid approach is to support only loading the needed tile(s) for what's should be displayed.

The DICOM format is also designed to support doubling as TIFFs ("dual-personality"), I haven't checked how those will work either but they might use have more explicit references (although I doubt it).

In practice my impression is this need no be very bad although one may need to do a few touchdowns at the start.

erikogabrielsson commented 1 year ago

For all bigpicture wsi dicom files there should be a (e.g. we require) an offset table. As Pontus say, the format is based on tag--value attributes, and the offset table is either in the Extended Offset Table attribute or the first item in the Pixel Data Attribute.

In a file the attributes are ordered by tag number, and the Extended offset table and Pixel data is last (more or less), in the 74E0-group. For fast access one would want to be able to read the file up to the 7FE0-group to get all the metadata, and then use the offset tables to read the frames one needs.

erikogabrielsson commented 1 year ago

And for dual-Tiff, that is something that needs to be prepared when creating the file. This would help parsing out the positions of the tiles for applications that supports TIFF but not DICOM, but the tiff header would not contain all the metadata (unless one for example translates dicom tags to tiff tags or dumps the DICOM tags as json into a tiff tag). We have chosen not to include this in the bigpicture dicom wsi standard.

erikogabrielsson commented 1 year ago

But it would be possible to create a simple script that parses out the needed positions to a separate index file, if that would help?

pontus commented 1 year ago

Without speaking for Stefan, I believe this question was to assess how useful this support would be in practice (a backend that supports random access does not help much if the file format requires reading the entire file up to the desired data).

I don't see any need for additional tooling right now. It would be nice seeing if the planned services (e.g. Cytomine) reads file in a good way (in contrast to reading the entire file), but as that is still under wraps as far as I understand, I believe there's not much to do for now.

erikogabrielsson commented 1 year ago

Cytomine uses WsiDicom, which wants to read the file up to the 7FE0-group first, then load the offset table on first tile access, and then tiles. WsiDicom is designed to allow parsing other sources than DICOM-on-file, so it could be possible to make WsiDicom read the encrypted files directly.

blankdots commented 1 year ago

just to mention it here: http://samtools.github.io/hts-specs/crypt4gh.pdf we need to do what is mentioned in section 4.2.

meaning when we encrypt the file these packages need to be added

pontus commented 1 year ago

Cytomine uses WsiDicom

I didn't know that, and then I have questions :(

Integrating python code in Java and bridging between those is certainly possibly, but not exactly a great time, so my guess would be that such use is to convert/import into something else (which would make the utility of this feature much less valuable).

pontus commented 1 year ago

just to mention it here: http://samtools.github.io/hts-specs/crypt4gh.pdf we need to do what is mentioned in section 4.2.

meaning when we encrypt the file these packages need to be added

I don't believe I understand this - actual encryption would be done client side before uploading, and server side reencryption would only be header replacement which needs not care about edit lists.

Server-side decryption (to serve unencrypted) would need to support parsing edit lists (this is already the case I believe for the golang crypt4gh code). What I think the golang code is currently missing is supporting special handling of a ReadSeeker and not reading unnecessary contents if that's what's passed.

blankdots commented 1 year ago

What I think the golang code is currently missing is supporting special handling of a ReadSeeker and not reading unnecessary contents if that's what's passed.

yes

neicnordic / sda-download

Support partial fetches for s3 backend #252