ZarrReader performance improvements

dgault commented 1 year ago

Initial performance improvements for ZarrReader related to https://github.com/ome/ZarrReader/issues/63

This PR is an initial PR for testing purposes, in its current state it looks to reduce the initFile time by about 50%, hopefully further improvements can be made.

Based on a subset of idr0036 the initial profiling looked as below,

The updated profiling should look more like this (the Zarrutils.toJSON should also be gone). If we can manage to remove some of the remaining JZarr calls then this may be reduced further:

The initial fixes here target the below areas:

Reduce the unnecessary parsing and storage of attribute files in XMLAnnotations. This should also reduce the size of the memo file. An option has been added to allow this to be configurable if needs be.
Reduce the number of calls JZarr to open the .zarray. These calls were taking a large portion of the initialization time, mostly occurring on setSeries. I have tried to reduce this down but it is atill required for retrieving the shape of each array to populate the dimension sizes.
Reduce the number of calls JZarr to fetch group attributes. These calls were slow and as such I have tried to reduce the level of duplication. Hopefully I can get this down further still.

will-moore commented 11 months ago

Updated ZarrReader on idr-testing to today's build: OMEZarrReader_p681_b603.jar

omero import -d 18153 --depth=100 /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr --file /tmp/import_20231116.log --err /tmp/import_20231116.err

...
2023-11-16 13:28:59,517 807861     [3-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - FILE_UPLOAD_STARTED: /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/OME/METADATA.ome.xml
2023-11-16 13:28:59,750 808094     [3-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - FILE_UPLOAD_COMPLETE: /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/OME/METADATA.ome.xml
2023-11-16 13:29:05,157 813501     [2-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - FILESET_UPLOAD_END
2023-11-16 13:29:05,477 813821     [2-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - IMPORT_STARTED Logfile: 61676957
2023-11-16 13:29:10,368 818712     [l.Client-0] INFO   ormats.importer.cli.LoggingImportMonitor - METADATA_IMPORTED Step: 1 of 5  Logfile: 61676957
2023-11-16 13:29:18,622 826966     [l.Client-1] INFO   ormats.importer.cli.LoggingImportMonitor - PIXELDATA_PROCESSED Step: 2 of 5  Logfile: 61676957
2023-11-16 13:29:22,497 830841     [l.Client-0] INFO   ormats.importer.cli.LoggingImportMonitor - THUMBNAILS_GENERATED Step: 3 of 5  Logfile: 61676957
2023-11-16 13:29:22,542 830886     [l.Client-1] INFO   ormats.importer.cli.LoggingImportMonitor - METADATA_PROCESSED Step: 4 of 5  Logfile: 61676957
2023-11-16 13:29:22,576 830920     [l.Client-0] INFO   ormats.importer.cli.LoggingImportMonitor - OBJECTS_RETURNED Step: 5 of 5  Logfile: 61676957
2023-11-16 13:29:22,782 831126     [l.Client-1] INFO   ormats.importer.cli.LoggingImportMonitor - IMPORT_DONE Imported file: /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/0/0/0/0/0/0/0
Other imported objects:
Fileset:6314846

==> Summary
444 files uploaded, 1 fileset created, 1 image imported, 0 errors in 0:13:35.110

In webclient, checking managed repo paths I see eg:

demo_2/Blitz-0-Ice.ThreadPool.Server-7/2023-11/16/13-15-52.390/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/0/0/0/17/0/1/0

This removed the /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/ from the path as expected 👍 💯

will-moore commented 11 months ago

@dgault @joshmoore @sbesson - What's the next steps towards getting this PR merged? I don't have any outstanding issues with this now. It's working fine on idr-testing for the mkngff data and @dominikl has been using it for importing idr0138 data (although I don't know which build he's been using - probably not the last commit?).

If we are ready to get this merged and released then it would be great to get that ticked off the TODO list.

dominikl commented 11 months ago

It was the build from 13th Nov, if that helps. But I'll import a few images with the current build again.

sbesson commented 11 months ago

@dgault @joshmoore @sbesson - What's the next steps towards getting this PR merged?

Based on the latest state of the conversation, it looks like no more changes are planned, is that correct? If so, any reason not to follow the standard approach i.e. assign a set of reviewers, get the PR approved, merge and release. My understanding is that between yourself and @dominikl, there has been a lot of functional review (and you should probably be reviewers). Do we also want to run some code review? Proposing we add this discussion to the upcoming planning meeting as I would also consider a ZarrReader tag as essential if we decide to spin up idr-next /cc @francesw @jburel

dgault commented 11 months ago

Yeah I believe this should be ready for a final review now.

The one area that I can think of changing is the naming of the new options (https://github.com/ome/ZarrReader/pull/64/files#diff-71931eb54d3ee0b5a673044b8cdcdb8fa1bee03c583e3fc916e1b6b47a7dd7bbR93-R98). Currently as they have been added rather ad-hoc there is a mixture zarrreader and omezarr prefixes. So we probably want to devcide which we are keeping. Personally I would opt for OMEZarr as the preferred prefix.

dgault commented 11 months ago

Following the discussion today I have updated the options names to all use the omezarr prefix. I have also changed the default behaviour to disable the performance improvements.

@will-moore, this will mean you will be required to add the following the the bfoptions file:

omezarr.quick_read=true

will-moore commented 11 months ago

Updated the mkngff bfoptions command at https://github.com/IDR/omero-mkngff/pull/13 and used this to recreate the .bfoptions for all idr0004 NGFF data on idr0125-pilot. Then manually removed the omezarr.quick_read=true for idr0004 plates P115 and P124 that have different sized images.

Then updated the ZarrReader to today's build to include last commits above. Regenerated memo files for regular plate and P115 and memo files were regenerated and images viewable. https://github.com/IDR/omero-mkngff/pull/13#issuecomment-1842256138

Looks good!

dgault commented 11 months ago

Performed a review of the latest state of the changes proposed here. I found no conceptual issue that would constitute a blocker but a few questions emerged as I was reading through the code:

the signature of several private utility methods (parseResolutionCount, parseOmeroMetadata, parseLabels, parseImageLabels) is updated to take the an attributes map as an extra input. For all these APIs, it looks like the first argument root is no longer used by the API and could be completed removed from the signature? Since these are internal methods, this can be handled as a follow-up cleanup / patch release if necessary

Yeah that parameter is no longer needed and can be safely removed from those method signatures

the parsePlate utility also takes an attr map as an extra argument but this is now the first argument rather than the last, is this done on purpose?

No, there is no specific reason for that, it can safely be moved to be the last parameter

the extra openZarr boolean added to the new setSeries/setResolution APIs conceptually looks similar to isThisType(String id, boolean open) but I share some of the wider concerns in https://github.com/ome/ZarrReader/pull/64/files#r1350073451. Given the nature of the format, I would expect the reader to internalize the decision of whether to open attribute files or not to be able to fetch additional metadata while trying to keep the number of reads minimal. In other terms I think the omezarr.quick_read is a requirement in order to have this used at all for IDR but I wonder whether the mid-term goal is to have this option removed completely and just let the reader do the right thing

I agree that the handling here is certainly not ideal and Im open to alternative suggestions. Unfortunately even in the quick read scenario we still need some of the calls to open the zarr and some to ignore it, so there has to be some logic handling that decision. The number of calls to setSeries is extremely high and keeping it as bare bones as possible was a big part of the performance improvements. If there was a way to get the same performance gains without the extra method signature then that would be ideal.

When you say have the option removed, do you mean omezarr.quick_read? That would likely be extremely difficult, as we see in instances such idr0004, if the data doesn't align with the assumptions that quick_read is making then the reader doesn't really have any way of verifying the correctness of the handling other than the original 'slow' read option.

looking at the new options and their default value, my undertanding is that the two largest backwards-incompatible changes in the default behavior of the reader are that XML annotations for the JSON metadata are not created and label images are not treated as series anymore. Is that correct? I think both of these assumptions are completely acceptable to meet the IDR requirements but they should probably be communicated in the release notes to avoid surprises.

Yes that is correct, the handling of the labels is probably the biggest functional change and should definitely be covered in the release notes

sbesson commented 11 months ago

When you say have the option removed, do you mean omezarr.quick_read? That would likely be extremely difficult, as we see in instances such idr0004, if the data doesn't align with the assumptions that quick_read is making then the reader doesn't really have any way of verifying the correctness of the handling other than the original 'slow' read option.

Yes that's what I was referring to. Definitely not a blocker given the IDR timeline but I think that post-release, it might be useful to come back to this use case and understand what makes idr0004 different from other Zarr datasets in the sense that it cannot use the assumptions of the omezarr.quick_read option.

Should this alongside the API feedback be captured as issues and addressed in follow-up releases?

dgault commented 11 months ago

Opened an Issue to capture the feedback here: https://github.com/ome/ZarrReader/issues/68

dgault commented 11 months ago

Getting this merged as it seems everyone is happy with the testing and handling the API feedback as part of the follow up issue.

ome / ZarrReader

ZarrReader performance improvements #64