Closed dgault closed 11 months ago
Updated ZarrReader on idr-testing to today's build: OMEZarrReader_p681_b603.jar
omero import -d 18153 --depth=100 /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr --file /tmp/import_20231116.log --err /tmp/import_20231116.err
...
2023-11-16 13:28:59,517 807861 [3-thread-1] INFO ormats.importer.cli.LoggingImportMonitor - FILE_UPLOAD_STARTED: /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/OME/METADATA.ome.xml
2023-11-16 13:28:59,750 808094 [3-thread-1] INFO ormats.importer.cli.LoggingImportMonitor - FILE_UPLOAD_COMPLETE: /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/OME/METADATA.ome.xml
2023-11-16 13:29:05,157 813501 [2-thread-1] INFO ormats.importer.cli.LoggingImportMonitor - FILESET_UPLOAD_END
2023-11-16 13:29:05,477 813821 [2-thread-1] INFO ormats.importer.cli.LoggingImportMonitor - IMPORT_STARTED Logfile: 61676957
2023-11-16 13:29:10,368 818712 [l.Client-0] INFO ormats.importer.cli.LoggingImportMonitor - METADATA_IMPORTED Step: 1 of 5 Logfile: 61676957
2023-11-16 13:29:18,622 826966 [l.Client-1] INFO ormats.importer.cli.LoggingImportMonitor - PIXELDATA_PROCESSED Step: 2 of 5 Logfile: 61676957
2023-11-16 13:29:22,497 830841 [l.Client-0] INFO ormats.importer.cli.LoggingImportMonitor - THUMBNAILS_GENERATED Step: 3 of 5 Logfile: 61676957
2023-11-16 13:29:22,542 830886 [l.Client-1] INFO ormats.importer.cli.LoggingImportMonitor - METADATA_PROCESSED Step: 4 of 5 Logfile: 61676957
2023-11-16 13:29:22,576 830920 [l.Client-0] INFO ormats.importer.cli.LoggingImportMonitor - OBJECTS_RETURNED Step: 5 of 5 Logfile: 61676957
2023-11-16 13:29:22,782 831126 [l.Client-1] INFO ormats.importer.cli.LoggingImportMonitor - IMPORT_DONE Imported file: /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/0/0/0/0/0/0/0
Other imported objects:
Fileset:6314846
==> Summary
444 files uploaded, 1 fileset created, 1 image imported, 0 errors in 0:13:35.110
In webclient, checking managed repo paths I see eg:
demo_2/Blitz-0-Ice.ThreadPool.Server-7/2023-11/16/13-15-52.390/a0e1b9c0-5c07-4600-b114-7b4697900e39.zarr/0/0/0/17/0/1/0
This removed the /bia-integrator-data/S-BIAD800/a0e1b9c0-5c07-4600-b114-7b4697900e39/
from the path as expected 👍 💯
@dgault @joshmoore @sbesson - What's the next steps towards getting this PR merged? I don't have any outstanding issues with this now. It's working fine on idr-testing for the mkngff data and @dominikl has been using it for importing idr0138 data (although I don't know which build he's been using - probably not the last commit?).
If we are ready to get this merged and released then it would be great to get that ticked off the TODO list.
It was the build from 13th Nov, if that helps. But I'll import a few images with the current build again.
@dgault @joshmoore @sbesson - What's the next steps towards getting this PR merged?
Based on the latest state of the conversation, it looks like no more changes are planned, is that correct? If so, any reason not to follow the standard approach i.e. assign a set of reviewers, get the PR approved, merge and release. My understanding is that between yourself and @dominikl, there has been a lot of functional review (and you should probably be reviewers). Do we also want to run some code review? Proposing we add this discussion to the upcoming planning meeting as I would also consider a ZarrReader tag as essential if we decide to spin up idr-next
/cc @francesw @jburel
Yeah I believe this should be ready for a final review now.
The one area that I can think of changing is the naming of the new options (https://github.com/ome/ZarrReader/pull/64/files#diff-71931eb54d3ee0b5a673044b8cdcdb8fa1bee03c583e3fc916e1b6b47a7dd7bbR93-R98). Currently as they have been added rather ad-hoc there is a mixture zarrreader and omezarr prefixes. So we probably want to devcide which we are keeping. Personally I would opt for OMEZarr as the preferred prefix.
Following the discussion today I have updated the options names to all use the omezarr
prefix. I have also changed the default behaviour to disable the performance improvements.
@will-moore, this will mean you will be required to add the following the the bfoptions file:
omezarr.quick_read=true
Updated the mkngff bfoptions
command at https://github.com/IDR/omero-mkngff/pull/13 and used this to recreate the .bfoptions
for all idr0004 NGFF data on idr0125-pilot.
Then manually removed the omezarr.quick_read=true
for idr0004 plates P115
and P124
that have different sized images.
Then updated the ZarrReader to today's build to include last commits above. Regenerated memo files for regular plate and P115
and memo files were regenerated and images viewable.
https://github.com/IDR/omero-mkngff/pull/13#issuecomment-1842256138
Looks good!
Performed a review of the latest state of the changes proposed here. I found no conceptual issue that would constitute a blocker but a few questions emerged as I was reading through the code:
- the signature of several private utility methods (
parseResolutionCount
,parseOmeroMetadata
,parseLabels
,parseImageLabels
) is updated to take the an attributes map as an extra input. For all these APIs, it looks like the first argumentroot
is no longer used by the API and could be completed removed from the signature? Since these are internal methods, this can be handled as a follow-up cleanup / patch release if necessary
Yeah that parameter is no longer needed and can be safely removed from those method signatures
- the
parsePlate
utility also takes anattr
map as an extra argument but this is now the first argument rather than the last, is this done on purpose?
No, there is no specific reason for that, it can safely be moved to be the last parameter
- the extra
openZarr
boolean added to the newsetSeries/setResolution
APIs conceptually looks similar toisThisType(String id, boolean open)
but I share some of the wider concerns in https://github.com/ome/ZarrReader/pull/64/files#r1350073451. Given the nature of the format, I would expect the reader to internalize the decision of whether to open attribute files or not to be able to fetch additional metadata while trying to keep the number of reads minimal. In other terms I think theomezarr.quick_read
is a requirement in order to have this used at all for IDR but I wonder whether the mid-term goal is to have this option removed completely and just let the reader do the right thing
I agree that the handling here is certainly not ideal and Im open to alternative suggestions. Unfortunately even in the quick read scenario we still need some of the calls to open the zarr and some to ignore it, so there has to be some logic handling that decision. The number of calls to setSeries
is extremely high and keeping it as bare bones as possible was a big part of the performance improvements. If there was a way to get the same performance gains without the extra method signature then that would be ideal.
When you say have the option removed, do you mean omezarr.quick_read
? That would likely be extremely difficult, as we see in instances such idr0004, if the data doesn't align with the assumptions that quick_read is making then the reader doesn't really have any way of verifying the correctness of the handling other than the original 'slow' read option.
- looking at the new options and their default value, my undertanding is that the two largest backwards-incompatible changes in the default behavior of the reader are that XML annotations for the JSON metadata are not created and label images are not treated as series anymore. Is that correct? I think both of these assumptions are completely acceptable to meet the IDR requirements but they should probably be communicated in the release notes to avoid surprises.
Yes that is correct, the handling of the labels is probably the biggest functional change and should definitely be covered in the release notes
When you say have the option removed, do you mean omezarr.quick_read? That would likely be extremely difficult, as we see in instances such idr0004, if the data doesn't align with the assumptions that quick_read is making then the reader doesn't really have any way of verifying the correctness of the handling other than the original 'slow' read option.
Yes that's what I was referring to. Definitely not a blocker given the IDR timeline but I think that post-release, it might be useful to come back to this use case and understand what makes idr0004
different from other Zarr datasets in the sense that it cannot use the assumptions of the omezarr.quick_read
option.
Should this alongside the API feedback be captured as issues and addressed in follow-up releases?
Opened an Issue to capture the feedback here: https://github.com/ome/ZarrReader/issues/68
Getting this merged as it seems everyone is happy with the testing and handling the API feedback as part of the follow up issue.
Initial performance improvements for ZarrReader related to https://github.com/ome/ZarrReader/issues/63
This PR is an initial PR for testing purposes, in its current state it looks to reduce the initFile time by about 50%, hopefully further improvements can be made.
Based on a subset of idr0036 the initial profiling looked as below,
The updated profiling should look more like this (the Zarrutils.toJSON should also be gone). If we can manage to remove some of the remaining JZarr calls then this may be reduced further:
The initial fixes here target the below areas: