ome / omero-blitz

Gradle project containing Ice remoting code for OMERO
https://www.openmicroscopy.org/omero
GNU General Public License v2.0

Improve checksum validation performance for large filesets #150

Closed sbesson closed 3 weeks ago

sbesson commented 1 month ago

Fixes #73

The current implementation of the checksum validation in ManagedImportProcessI.verifyUpload() retrieves the server hashes by executing one HQL query per OriginalFile object in the fileset. Filesets containing 1K-100K files are fairly common in the high-content screening domain, and with individual queries taking ~100-200ms, this can lead to multiple hours of checksum verification.

This commit updates the logic to use a single HQL query and create a hash map of all serverHashes for the fileset indexed by their originalfile path/name. This map is then used in the following loop to compare each hash to the client checksum.
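The map-then-compare approach described above can be sketched roughly as follows. This is an illustration only, not the code in ManagedImportProcessI: the `ServerFile` class stands in for the rows returned by the single fileset-wide query, and all class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChecksumIndexSketch {

    /** Hypothetical stand-in for one row of the single fileset-wide query result. */
    static final class ServerFile {
        final String path;
        final String name;
        final String hash;

        ServerFile(String path, String name, String hash) {
            this.path = path;
            this.name = name;
            this.hash = hash;
        }
    }

    /** Index all server hashes by path + name in one pass over the query result. */
    static Map<String, String> indexServerHashes(List<ServerFile> rows) {
        Map<String, String> serverHashes = new HashMap<>();
        for (ServerFile row : rows) {
            serverHashes.put(row.path + row.name, row.hash);
        }
        return serverHashes;
    }

    /** Return the keys whose client checksum is absent on the server or differs from it. */
    static List<String> findMismatches(Map<String, String> serverHashes,
                                       Map<String, String> clientHashes) {
        List<String> failing = new ArrayList<>();
        for (Map.Entry<String, String> client : clientHashes.entrySet()) {
            // In-memory lookup: no database round trip per file.
            String serverHash = serverHashes.get(client.getKey());
            if (!client.getValue().equals(serverHash)) {
                failing.add(client.getKey());
            }
        }
        return failing;
    }
}
```

The point of the design is that the per-file cost drops from a database query to a hash map lookup, so the loop itself no longer dominates the validation time.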

The stopwatch is also updated to measure the overall time of the checksum validation process as the individual server/client comparisons in the loop should now take under 1ms.

HCS acquisitions containing 1-10K files are good candidates for testing this change. Examples of representative public plates can be found under idr0006 and BBBC017.

Client-side, the checksum verification time corresponds to the time between the last FILE_UPLOAD_COMPLETE: statement and the FILESET_UPLOAD_END statement in the command-line importer logs. Post-import, the overall upload time can also be retrieved using omero fs importtime Fileset:<id>.

Without this PR, the validation time increases linearly with the number of files and is on the order of hours for very large filesets. With this PR, the process should typically take less than a second.
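To make the linear scaling concrete, here is a back-of-the-envelope estimate using only the figures quoted in this description (1K-100K files, ~100-200ms per query). It is a rough sketch that assumes the per-file queries run sequentially and ignores all other costs; the class and method names are hypothetical.

```java
import java.time.Duration;

final class ValidationTimeEstimate {

    /** Total time spent when every file in the fileset triggers its own query. */
    static Duration perFileQueries(long fileCount, long millisPerQuery) {
        return Duration.ofMillis(Math.multiplyExact(fileCount, millisPerQuery));
    }
}
```

Under these assumptions, `perFileQueries(1_000, 100)` gives 100 seconds, while `perFileQueries(100_000, 200)` gives 20,000 seconds, or roughly 5.5 hours.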

joshmoore commented 1 month ago

> this can lead to multiple hours of checksum verification

🤯 👏🏽

pwalczysko commented 1 month ago

With this PR (on merge-ci)

omero import --transfer=ln_s /uod/idr/repos/curated/metaxpress/public/idr0006/plate\ 11001_Plate_136/plate\ 11001.HTD
...
2024-07-17 16:50:36,103 973098     [3-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - FILE_UPLOAD_COMPLETE: /uod/idr/repos/curated/metaxpress/public/idr0006/plate 11001_Plate_136/TimePoint_1/plate 11001_H11_s9_w2.TIF
2024-07-17 16:50:36,899 973894     [2-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - FILESET_UPLOAD_END
2024-07-17 16:50:38,023 975018     [2-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - IMPORT_STARTED Logfile: 24508
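The gap between the FILE_UPLOAD_COMPLETE and FILESET_UPLOAD_END lines above can be computed from their timestamps. A minimal sketch, assuming the `yyyy-MM-dd HH:mm:ss,SSS` prefix seen in these log lines and that the FILE_UPLOAD_COMPLETE line shown is the last one; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class LogGap {

    // Timestamp layout assumed from the importer log lines in this comment.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Parse the 23-character timestamp prefix of a log line. */
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), STAMP);
    }

    /** Milliseconds elapsed between two log lines. */
    static long millisBetween(String earlierLine, String laterLine) {
        return Duration.between(timestampOf(earlierLine), timestampOf(laterLine)).toMillis();
    }
}
```

Applied to the two lines above (16:50:36,103 and 16:50:36,899), `millisBetween` returns 796, i.e. just under 0.8 seconds.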

Edit: as this timing is as advertised in the header of the PR ^^^ (less than a sec) and the import did not error, approving

user-3 https://merge-ci.openmicroscopy.org/web/webclient/?show=well-5820