ome / omero-blitz

Gradle project containing Ice remoting code for OMERO
https://www.openmicroscopy.org/omero
GNU General Public License v2.0
0 stars 15 forks source link

verifyUpload time for filesets with very large number of files #73

Closed sbesson closed 1 month ago

sbesson commented 4 years ago

This scenario was most prominent in the case of an IDR submission with a 90 images fileset constituted of 65K individual TIFF files. However, I expect it starts happening in typical filesets of 10K files as typically generated by HCS acquisition systems - see https://idr.openmicroscopy.org/webclient/?show=plate-6451 for a good public representative example.

This plate was imported using in-place import and skipping all extra import steps (checksum, minmax thumbnails. The fs importtime command reports that 2h of the import are spent during the upload process:

[sbesson@test71-omeroreadwrite ~]$ /opt/omero/server/OMERO.server/bin/omero fs importtime Fileset:2567051
Using session for demo@localhost:4064. Idle timeout: 10 min. Current group: Public
   upload time of 7556.14s for 34945 files (0.216s/file)
    setId time of 168.06s
 metadata time of 260.26s

Looking at the details of the import log, the bottleneck happens in theverifyUpload phase. More precisely, the following query:

https://github.com/ome/omero-blitz/blob/3b131c980fc88e040ca7981479b16775fe731add/src/main/java/ome/services/blitz/repo/ManagedImportProcessI.java#L340-L341

is executed in a loop against each file in the fileset. For large filesets, the individual query time starts to become non-neglectable e.g. ~173ms independently of the checksum algorithm and the checksum validation step has a time complexity of O(n2):

2019-06-25 08:47:35,042 INFO  [                 org.perf4j.TimingLogger] (.Server-13) start[1561452455042] time[0] tag[omero.services.file.checksum]
2019-06-25 08:47:35,043 INFO  [                 org.perf4j.TimingLogger] (.Server-11) start[1561452455042] time[0] tag[omero.services.file.checksum]
2019-06-25 08:47:35,062 INFO  [                 org.perf4j.TimingLogger] (.Server-15) start[1561452455061] time[0] tag[omero.services.file.checksum]
2019-06-25 08:47:35,142 INFO  [                 org.perf4j.TimingLogger] (l.Server-9) start[1561452455141] time[0] tag[omero.services.file.checksum]
2019-06-25 08:47:35,372 INFO  [                 org.perf4j.TimingLogger] (.Server-16) start[1561452455174] time[198] tag[omero.import.process.checksum]
2019-06-25 08:47:35,551 INFO  [                 org.perf4j.TimingLogger] (.Server-16) start[1561452455372] time[179] tag[omero.import.process.checksum]
2019-06-25 08:47:35,724 INFO  [                 org.perf4j.TimingLogger] (.Server-16) start[1561452455551] time[172] tag[omero.import.process.checksum]
2019-06-25 08:47:35,899 INFO  [                 org.perf4j.TimingLogger] (.Server-16) start[1561452455724] time[174] tag[omero.import.process.checksum]
2019-06-25 08:47:36,072 INFO  [                 org.perf4j.TimingLogger] (.Server-16) start[1561452455899] time[173] tag[omero.import.process.checksum]
2019-06-25 08:47:36,246 INFO  [                 org.perf4j.TimingLogger] (.Server-16) start[1561452456072] time[173] tag[omero.import.process.checksum]

Initial attempts with @mtbc to reduce the import time by creating an index for the originalfile table did not solve the problem

CREATE INDEX originalfile_path_name_index ON originalfile ((path || name));
mtbc commented 4 years ago

Just to check: Was that fix confirmed not to help even for an import started after the new index was created (as opposed to an already-running import?).

sbesson commented 4 years ago

Pretty sure it didn't help but I can re-test a comparative import soon early next week to be sure (and report the metrics here).

mtbc commented 4 years ago

Great, much appreciated.

sbesson commented 4 years ago

I have been retesting using a Docker based OMERO 5.5 and the idr0061 plate

[sbesson@idr2-slot3 omero-blitz-73]$ cat docker-compose.yml 
---
version: "3"

services:
  db:
    image: "postgres:10"
    environment:
      - POSTGRES_USER=omero
      - POSTGRES_DB=omero
      - POSTGRES_PASSWORD=omero

  omero:
    image: openmicroscopy/omero-server:5.5
    environment:
      - CONFIG_omero_db_host=db
      - CONFIG_omero_db_user=omero
      - CONFIG_omero_db_pass=omero
      - CONFIG_omero_db_name=omero
      - ROOTPASS=omero
      - CONFIG_Ice_IPv6=0
    volumes:
      - "/uod/idr:/uod/idr:ro"

From a fresh DB, ran two imports with the default DB configuration:

bash-4.2$ grep "omero.import.process.checksum" /OMERO/ManagedRepository/root_0/2019-11/25/10-52-09.395.log | tail -n 5
2019-11-25 11:15:06,805 INFO  [                 org.perf4j.TimingLogger] (.Server-17) start[1574680506796] time[8] tag[omero.import.process.checksum]
2019-11-25 11:15:06,813 INFO  [                 org.perf4j.TimingLogger] (.Server-17) start[1574680506805] time[8] tag[omero.import.process.checksum]
2019-11-25 11:15:06,823 INFO  [                 org.perf4j.TimingLogger] (.Server-17) start[1574680506814] time[9] tag[omero.import.process.checksum]
2019-11-25 11:15:06,831 INFO  [                 org.perf4j.TimingLogger] (.Server-17) start[1574680506823] time[8] tag[omero.import.process.checksum]
2019-11-25 11:15:06,840 INFO  [                 org.perf4j.TimingLogger] (.Server-17) start[1574680506831] time[8] tag[omero.import.process.checksum]
bash-4.2$ grep "omero.import.process.checksum" /OMERO/ManagedRepository/root_0/2019-11/25/11-42-43.866.log | tail -n 5
2019-11-25 12:10:38,047 INFO  [                 org.perf4j.TimingLogger] (.Server-19) start[1574683838032] time[14] tag[omero.import.process.checksum]
2019-11-25 12:10:38,063 INFO  [                 org.perf4j.TimingLogger] (.Server-19) start[1574683838047] time[15] tag[omero.import.process.checksum]
2019-11-25 12:10:38,077 INFO  [                 org.perf4j.TimingLogger] (.Server-19) start[1574683838063] time[13] tag[omero.import.process.checksum]
2019-11-25 12:10:38,091 INFO  [                 org.perf4j.TimingLogger] (.Server-19) start[1574683838077] time[14] tag[omero.import.process.checksum]
2019-11-25 12:10:38,105 INFO  [                 org.perf4j.TimingLogger] (.Server-19) start[1574683838091] time[14] tag[omero.import.process.checksum]

After running CREATE INDEX originalfile_path_name_index ON originalfile ((path || name)); and restarting the database container, re-imported the fileset:

bash-4.2$ grep "omero.import.process.checksum" /OMERO/ManagedRepository/root_0/2019-11/25/15-44-12.952.log | tail -n 5
2019-11-25 16:03:18,647 INFO  [                 org.perf4j.TimingLogger] (.Server-28) start[1574697798644] time[3] tag[omero.import.process.checksum]
2019-11-25 16:03:18,649 INFO  [                 org.perf4j.TimingLogger] (.Server-28) start[1574697798647] time[2] tag[omero.import.process.checksum]
2019-11-25 16:03:18,651 INFO  [                 org.perf4j.TimingLogger] (.Server-28) start[1574697798649] time[1] tag[omero.import.process.checksum]
2019-11-25 16:03:18,654 INFO  [                 org.perf4j.TimingLogger] (.Server-28) start[1574697798651] time[3] tag[omero.import.process.checksum]
2019-11-25 16:03:18,656 INFO  [                 org.perf4j.TimingLogger] (.Server-28) start[1574697798654] time[2] tag[omero.import.process.checksum]

Unlike my previous attempts it looks like the index had some effects on the bottleneck checksum process. I will try a few additional imports to reproduce the metrics above. If reproducible, I might retest the same workflow in a real IDR environment (i.e. with an existing real-world database).

mtbc commented 4 years ago

Certainly encouraging: if it keeps helping then I'll provisionally add this to my list of DB changes I want to make next time we are breaking it.

sbesson commented 4 years ago

Retested earlier today

bash-4.2$ grep "omero.import.process.checksum"  /OMERO/ManagedRepository/root_0/2019-11/26/12-30-02.485.log |tail -n 5
2019-11-26 12:50:15,800 INFO  [                 org.perf4j.TimingLogger] (.Server-44) start[1574772615798] time[2] tag[omero.import.process.checksum]
2019-11-26 12:50:15,803 INFO  [                 org.perf4j.TimingLogger] (.Server-44) start[1574772615800] time[3] tag[omero.import.process.checksum]
2019-11-26 12:50:15,807 INFO  [                 org.perf4j.TimingLogger] (.Server-44) start[1574772615803] time[3] tag[omero.import.process.checksum]
2019-11-26 12:50:15,810 INFO  [                 org.perf4j.TimingLogger] (.Server-44) start[1574772615807] time[3] tag[omero.import.process.checksum]
2019-11-26 12:50:15,813 INFO  [                 org.perf4j.TimingLogger] (.Server-44) start[1574772615810] time[3] tag[omero.import.process.checksum]
mtbc commented 4 years ago

:+1: I'm hoping that the larger the fileset, the more benefit from the index.