Import fails for large CIF files

lahr-ul commented 3 years ago

We want to import CIF files of varying size into OMERO with the importer. This works seamlessly for small files (e.g. 16MB) but not for large files (e.g. 192MB). Importing a small file takes about 20 seconds, whereas importing a large file times out after several hours (user session and/or Hibernate session). The issue can be reproduced on a production system and in a local Docker Compose setup.

Here is an excerpt of the log:

2021-06-17 13:09:54,223 INFO  [    ome.formats.OMEROMetadataStoreClient] (1-thread-2) Handling # of references: 40000
2021-06-17 13:09:54,225 INFO  [    ome.security.basic.BasicEventContext] (.Server-78)  cctx:    group=3
2021-06-17 13:09:54,227 INFO  [         ome.security.basic.EventHandler] (.Server-78)  Auth:    user=53,group=3,event=null(User),sess=5b5cab00-7a3f-4883-8d85-925f8c12abb0
2021-06-17 13:09:54,240 INFO  [                 org.perf4j.TimingLogger] (.Server-78) start[1623935394224] time[15] tag[omero.call.success.ome.services.blitz.impl.MetadataStoreI$5.doWork]
2021-06-17 13:09:54,240 INFO  [    ome.formats.OMEROMetadataStoreClient] (1-thread-2) Starting referenceBatch #2

There are multiple "Starting referenceBatch #" statements in the log; the importer hangs at "importing metadata", and after several hours there is a timeout.

We also tried to change some configuration values without success:

# Database 
omero.db.poolsize=100
# Memory
omero.jvmcfg.strategy=percent
omero.jvmcfg.max_system_memory=64000
omero.jvmcfg.percent.blitz=40
omero.jvmcfg.percent.indexer=10
omero.jvmcfg.percent.pixeldata=30
omero.jvmcfg.percent.repository=10
omero.jvmcfg.heap_size=8000
omero.pixeldata.threads=10
omero.threads.min_threads=10

sbesson commented 3 years ago

Hi @lahr-ul we have been facing a similar issue while trying to load CIF files in the context of an IDR submission.

My suspicion is that the problem is related to the number of objects in the file, which can easily reach several 10K in this cytometry format. Do you know how many individual images (Bio-Formats series) are contained in the file?
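
One way to get the series count without opening the file in a viewer is to run the Bio-Formats showinf command-line tool with -nopix and read the count from its output. The snippet below is a minimal sketch, assuming showinf is on the PATH and that it reports a line of the form "Series count = N"; the helper names are made up for illustration.

```python
import re
import subprocess
from typing import Optional


def parse_series_count(showinf_output: str) -> Optional[int]:
    """Extract the series count from showinf text output.

    Assumes a line like "Series count = 70000"; returns None if no
    such line is present.
    """
    match = re.search(r"Series count\s*=\s*(\d+)", showinf_output)
    return int(match.group(1)) if match else None


def count_series(path: str, showinf: str = "showinf") -> Optional[int]:
    """Run showinf without reading pixel data and parse the series count."""
    result = subprocess.run(
        [showinf, "-nopix", path],
        capture_output=True,
        text=True,
        check=True,
    )
    # Depending on logging setup the line may land on stdout or stderr,
    # so both streams are searched.
    return parse_series_count(result.stdout + "\n" + result.stderr)
```

Calling count_series("sample.cif") (hypothetical file name) would then return the number of Bio-Formats series, or None if the expected line is missing.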

lahr-ul commented 3 years ago

About 70K. Sometimes smaller files with about 60MB also fail.

sbesson commented 3 years ago

Sorry for dropping the ball. Understood. The large number of images (>10K) is most likely why the metadata import hangs: a huge number of objects has to be inserted into the database (typically ~10 per image, so we are talking about roughly 1M row insertions).
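
To make the order of magnitude concrete, here is a back-of-the-envelope sketch. The figure of ~10 objects per image is the rough typical value quoted above, not a measured one, and the function name is purely illustrative.

```python
def estimate_row_insertions(image_count: int, objects_per_image: int = 10) -> int:
    """Rough estimate of database rows inserted for one fileset.

    objects_per_image defaults to the ~10 objects per image mentioned
    in this thread; the real value depends on the metadata in the file.
    """
    return image_count * objects_per_image


# 70K is the series count reported for the failing file in this thread.
for images in (10_000, 70_000, 100_000):
    print(f"{images} images -> ~{estimate_row_insertions(images)} rows")
```

For the reported ~70K images this gives about 700K rows, which is the "roughly 1M" scale referred to above.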

We dealt with very similar scalability issues in the case of high-content screening datasets, which have a similar number of images, in the 1-100K range. There, the database bottlenecks were mitigated by a series of optimizations such as collapsing some of the elements, e.g. https://github.com/ome/openmicroscopy/pull/3261.

The only immediate workaround I can think of would be to export the CIF series as individual images, e.g. using bfconvert or bioformats2raw, and import those images individually. To be able to import these filesets natively, I suspect we need to identify which elements are duplicated and, if possible, reduce them.
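
As a rough illustration of the bfconvert route, the sketch below builds a bfconvert call that writes one OME-TIFF per series by putting the %s series-index placeholder in the output file name. It assumes bfconvert is on the PATH; the helper names, the output naming scheme and the choice of OME-TIFF are illustrative assumptions, and this has not been tested on a 70K-series CIF file.

```python
import subprocess
from pathlib import Path


def build_bfconvert_command(cif_path, out_dir, bfconvert="bfconvert"):
    """Build a bfconvert command that writes one file per series.

    bfconvert replaces "%s" in the output name with the series index,
    so each series ends up in its own OME-TIFF.
    """
    cif = Path(cif_path)
    pattern = Path(out_dir) / f"{cif.stem}_s%s.ome.tiff"
    return [bfconvert, str(cif), str(pattern)]


def split_cif(cif_path, out_dir):
    """Create the output directory and run the conversion."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_bfconvert_command(cif_path, out_dir), check=True)
```

After something like split_cif("sample.cif", "sample_split") (hypothetical names), the files in the output directory can be imported as separate images, so each import only carries the metadata of a single series.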

ome / omero-blitz

Import fails for large CIF files #118