ome / omero-cli-zarr

https://pypi.org/project/omero-cli-zarr/
GNU General Public License v2.0

Fix --bf option for bioformats2raw 0.3.0 #75

Closed sbesson closed 3 years ago

sbesson commented 3 years ago

The breaking changes in bioformats2raw 0.3.0 have broken omero zarr export --bf. These commits provide the minimal set of changes needed to restore the command's functionality. This currently only supports bioformats2raw 0.3.0 and drops support for 0.2.x, but see below for the discussion on the flag support:

Tested on pilot-zarr1-dev with omero --debug DEBUG zarr export --bf Image:13422206.

This initial work raises a few high-level questions about the future of this option. As a general rule, most of the support and testing has gone into the OMERO-only export workflow. As a result, most of the recent features, such as the rendering settings in the Zarr metadata and HCS support, do not apply to the --bf option. Additionally, the layout of the OMERO export differs from that of the bioformats2raw export, since the latter works at the fileset level rather than the image level.

I can conceive two general export strategies:

sbesson commented 3 years ago

As an initial data point for discussion, together with https://github.com/IDR/deployment/pull/343 and https://github.com/openmicroscopy/management_tools/pull/1458, I tested this PR against https://idr.openmicroscopy.org/webclient/img_detail/8343617, i.e. a fairly large pixel volume with minimal metadata overhead:

(zarr) [sbesson@pilot-zarr1-dev data]$ time omero --debug DEBUG zarr export --bf Image:8343617
Using session for public@idr.openmicroscopy.org:4064. Idle timeout: 10 min. Current group: Public
bioformats2raw /nfs/bioimage/drop/idr0066-voigt-mesospim/20190821-ftp/ExperimentD/data/ExpD_chicken_embryo_stitched.ome.tif /data/ExpD_chicken_embryo_stitched.ome.tif
OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp11206441239058845902/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

Image exported to /data/ExpD_chicken_embryo_stitched.ome.tif

real    25m18.067s
user    53m21.899s
sys     2m22.052s
(zarr) [sbesson@pilot-zarr1-dev data]$ omero login public@idr.openmicroscopy.org -w public
Previous session expired for public on idr.openmicroscopy.org:4064
Created session for public@idr.openmicroscopy.org:4064. Idle timeout: 10 min. Current group: Public
(zarr) [sbesson@pilot-zarr1-dev data]$ time omero --debug DEBUG zarr export Image:8343617
Using session for public@idr.openmicroscopy.org:4064. Idle timeout: 10 min. Current group: Public
Exporting to 8343617.zarr (0.2)
Finished.

real    59m24.492s
user    42m19.202s
sys     6m6.100s

Using --bf gives a roughly two-fold improvement in conversion speed (about 25 minutes versus 59 minutes of wall-clock time). Interestingly, the generated sizes differ:

(base) [sbesson@pilot-zarr1-dev ~]$ du -csh /data/8343617.zarr/
52G /data/8343617.zarr/
52G total
(base) [sbesson@pilot-zarr1-dev ~]$ du -csh /data/ExpD_chicken_embryo_stitched.ome.tif/
69G /data/ExpD_chicken_embryo_stitched.ome.tif/
69G total

Looking at the internal structure, the most noticeable difference is the chunk size. bioformats2raw uses 1024x1024 chunks, while omero zarr export seems to use the full plane dimensions by default:

(base) [sbesson@pilot-zarr1-dev ~]$ diff -wu /data/ExpD_chicken_embryo_stitched.ome.tif_default/0/0/.zarray /data/8343617.zarr/0/.zarray
--- /data/ExpD_chicken_embryo_stitched.ome.tif_default/0/0/.zarray  2021-08-12 12:39:20.579408467 +0000
+++ /data/8343617.zarr/0/.zarray    2021-08-12 13:19:01.616102479 +0000
@@ -3,17 +3,18 @@
     1,
     1,
     1,
-    1024,
-    1024
+        4491,
+        3540
   ],
   "compressor" : {
-    "clevel" : 5,
     "blocksize" : 0,
-    "shuffle" : 1,
+        "clevel": 5,
     "cname" : "lz4",
-    "id" : "blosc"
+        "id": "blosc",
+        "shuffle": 1
   },
-  "dtype" : ">u2",
+    "dimension_separator": "/",
+    "dtype": "<u2",
   "fill_value" : 0,
   "filters" : null,
   "order" : "C",
@@ -24,6 +25,5 @@
     4491,
     3540
   ],
-  "zarr_format" : 2,
-  "dimension_separator" : "/"
+    "zarr_format": 2
 }
\ No newline at end of file
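
Because the two writers format their JSON differently, most of the textual diff above is whitespace and key-order noise. A minimal sketch, assuming only the Python standard library, that compares two `.zarray` files as parsed JSON and counts how many chunks cover one plane. The helper names (`load_zarray`, `diff_metadata`, `chunks_per_plane`) are illustrative and not part of omero-cli-zarr or bioformats2raw:

```python
import json
import math
from pathlib import Path

MISSING = "<missing>"  # marker for a key present in only one of the files


def load_zarray(path):
    """Parse a Zarr v2 .zarray file into a plain dict."""
    return json.loads(Path(path).read_text())


def diff_metadata(a, b):
    """Return {key: (a_value, b_value)} for every top-level key that differs.

    Nested objects such as "compressor" are compared as whole dicts, so key
    order and whitespace are ignored.
    """
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k, MISSING), b.get(k, MISSING))
        for k in keys
        if a.get(k, MISSING) != b.get(k, MISSING)
    }


def chunks_per_plane(plane_shape, chunk_shape):
    """Number of chunks needed to cover one 2D plane."""
    return math.prod(math.ceil(s / c) for s, c in zip(plane_shape, chunk_shape))


if __name__ == "__main__":
    # Paths as used in the diff command above.
    bf2raw = load_zarray("/data/ExpD_chicken_embryo_stitched.ome.tif_default/0/0/.zarray")
    omero = load_zarray("/data/8343617.zarr/0/.zarray")
    for key, (left, right) in diff_metadata(bf2raw, omero).items():
        print(f"{key}: {left} -> {right}")
    # Plane size and chunk size taken from the diff above.
    print(chunks_per_plane((4491, 3540), (1024, 1024)))
```

Applied to the values shown in the diff, the only semantic differences are `chunks` and `dtype`, and a 4491x3540 plane is split into 20 chunks of 1024x1024 versus a single full-plane chunk.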

sbesson commented 3 years ago

Adjusting the export command to use the same chunk size as the OMERO export reduces the execution time further:

(zarr) [sbesson@pilot-zarr1-dev data]$ time omero --debug DEBUG zarr export --bf --tile_width=3540 --tile_height=4491 Image:8343617
Using session for public@idr.openmicroscopy.org:4064. Idle timeout: 10 min. Current group: Public
bioformats2raw /nfs/bioimage/drop/idr0066-voigt-mesospim/20190821-ftp/ExperimentD/data/ExpD_chicken_embryo_stitched.ome.tif /data/ExpD_chicken_embryo_stitched.ome.tif --tile_width=3540 --tile_height=4491
OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp10586280222332712237/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

Image exported to /data/ExpD_chicken_embryo_stitched.ome.tif

real    16m29.448s
user    44m52.741s
sys     2m45.422s
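
For reference, the three wall-clock times reported in this thread can be put side by side with a few lines of Python. This is only a sketch of the arithmetic: `parse_real` is an illustrative helper, and the labels are my own shorthand for the three runs.

```python
import re


def parse_real(value):
    """Convert a `time` duration such as '25m18.067s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# "real" times copied from the transcripts in this thread.
runs = {
    "omero zarr export": "59m24.492s",
    "--bf, default tiles": "25m18.067s",
    "--bf, full-plane tiles": "16m29.448s",
}

baseline = parse_real(runs["omero zarr export"])
for label, real in runs.items():
    seconds = parse_real(real)
    print(f"{label}: {seconds:.0f} s ({baseline / seconds:.2f}x vs OMERO-only export)")
```

This works out to roughly 2.3x for the default bioformats2raw tiles and roughly 3.6x with full-plane tiles, relative to the OMERO-only export.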

Interestingly, the size of the bioformats2raw output is essentially unchanged and remains larger than the OMERO export (70G vs 52G).

(zarr) [sbesson@pilot-zarr1-dev data]$ diff -urw 8343617.zarr/0/.zarray ExpD_chicken_embryo_stitched.ome.tif/0/0/.zarray 
--- 8343617.zarr/0/.zarray  2021-08-12 13:19:01.616102479 +0000
+++ ExpD_chicken_embryo_stitched.ome.tif/0/0/.zarray    2021-08-12 15:58:58.383689254 +0000
@@ -7,14 +7,13 @@
         3540
     ],
     "compressor": {
-        "blocksize": 0,
         "clevel": 5,
+    "blocksize" : 0,
+    "shuffle" : 1,
         "cname": "lz4",
-        "id": "blosc",
-        "shuffle": 1
+    "id" : "blosc"
     },
-    "dimension_separator": "/",
-    "dtype": "<u2",
+  "dtype" : ">u2",
     "fill_value": 0,
     "filters": null,
     "order": "C",
@@ -25,5 +24,6 @@
         4491,
         3540
     ],
-    "zarr_format": 2
+  "zarr_format" : 2,
+  "dimension_separator" : "/"
 }
\ No newline at end of file

Possibly a difference in compression efficiency between the Java and Python implementations?

sbesson commented 3 years ago

Tested the last commit with successive exports:

(zarr) [sbesson@pilot-zarr1-dev data]$ omero zarr export --bf Image:9842129
Using session for public@idr.openmicroscopy.org:4064. Idle timeout: 10 min. Current group: Public
OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp15643632484399810721/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

Image exported to /data/2017-07-24-Brain-01_01_R3D.dv
(zarr) [sbesson@pilot-zarr1-dev data]$ omero zarr export --bf Image:9842129
Using session for public@idr.openmicroscopy.org:4064. Idle timeout: 10 min. Current group: Public
OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp5507427887558727498/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Exception in thread "main" picocli.CommandLine$ExecutionException: Error while calling command (com.glencoesoftware.bioformats2raw.Converter@5e2c3d18): java.lang.IllegalArgumentException: Output path /data/2017-07-24-Brain-01_01_R3D.dv already exists
    at picocli.CommandLine.executeUserObject(CommandLine.java:1962)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2172)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:2550)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:2485)
    at picocli.CommandLine.call(CommandLine.java:2761)
    at com.glencoesoftware.bioformats2raw.Converter.main(Converter.java:1756)
Caused by: java.lang.IllegalArgumentException: Output path /data/2017-07-24-Brain-01_01_R3D.dv already exists
    at com.glencoesoftware.bioformats2raw.Converter.call(Converter.java:451)
    at com.glencoesoftware.bioformats2raw.Converter.call(Converter.java:92)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    ... 9 more

(zarr) [sbesson@pilot-zarr1-dev data]$ rm -rf 2017-07-24-Brain-01_01_R3D.dv
(zarr) [sbesson@pilot-zarr1-dev data]$ omero zarr export --bf Image:9842129
Using session for public@idr.openmicroscopy.org:4064. Idle timeout: 10 min. Current group: Public
OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp4557707144645268562/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

Image exported to /data/2017-07-24-Brain-01_01_R3D.dv
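
The second run above shows bioformats2raw refusing to write to an output path that already exists. A hypothetical pre-flight check on the Python side could surface the same condition before the Java process is launched. This is a sketch only: `prepare_output` and its `overwrite` parameter are illustrative names, not something this PR or omero-cli-zarr is known to provide.

```python
import shutil
from pathlib import Path


def prepare_output(target, overwrite=False):
    """Make sure `target` is free before an external converter writes to it.

    Hypothetical helper: raises FileExistsError if the path exists, unless
    overwrite=True, in which case the existing file or directory tree is
    removed first.
    """
    target = Path(target)
    if target.exists():
        if not overwrite:
            raise FileExistsError(f"Output path {target} already exists")
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
    return target
```

With `overwrite=False` the caller gets a short Python error instead of the Java stack trace; with `overwrite=True` the helper performs the equivalent of the manual `rm -rf` step shown above.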

joshmoore commented 3 years ago

:+1:

sbesson commented 3 years ago

From a quick discussion with @joshmoore, I propose getting this merged and then, as the next step, looking into unifying the Zarr layouts using bioformats2raw and the series index.

Given that the --series feature is only available in bioformats2raw 0.3.0, I suspect there will soon be almost no value in maintaining backwards compatibility with older versions of bioformats2raw, and we should only support bioformats2raw 0.3 or later.
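
If the minimum version were enforced in code, the check itself could be as small as the sketch below. It assumes the installed version is already available as a string; how that string is obtained from bioformats2raw is deliberately left out, and `parse_version`, `is_supported` and `MINIMUM` are illustrative names.

```python
import re

MINIMUM = (0, 3, 0)  # first bioformats2raw release supported under this proposal


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '0.3.0'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # A missing patch component (e.g. '0.3') is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_supported(text, minimum=MINIMUM):
    """True if the version in `text` is at least `minimum`."""
    return parse_version(text) >= minimum
```

Comparing integer tuples rather than strings keeps a future 0.10.x ordered after 0.3.x.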