ome / bioformats

Bio-Formats is a Java library for reading and writing data in life sciences image file formats. It is developed by the Open Microscopy Environment. Bio-Formats is released under the GNU General Public License (GPL); commercial licenses are available from Glencoe Software.
https://www.openmicroscopy.org/bio-formats
GNU General Public License v2.0
370 stars 241 forks source link

Allow processing multiple files sequentially in a single run, with file names read from stdin. #4200

Open tstoeter opened 1 week ago

tstoeter commented 1 week ago

This reduces JVM startup overhead when processing thousands of files. If -- is passed as file argument then file names are read line by line from stdin and processed sequentially. Usage information for reading file names from stdin are added with the second commit.

Code compiles with ant clean jars tools, compilation with Maven not tested. All tests using ant test have passed.

melissalinkert commented 1 week ago

Thanks, @tstoeter. Could you give a sense of the performance improvement you observe with and without this change?

I'm also curious what the use case is for running showinf on many files at once. I can understand where this feature would be useful for conversion (i.e. in bfconvert), but less clear on how it applies to just displaying metadata/images. Are you parsing the output of showinf in some way?

tstoeter commented 1 week ago

Quick tests showed a 5x speedup for processing and printing metadata of 200 DICOM files (without displaying the images).

Our use-case is reading out metadata of many files in order to build relationships between them, based on their metadata. So yes, the metadata output will then be process further. We do not display the images.

Certainly, the code can also be quickly adopted by bfconvert for speeding up batch processing converting images, too.

melissalinkert commented 1 week ago

I do see a noticeable difference in a simple test without this change:

$ bfconvert "test&sizeX=10&sizeY=10&sizeZ=1000.fake" test_z%z.tiff
$ time for i in `ls *.tiff`; do showinf -no-upgrade -nopix $i; done

and with this change (on the same data):

$ time ls *.tiff | showinf -no-upgrade -nopix --

I'd be fine with including this in CI builds, but defer to @ome/formats on timing.

Note that before merging this pull request, we will need to have a signed Contributor License Agreement. @tstoeter, please submit the CLA according to the instructions here: https://ome-contributing.readthedocs.io/en/latest/cla.html

Quick tests showed a 5x speedup for processing and printing metadata of 200 DICOM files (without displaying the images).

Our use-case is reading out metadata of many files in order to build relationships between them, based on their metadata. So yes, the metadata output will then be process further. We do not display the images.

Interesting. Outside the scope of this pull request, note that showinf may not print every DICOM tag in a file. All tags can be accessed via the initialized DICOM reader, though, which may be a more reliable way to compare metadata across files.

tstoeter commented 1 week ago

Thanks a lot @melissalinkert for the additional tests and pointing out the limitations of showinf regarding DICOM tags. I have signed and submitted the CLA and would be happy if my PR got merged.

joshmoore commented 1 week ago

All tags can be accessed via the initialized DICOM reader, though, which may be a more reliable way to compare metadata across files.

Thanks, @melissalinkert. Outside the scope of this PR, but I wonder if as a parallel to the ability to add metadata from #4016 there could be a flag to include the additional metadata in an OME-XML annotation.

tstoeter commented 1 week ago

Thank you @sbesson. I intended to contrast the -- functionality from the other options, but using - instead works just as well.