ome / bioformats

Bio-Formats is a Java library for reading and writing data in life sciences image file formats. It is developed by the Open Microscopy Environment. Bio-Formats is released under the GNU General Public License (GPL); commercial licenses are available from Glencoe Software.
https://www.openmicroscopy.org/bio-formats
GNU General Public License v2.0

CZI: Reading individual files as part of a multi-file series #3854

Open saramcardle opened 2 years ago

saramcardle commented 2 years ago

If a folder contains two individual image files with names like 'X.czi' and 'X (1).czi', Bio-Formats can incorrectly read them as parts of a single multi-file dataset. Windows often adds the "(1)" suffix when a user copies a file into a folder that already contains a file of that name. As a result, only the data/metadata from X.czi is read, even if X (1).czi is requested.

See here for more information: https://forum.image.sc/t/matlab-bfgetreader-error-with-parentheses-in-file-name/70210

There should be a more robust way of checking whether a file is part of a multi-file dataset, one that is less reliant on the naming convention.

dgault commented 2 years ago

Hi @saramcardle, thank you for opening the issue and reporting this problem along with the image.sc thread. The CZI reader in Bio-Formats expects multi-file datasets to follow a specific naming convention (outlined below), which unfortunately matches exactly the suffix that Windows adds to a copied file. In this scenario it may be possible for the Bio-Formats reader to add some additional sanity checks for the filesets.

    // check if we have the master file in a multi-file dataset
    // file names are not stored in the files; we have to rely on a
    // specific naming convention:
    //
    //  master_file.czi
    //  master_file (1).czi
    //  master_file (2).czi
    //  ...
    //
    // the number of files is also not stored, so we have to manually check
    // for all files with a matching name 
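
To make the collision concrete, here is a minimal sketch of a name-only check in the spirit of the comment above. It is not the actual reader code: the class name `CziNamingSketch`, the method `impliedMaster`, and the exact regular expression are assumptions made for illustration, and only a lowercase `.czi` extension is handled.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the Bio-Formats source.
class CziNamingSketch {
    // Matches "<base> (<n>).czi": a base name, one space, digits in parentheses,
    // then the lowercase ".czi" extension.
    private static final Pattern NUMBERED =
        Pattern.compile("^(.+) \\((\\d+)\\)\\.czi$");

    // Returns the master file name that a name-only check would infer,
    // or empty if the name does not look like a numbered companion file.
    static Optional<String> impliedMaster(String fileName) {
        Matcher m = NUMBERED.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + ".czi");
    }

    public static void main(String[] args) {
        String[] names = {"X.czi", "X (1).czi", "master_file (2).czi", "X_copy.czi"};
        for (String name : names) {
            System.out.println(name + " -> "
                + impliedMaster(name).orElse("(no implied master)"));
        }
    }
}
```

Because the check looks only at the name, a Windows-made copy called `X (1).czi` is indistinguishable from a genuine second file of a dataset whose master is `X.czi`.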

imagesc-bot commented 2 years ago

This issue has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/matlab-bfgetreader-error-with-parentheses-in-file-name/70210/5

melissalinkert commented 1 year ago

Some multi-file datasets do have a SplitScenesInSeparateFiles field, but unfortunately this can't reliably be used as it isn't present in all multi-file datasets (e.g. our existing QA 7271 has it, but QA 10301 does not). https://github.com/melissalinkert/bioformats/commit/c7894ee25840d91614429967e6484ce8b6c65d7b was an attempt at improving multi-file detection. It does not work as-is, but might be a place to continue the investigation.
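
As a rough sketch of why the field alone is not enough, the decision can be modelled as three-valued. This is not what the linked commit does. The names `MultiFileCheckSketch`, `Verdict`, and `classify` are hypothetical, and treating a present-but-false field as "single file" is an assumption rather than something confirmed here.

```java
import java.util.Optional;

// Hypothetical illustration only; not taken from the Bio-Formats source.
class MultiFileCheckSketch {
    enum Verdict { MULTI_FILE, SINGLE_FILE, UNDETERMINED }

    // splitScenesField: value of SplitScenesInSeparateFiles if the metadata
    //   contains it, otherwise empty (as with QA 10301).
    // nameMatchesConvention: whether sibling files named "name (n).czi" exist.
    static Verdict classify(Optional<Boolean> splitScenesField,
                            boolean nameMatchesConvention) {
        if (splitScenesField.isPresent()) {
            // Assumption: when the field is present, its value can be trusted.
            return splitScenesField.get() ? Verdict.MULTI_FILE : Verdict.SINGLE_FILE;
        }
        // Field absent: the file name is the only remaining hint, and it cannot
        // tell a real multi-file dataset from an unrelated Windows-made copy.
        return nameMatchesConvention ? Verdict.UNDETERMINED : Verdict.SINGLE_FILE;
    }

    public static void main(String[] args) {
        System.out.println(classify(Optional.of(true), true));
        System.out.println(classify(Optional.empty(), true));
        System.out.println(classify(Optional.empty(), false));
    }
}
```

The `UNDETERMINED` branch is the case that still needs some other signal from the file contents.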

Note that Zen is not confused by an artificial dataset that has abc.czi and an unrelated abc (1).czi in the same directory.