ome / openmicroscopy

OME (Open Microscopy Environment) develops open-source software and data format standards for the storage and manipulation of biological light microscopy data. A joint project between universities, research establishments and industry in Europe and the USA, OME has over 20 active researchers with strong links to the microscopy community. Funded by private and public research grants, OME has been a major force on the international microscopy stage since 2000.
https://www.openmicroscopy.org/omero
GNU General Public License v2.0
191 stars 100 forks

Remove Pixels name, repo and path columns #6380

Open sbesson opened 3 months ago

sbesson commented 3 months ago

Background

The OMERO 4.2.0 release in July 2010 altered the database schema to add name, path and repo columns to the Pixels table, with a meaning similar to that of the corresponding columns in the OriginalFile table. This change was part of the initial work to add native file format support to OMERO via Bio-Formats, also known as FS lite. For a subset of file formats (primarily single-file formats with large XY dimensions), the original file was uploaded to the binary repository and linked from the Pixels object. This allowed the server to perform certain operations, including the generation of OMERO pyramids.

Full native file format support in OMERO, also known as OMERO.fs, arrived in OMERO 5.0.0 in February 2014 with the introduction of the Fileset table linked to the Image. Each Fileset row is linked to an ordered set of FilesetEntry rows, each of which is associated with a single OriginalFile entry. This change effectively superseded the FS lite concept, allowing native support for single-file and multi-file formats as well as multi-image formats. In OMERO 5.1, the series column was also added to the Image table to store the mapping between an image and the underlying Bio-Formats series.
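The relationships described above can be sketched as a small in-memory model. This is an illustration only, not the OMERO API: the class names mirror the tables, but the attribute and method names (`entries`, `first_file`, `client_path`) and the example paths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OriginalFile:
    path: str  # directory part of the file location
    name: str  # file name


@dataclass
class FilesetEntry:
    client_path: str             # location of the file on the importing client
    original_file: OriginalFile  # exactly one OriginalFile per entry


@dataclass
class Fileset:
    # The order of the entries is significant.
    entries: List[FilesetEntry] = field(default_factory=list)

    def first_file(self) -> OriginalFile:
        """Return the OriginalFile linked from the first FilesetEntry."""
        if not self.entries:
            raise ValueError("fileset has no entries")
        return self.entries[0].original_file


@dataclass
class Image:
    name: str
    fileset: Fileset
    series: int  # index of the Bio-Formats series backing this image


# A hypothetical multi-file, multi-image fileset shared by two images.
fileset = Fileset(entries=[
    FilesetEntry("/data/plate/plate.companion.ome",
                 OriginalFile("repo/plate/", "plate.companion.ome")),
    FilesetEntry("/data/plate/A1.tif",
                 OriginalFile("repo/plate/", "A1.tif")),
])
images = [Image(f"plate [series {i}]", fileset, series=i) for i in range(2)]

first = fileset.first_file()
print(first.path + first.name)
```

In this sketch, every image of a multi-image fileset points at the same Fileset and differs only by its `series` value, which is why a per-image Pixels.path/Pixels.name pair is redundant once the first FilesetEntry is reliable.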

Current API

Although OMERO 5 deprecated their usage, the Pixels.name, Pixels.path and Pixels.repo columns are still heavily used server-side as of OMERO 5.6.x:

Challenges

The current logic is problematic for several reasons:

As an additional, related complication, a historical bug has been reported on the image.sc forum in which OriginalFile rows are incorrectly linked to FilesetEntry rows for some multi-file filesets.

Proposal

  1. Fix the ordering of the FilesetEntry so that it can be used as the single source of truth
     a. Fix the mapping between OriginalFile and FilesetEntry at import time - see https://github.com/ome/omero-blitz/pull/148
     b. Create an upgrade script to fix all existing FilesetEntry/OriginalFile links in existing OMERO databases
     c. Review the API and technical documentation of Bio-Formats and OMERO and, if needed, clarify and enforce that the first file in IFormatReader.getUsedFiles, the output of ImportCandidates and the first FilesetEntry all refer to the file that should be passed to IFormatReader.setId
     d. Optionally, create an upgrade script to convert FS lite imports into Fileset
  2. Remove all legacy FS lite API and use the Fileset API consistently
     a. Create a new version of the OMERO database schema dropping the Pixels.name, Pixels.path and Pixels.repo columns
     b. Update all the server APIs to use the OriginalFile from the first FilesetEntry as the source of truth

/cc @joshmoore @jburel @kkoz @chris-allan @will-moore @dominikl @Tom-TBT

will-moore commented 3 months ago

When looking into the usage of Pixels path/name in IDR, I think I checked a bunch of Filesets in IDR to see if the first File matched the path/name in Pixels and found that there were differences in many cases. Unfortunately I can't remember where I documented that... Also I don't know if Bio-Formats would have behaved differently if the different path/name was used in setId().

will-moore commented 3 months ago

Ah - I found it: https://github.com/IDR/idr-metadata/issues/660#issuecomment-1570145684

So, testing 1 image from each study in IDR (to get a good mix of formats etc) it looks like there were 22 studies where the Pixels path/name didn't match the first OriginalFile from the Fileset.
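A check of this kind can be sketched as follows. This is a minimal illustration, not the script used for IDR: the `ImageRecord` tuple, its field names and the sample values are hypothetical, and a real check would read these values from the Pixels, FilesetEntry and OriginalFile tables.

```python
import posixpath
from typing import Iterable, List, NamedTuple


class ImageRecord(NamedTuple):
    study: str            # hypothetical study label
    pixels_path: str      # Pixels.path
    pixels_name: str      # Pixels.name
    first_file_path: str  # path of the first OriginalFile in the Fileset
    first_file_name: str  # name of the first OriginalFile in the Fileset


def full_path(directory: str, name: str) -> str:
    """Join and normalise so that 'a/b' and 'a/b/' compare equal."""
    return posixpath.normpath(posixpath.join(directory, name))


def mismatched_studies(records: Iterable[ImageRecord]) -> List[str]:
    """Return the studies whose Pixels path/name differs from the first file."""
    mismatches = set()
    for r in records:
        pixels = full_path(r.pixels_path, r.pixels_name)
        first = full_path(r.first_file_path, r.first_file_name)
        if pixels != first:
            mismatches.add(r.study)
    return sorted(mismatches)


# Made-up sample: one image per study.
records = [
    ImageRecord("study-A", "repo/a/", "img.tif", "repo/a", "img.tif"),
    ImageRecord("study-B", "repo/b/", "plate.xml", "repo/b/well1/", "field1.tif"),
]
print(mismatched_studies(records))
```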

sbesson commented 3 months ago

Thanks @will-moore. Looking at your list of mismatches, all of these examples are multi-file and multi-folder file formats, primarily but not exclusively HCS. Also, from a quick search using the IDR UI, it seems that the first FilesetEntry.clientPath matches Pixels.path and Pixels.name. Both of these observations are consistent with my expectations based on preliminary investigation.

> Also I don't know if Bio-Formats would have behaved differently if the different path/name was used in setId().

Unfortunately, the answer here is "it depends". In the worst-case scenario, Bio-Formats would throw an UnknownFormatException on setId.

As a next step, my plan is to write a pre-check SQL script that iterates through all the Filesets in a database and tries to match the first FilesetEntry with any of the OriginalFile rows using filesetentry.clientPath and originalfile.{path,name}. We should be able to run this script against the IDR database and other OMERO databases to get a sense of whether we can fix these links in an authoritative manner.
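The matching logic of such a pre-check could look roughly like the sketch below, written in Python rather than SQL for readability. The matching rule is an assumption, not something settled in this issue: it assumes clientPath is a full client-side path with forward slashes, requires the file name to match, and breaks ties by the number of trailing directory components shared with originalfile.path. The function names and sample paths are hypothetical.

```python
import posixpath
from typing import List, Optional, Tuple

# An OriginalFile candidate as a (path, name) pair.
Candidate = Tuple[str, str]


def shared_suffix_length(client_path: str, candidate: Candidate) -> int:
    """Count the trailing path components shared by clientPath and path + name."""
    client_parts = [p for p in client_path.split("/") if p]
    full = posixpath.join(candidate[0], candidate[1])
    file_parts = [p for p in full.split("/") if p]
    count = 0
    for a, b in zip(reversed(client_parts), reversed(file_parts)):
        if a != b:
            break
        count += 1
    return count


def match_first_entry(client_path: str,
                      candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the unique best match, or None when nothing matches or there is a tie."""
    scored = [(shared_suffix_length(client_path, c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s > 0]  # the file name must match
    if not scored:
        return None
    best = max(s for s, _ in scored)
    winners = [c for s, c in scored if s == best]
    # A tie means the link cannot be fixed authoritatively from paths alone.
    return winners[0] if len(winners) == 1 else None


candidates = [
    ("user_2/plate/run1/", "img.tif"),
    ("user_2/plate/run2/", "img.tif"),
    ("user_2/plate/run1/", "other.tif"),
]
print(match_first_entry("/Users/me/plate/run1/img.tif", candidates))
```

Returning None on a tie keeps the pre-check conservative: filesets for which it cannot pick a single OriginalFile would be reported for manual review rather than silently relinked.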