ome / design

OME Design proposals
http://ome.github.io/design/

Bio-Formats discoverability #42

Open sbesson opened 8 years ago

sbesson commented 8 years ago

This issue starts defining the Bio-Formats API that allows readers to be discoverable. For more background and history, read the following GitHub Pull Request.

Status

The canonical and recommended usage of Bio-Formats for most clients is to create an ImageReader and initialize it from a given file using setId(String):

ImageReader r = new ImageReader();  // loci.formats.ImageReader
r.setId(file);  // file is the path of the image to open

Internally, the ImageReader contains a list of all available Bio-Formats readers. For each call to setId(), this list of reader classes is sequentially scanned to identify the image file type and the reader to be used for opening it.
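
The sequential scan described above can be sketched roughly as follows. This is a simplified illustration and not the Bio-Formats implementation: `SketchReader`, `SuffixReader` and `FirstMatchReader` are hypothetical names, and matching on the file suffix alone stands in for the real file type detection.

```java
import java.util.List;

// Hypothetical, simplified stand-in for a format reader.
interface SketchReader {
    boolean isThisType(String file);
    String name();
}

// Toy reader that recognises files by suffix only (an assumption for brevity).
final class SuffixReader implements SketchReader {
    private final String name;
    private final String suffix;

    SuffixReader(String name, String suffix) {
        this.name = name;
        this.suffix = suffix;
    }

    @Override
    public boolean isThisType(String file) {
        return file.toLowerCase().endsWith(suffix);
    }

    @Override
    public String name() {
        return name;
    }
}

// Walks the ordered reader list and returns the first reader that claims the file.
final class FirstMatchReader {
    private final List<SketchReader> readers;

    FirstMatchReader(List<SketchReader> readers) {
        this.readers = readers;
    }

    SketchReader findReader(String file) {
        for (SketchReader r : readers) {
            if (r.isThisType(file)) {
                return r;
            }
        }
        throw new IllegalArgumentException("Unknown file format: " + file);
    }
}
```

Because the first matching reader wins, the order of the list decides which reader opens a file that several readers could claim.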

Until Bio-Formats 5.1.10, the internal logic for generating the list of available format readers has been to rely on a configuration text file readers.txt defining an ordered list of reader classes. When creating an ImageReader, a list of reader classes is then created using ClassList.
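
As a rough illustration of how an ordered text list can be turned into a list of classes, consider the sketch below. It assumes one fully qualified class name per line, with `#` starting a comment; the actual readers.txt syntax and the behaviour of ClassList may differ. `OrderedClassListSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class OrderedClassListSketch {
    // Converts configuration lines into classes, preserving the order of the lines.
    // Assumption: '#' starts a comment and blank lines are ignored.
    static <T> List<Class<? extends T>> parse(List<String> lines, Class<T> base) {
        List<Class<? extends T>> classes = new ArrayList<>();
        for (String raw : lines) {
            int hash = raw.indexOf('#');
            String name = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (name.isEmpty()) {
                continue;
            }
            try {
                // Strict behaviour: a missing or incompatible class is an error.
                classes.add(Class.forName(name).asSubclass(base));
            } catch (ClassNotFoundException | ClassCastException e) {
                throw new IllegalArgumentException("Invalid reader class: " + name, e);
            }
        }
        return classes;
    }
}
```

The strictness shown here (any unknown class aborts parsing) is what makes a single central list robust but hard to extend from outside.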

Issues/limitations

The major limitation of the central definition of the reader classes described above is its scalability. The current development workflow of Bio-Formats follows a process described in CONTRIBUTING.md, including PR testing and representative data.

While this workflow has shown its benefits in terms of the robustness and stability of Bio-Formats over the years, as the community grows and the usages vary, it also brings several downsides when dealing with one or several of the following use cases:

Examples of these use cases include the SlideBook6Reader from @Intelligent-Imaging, the ScreenReader largely extended as part of the IDR project /cc @simleo @joshmoore and more generally any new reader developed by the community.

Suggestions

The decoupling of the OMERO/Bio-Formats development process allows more rapid independent releases of Bio-Formats. However, this gain is not sufficient to meet all of the requirements above. As discussed during the 2016 OME Users Meeting, we currently see three options to move forward:

  1. preserve the status quo and maintain a single central definition of all available readers and a strict PR-based workflow within the organization for updating the readers
  2. add some infrastructure to drop the requirements on a configuration file to list the available readers. This could happen by using a controlled set of class annotations at the reader level defining their Bio-Formats nature - see this Pull Request
  3. add some infrastructure to discriminate between centrally maintained and externally maintained readers. Concretely, this would mean splitting the list into two and decreasing the stringency for the external list (no error on missing readers…), allowing it to be updated more frequently.
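
To make option 2 more concrete, here is a minimal sketch of annotation-based marking. `DiscoverableReader`, `ExampleAReader`, `HelperClass` and `AnnotationFilter` are hypothetical names, not part of Bio-Formats, and the way candidate classes are collected (classpath scan, index file, ...) is deliberately left out.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker annotation; it must be retained at runtime to be visible via reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface DiscoverableReader {
    String format();
}

// A reader that declares itself through the annotation.
@DiscoverableReader(format = "Example A")
final class ExampleAReader { }

// Not annotated, so the filter below ignores it.
final class HelperClass { }

final class AnnotationFilter {
    // Keeps only the candidates that carry the marker annotation, in their original order.
    static List<Class<?>> discover(List<Class<?>> candidates) {
        List<Class<?>> readers = new ArrayList<>();
        for (Class<?> c : candidates) {
            if (c.isAnnotationPresent(DiscoverableReader.class)) {
                readers.add(c);
            }
        }
        return readers;
    }
}
```

With this approach no configuration file lists the readers; the cost moves to finding the candidate classes in the first place.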

sbesson commented 8 years ago

Following a short discussion with @melissalinkert, the minimal body of work required for solution 3 is to review loci.formats.ClassList and extend its API to achieve the following functionalities:

Immediately, I would imagine we want the following methods added to the API:

  public void prepend(ClassList<T>)
  public void append(ClassList<T>)
  public static <T> ClassList<T> parse(String, Class<?>)
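
A minimal sketch of what such methods could do is given below. `SketchClassList` is a hypothetical, simplified class and loci.formats.ClassList itself may be structured differently. The lenient parsing (skipping classes that cannot be loaded) is an assumption based on the "no error on missing readers" idea in solution 3, and `parseLenient` takes a list of names rather than a file to keep the example self-contained. Note that in Java a static factory method needs its own type parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified class list, for illustration only.
final class SketchClassList<T> {
    private final List<Class<? extends T>> classes = new ArrayList<>();

    void addClass(Class<? extends T> c) {
        classes.add(c);
    }

    // Puts the other list's classes before the existing ones, keeping their order.
    void prepend(SketchClassList<T> other) {
        classes.addAll(0, other.classes);
    }

    // Puts the other list's classes after the existing ones.
    void append(SketchClassList<T> other) {
        classes.addAll(other.classes);
    }

    // Returns a copy so callers cannot modify the internal list.
    List<Class<? extends T>> getClasses() {
        return new ArrayList<>(classes);
    }

    // Lenient parse for an externally maintained list: names that cannot be
    // loaded, or that are not subclasses of base, are skipped without error.
    static <T> SketchClassList<T> parseLenient(List<String> names, Class<T> base) {
        SketchClassList<T> list = new SketchClassList<>();
        for (String name : names) {
            try {
                list.addClass(Class.forName(name.trim()).asSubclass(base));
            } catch (ClassNotFoundException | ClassCastException | LinkageError e) {
                // Missing or incompatible external reader: ignore and carry on.
            }
        }
        return list;
    }
}
```

Prepending an external list lets its readers take precedence over the central ones; appending makes them fallbacks.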

On the sorting front, the question is how to deal with the ordering of reader classes coming from different sources/files. In the discovery PR, the proposal was to use an enum type - see https://github.com/openmicroscopy/bioformats/pull/2180/files#diff-bacd350b2f0731fcb17f90b7de75f8fdR35.
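
One way an enum could drive the ordering across sources is sketched below. The names `SketchPriority`, `ReaderEntry` and `PriorityMerge` are hypothetical, and the enum in the discovery PR may be defined differently; the sketch only shows the general idea of a stable sort on a priority level.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical priority levels; declaration order defines the sort order.
enum SketchPriority { HIGH, NORMAL, LOW }

// A reader class name together with its priority.
final class ReaderEntry {
    final String className;
    final SketchPriority priority;

    ReaderEntry(String className, SketchPriority priority) {
        this.className = className;
        this.priority = priority;
    }
}

final class PriorityMerge {
    // Merges entries from two sources and orders them by priority.
    // List.sort is stable, so entries with equal priority keep their relative order
    // (central entries first, then external ones).
    static List<ReaderEntry> merge(List<ReaderEntry> central, List<ReaderEntry> external) {
        List<ReaderEntry> all = new ArrayList<>(central);
        all.addAll(external);
        all.sort(Comparator.comparing((ReaderEntry e) -> e.priority));
        return all;
    }
}
```

The stable sort means the position within each source file still matters, but only among readers of the same priority.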

stelfrich commented 8 years ago

From a developer's perspective I would definitely favor solution 2 for maximum flexibility. With that said, I was wondering if you have considered/evaluated SciJava Common as a possible implementation of such a solution. It offers the extensibility you are looking for using annotations combined with implementations of a PluginService that can act as a central entry point for the discovery of additional readers.

Furthermore, the priority of a reader could be used to favor a specific (vendor) implementation if available at runtime. This might become useful in the case of SlideBookReader.

"but scanning all packages on the classpath adds substantially to the scan time (and thus setId time in most cases)"

Couldn't the issue of speed (see @melissalinkert's comment in the PR) be addressed by scanning the classpath once for available readers and matching a reader against this set of available readers every time setId() is called (i.e. only query the PluginService instead of scanning the classpath repeatedly)?
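
The scan-once idea can be sketched as a small cache around the expensive discovery step. This does not use the SciJava API: `CachedReaderIndex` is a hypothetical name, and the `Supplier` stands in for whatever performs the actual classpath scan or plugin lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Caches the result of an expensive discovery step so that it runs at most once.
final class CachedReaderIndex {
    private final Supplier<List<String>> scanner;
    private List<String> cached;  // null until the first lookup

    CachedReaderIndex(Supplier<List<String>> scanner) {
        this.scanner = scanner;
    }

    // Returns the discovered reader names, invoking the scanner only on the first call.
    synchronized List<String> readers() {
        if (cached == null) {
            cached = Collections.unmodifiableList(new ArrayList<>(scanner.get()));
        }
        return cached;
    }
}
```

Each later lookup then only pays for matching against the cached list, at the price of not noticing readers added to the classpath after the first scan.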