ome / bioformats

Bio-Formats is a Java library for reading and writing data in life sciences image file formats. It is developed by the Open Microscopy Environment. Bio-Formats is released under the GNU General Public License (GPL); commercial licenses are available from Glencoe Software.
https://www.openmicroscopy.org/bio-formats
GNU General Public License v2.0

Use the cached XML schema for improved validation #3699

Closed sbesson closed 3 years ago

sbesson commented 3 years ago

Originally posted by @sbesson in https://github.com/ome/bioformats/issues/3697#issuecomment-847306437

The OME-XML validation calls used by certain readers and by the utilities try to retrieve the upstream XML schema from the W3C website, which throttles such requests (see https://www.w3.org/Help/Webmaster#slowdtd). For libraries, the accepted recommendation is to cache these schemas and use catalog files.

The ome-model repository already includes a copy of the XML schema as well as a catalog file, and these resources are bundled in the org.openmicroscopy:specification JAR, but it looks like they are not being used.

Being able to use the internal cached copy should greatly improve both the experience and the performance of validating OME-XML fragments.
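For readers unfamiliar with catalog files, here is a minimal sketch of the general technique using the JDK's own `javax.xml.catalog` API (Java 9+). It is not how Bio-Formats or ome-model wire this up: the class name `CatalogSketch`, the temp-directory layout, and the one-entry catalog are all assumptions made for illustration. The catalog maps the W3C system identifier to a local `xml.xsd` next to the catalog file, and the resulting resolver is handed to a `SchemaFactory` so that schema imports are looked up locally first.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.validation.SchemaFactory;

class CatalogSketch {

  /**
   * Writes a one-entry demo catalog into a fresh temp directory.
   * Hypothetical layout: a real project would ship its own catalog file.
   */
  static Path writeSampleCatalog() {
    String xml =
        "<?xml version=\"1.0\"?>\n"
      + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
      // The relative uri is resolved against the catalog file's location.
      + "  <system systemId=\"http://www.w3.org/2001/xml.xsd\" uri=\"xml.xsd\"/>\n"
      + "</catalog>\n";
    try {
      Path dir = Files.createTempDirectory("schema-cache");
      Path catalog = dir.resolve("catalog.xml");
      Files.write(catalog, xml.getBytes(StandardCharsets.UTF_8));
      return catalog;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds a resolver that consults the given catalog file. */
  static CatalogResolver resolverFor(Path catalog) {
    return CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog.toUri());
  }

  /** Returns a schema factory that resolves imports through the catalog. */
  static SchemaFactory factoryUsing(CatalogResolver resolver) {
    SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    // CatalogResolver implements LSResourceResolver, so it can be plugged in directly.
    factory.setResourceResolver(resolver);
    return factory;
  }
}
```

Calling `resolverFor(writeSampleCatalog()).resolveEntity(null, "http://www.w3.org/2001/xml.xsd")` should yield an `InputSource` whose system identifier points at the local file rather than at www.w3.org. The sketch only maps the identifier; it does not write an actual `xml.xsd`, so a real setup would place the cached schema at that location.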

ngladitz commented 3 years ago

I don't quite know what I am doing here, whether this is generic enough, or how exactly it is intended to work, but this does seem to work:

diff --git a/components/formats-api/src/loci/formats/services/OMEXMLServiceImpl.java b/components/formats-api/src/loci/formats/services/OMEXMLServiceImpl.java
index 2ba3a7069a..a20d7047a5 100644
--- a/components/formats-api/src/loci/formats/services/OMEXMLServiceImpl.java
+++ b/components/formats-api/src/loci/formats/services/OMEXMLServiceImpl.java
@@ -174,6 +174,8 @@ public class OMEXMLServiceImpl extends AbstractService implements OMEXMLService
             /* from specification.jar */
             return getClass().getResourceAsStream("/released-schema/" +
                  matcher.group(1) + "/" + matcher.group(2) + ".xsd");
+          } else if(url.equals("http://www.w3.org/2001/xml.xsd")) {
+            return getClass().getResourceAsStream("/released-schema/external/xml.xsd");
           } else {
             return null;
           }

This gave me roughly the same speed improvement as removing validation in the OBF reader for my small test file. This may indicate that validation does not need to be removed after all (especially if removing it turns out to be problematic), but I still have to test this with something bulkier.
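If more external schemas ever need the same treatment, the extra else-if branch could be generalized into a lookup table. The following is only a sketch of that idea, not code from Bio-Formats: the class `LocalSchemaCache` and its method names are hypothetical, and the single entry is the one taken from the patch above.

```java
import java.io.InputStream;
import java.util.Map;

class LocalSchemaCache {

  // Remote schema URL -> classpath location of a bundled copy.
  // Only this one entry comes from the patch; anything else would be an addition.
  private static final Map<String, String> LOCAL_COPIES = Map.of(
      "http://www.w3.org/2001/xml.xsd", "/released-schema/external/xml.xsd");

  /** Returns the classpath path of the bundled copy, or null if none is known. */
  static String localPathFor(String url) {
    // Map.of(...) rejects null keys on lookup, so guard explicitly.
    return url == null ? null : LOCAL_COPIES.get(url);
  }

  /**
   * Opens the bundled copy. Returns null when the URL is unmapped or the
   * resource is not on the classpath, letting the caller fall back.
   */
  static InputStream open(String url) {
    String path = localPathFor(url);
    return path == null ? null : LocalSchemaCache.class.getResourceAsStream(path);
  }
}
```

Returning null for unknown URLs mirrors the final else branch of the patched method, so existing fallback behaviour would be unchanged.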

dgault commented 3 years ago

Closing this issue, as PR https://github.com/ome/bioformats/pull/3701 provides a solution.