openpreserve / odf-validator

Open source Open Document Format (ODF) validation
http://odf.openpreservation.org/
BSD 3-Clause "New" or "Revised" License

Misleading error messages for extended package formats #151

Open maria-messerschmidt opened 5 months ago

maria-messerschmidt commented 5 months ago

According to POL_2 the file MUST comply with the standard “OASIS Open Document Format for Office Applications (OpenDocument) v1.3”. Note that an ODF 1.3 extended package is not permitted.

These checks do not report that an extended package format is being validated. Instead, a long list (hundreds!) of XML-4 errors is reported, which makes it hard to tell what the actual issue is.

I would expect this to produce a POL_2 error message along with a more specific error message indicating what the compliance issue is, i.e. use of extended package format.

The behaviour is similar for 1.2 Extended and 1.3 Extended (the only difference is the number of XML errors). For 1.2 Extended, it should also flag that the version is wrong (see other issue).

carlwilson commented 4 months ago

So I think that this breaks down to two issues:

Comments/thoughts are welcome.

maria-messerschmidt commented 4 months ago

I think it is fine to split this into two. We do not anticipate allowing extended packages, but of course this could be relevant for other users. Can we use this to track the "single error message" part of the issue and then #153 for validating extended packages? Do you need any examples or input to progress the error message part? My very manual way of checking is to search content.xml for "loext" since many elements have this prefix in the extended packages, but I am not sure this will work in all cases. I have not found a better way to check though.

maria-messerschmidt commented 2 months ago

We discussed possibly resolving this by looking at the headers and namespaces, but unfortunately, I don't think that will work.

I have (so far) found four prefixes/namespaces related to extended packages:

calcext: "urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0"
field: "urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0"
loext: "urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0"
formx: "urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0"

Of these, the "loext" namespace is declared in the manifest header of an ODF 1.3 conforming (i.e. not extended) package, and the "formx" namespace is declared in the content.xml header of an ODF 1.3 conforming package. This is fine of course, but I believe it means we cannot just validate the headers.

The specification defines the valid prefixes associated with defined XML namespaces (Table 1-6, Section 1.5 of the ODF Schema Specification). So we may need to check for elements containing any invalid prefixes (or possibly one of the four extended ones - I am not sure if this is all of them) and report that as "extended package format".
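As a rough illustration of the "check elements, not headers" idea, here is a minimal Python sketch. It is not how odf-validator works; the function names are hypothetical, the set of namespaces is just the four listed above (which may not be complete), and only content.xml and styles.xml are scanned by default. It reports a namespace only when an element or attribute actually uses it, so a declaration in the root element alone does not count.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# The four namespaces found so far in this thread; this list may not be exhaustive.
EXTENDED_NAMESPACES = {
    "urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0",
    "urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0",
    "urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0",
    "urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0",
}


def _namespace(qname):
    """Return the namespace URI of an ElementTree '{uri}local' name, or ''."""
    return qname[1:].split("}", 1)[0] if qname.startswith("{") else ""


def used_extended_namespaces(xml_bytes):
    """Return the extended namespaces used by any element or attribute name."""
    found = set()
    # xmlns declarations are not exposed as attributes, so declared-but-unused
    # namespaces are ignored here.
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        names = [elem.tag, *elem.attrib.keys()]
        found.update(ns for ns in map(_namespace, names) if ns in EXTENDED_NAMESPACES)
    return found


def is_extended_package(path, members=("content.xml", "styles.xml")):
    """True if any scanned member of the ODF package uses an extended namespace."""
    with zipfile.ZipFile(path) as package:
        present = set(package.namelist())
        return any(
            used_extended_namespaces(package.read(name))
            for name in members
            if name in present
        )
```

Note that this only looks at element and attribute names. Extension names that appear inside attribute values would not be detected, so it is a starting point rather than a full conformance check.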

Ideally, we would have a single error message for this and #150 saying something like (with the relevant selection based on validation of the file):

"The file must be an ODF package v1.3. Version detected: {1.0/1.1 | 1.2 | 1.2 Extended package | 1.3 Extended package}."

Then any XML-4 errors related to such prefixes should not be shown.
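To make the proposal concrete, here is a small Python sketch of how the single message and the suppression of related XML-4 errors could fit together. Everything here is an assumption for illustration: the function names, the representation of validator messages as (id, text) pairs, and the simple "prefix:" text match are all hypothetical and do not reflect the validator's real message model.

```python
EXTENDED_PREFIXES = ("calcext", "field", "loext", "formx")


def version_policy_message(version, extended):
    """Return the single policy message, or None for a conforming ODF 1.3 file."""
    if version == "1.3" and not extended:
        return None
    if version in ("1.0", "1.1"):
        detected = "1.0/1.1"
    elif extended:
        detected = f"{version} Extended package"
    else:
        detected = version
    return f"The file must be an ODF package v1.3. Version detected: {detected}."


def suppress_extension_errors(messages):
    """Drop XML-4 messages whose text mentions one of the extended prefixes.

    `messages` is assumed to be a list of (message_id, text) pairs.
    """
    return [
        (msg_id, text)
        for msg_id, text in messages
        if not (msg_id == "XML-4" and any(f"{p}:" in text for p in EXTENDED_PREFIXES))
    ]
```

For example, `version_policy_message("1.2", True)` gives the "1.2 Extended package" variant of the message, while XML-4 errors that do not mention an extended prefix are still passed through.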