w3c / epubcheck

The conformance checker for EPUB publications
https://www.w3.org/publishing/epubcheck/
BSD 3-Clause "New" or "Revised" License

Message codes revamp #1092

Open rdeltour opened 4 years ago

rdeltour commented 4 years ago

TL;DR: Should we revamp EPUBCheck’s message code system? If yes, how?

Background: EPUBCheck message codes

All the validation messages (e.g. warnings and errors) produced by EPUBCheck are associated with codes. For example, message codes can be RSC-005, PKG-007, etc.

The first 3 letters indicate the topic the message is related to (e.g. HTM for XHTML Content Documents, PKG for package-related issues, NAV for issues related to Navigation Documents). The second part of the code is a number, which is incremented when a new check is implemented and we need a new message.
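As a rough illustration of that structure, a code such as RSC-005 can be split into its topic prefix and its number. This is only a sketch: the `MessageCode` class is hypothetical and not part of EPUBCheck, and it assumes every code is three uppercase letters, a hyphen, then digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of EPUBCheck: splits a code such as "RSC-005"
// into its three-letter topic prefix and its numeric part.
final class MessageCode {
    // Assumed shape: three uppercase letters, a hyphen, one or more digits.
    private static final Pattern FORMAT = Pattern.compile("([A-Z]{3})-(\\d+)");

    final String topic;
    final int number;

    private MessageCode(String topic, int number) {
        this.topic = topic;
        this.number = number;
    }

    static MessageCode parse(String code) {
        Matcher m = FORMAT.matcher(code);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a message code: " + code);
        }
        return new MessageCode(m.group(1), Integer.parseInt(m.group(2)));
    }
}
```

With this sketch, `MessageCode.parse("PKG-007")` yields the topic `PKG` and the number 7. The number carries no meaning beyond the order in which checks were added.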

Drawbacks of the current system

These codes and their organization can be confusing, for various reasons.

First, the topic code (the first 3 letters) does not always help identify what the error is related to:

Then, the numbering scheme is a bit wonky:

Possible refactoring

There are several ways to revamp the message code system, for instance:

Questions

samalloing commented 4 years ago

Hi @rdeltour

We use the message codes, but not yet in an automated way, so changing them would not be a problem for us. We are happy with the current system and don't really have a strong opinion about how it could be improved. The only thing that would be interesting to add is whether something is an error or a warning. We select which problems we need to deal with; for example, we could decide to ignore all warnings. This is important because the Severity element in the XML output says error even if it is a warning. We select on this Severity element in our process. But valid files (status is valid and well-formed) can also have a severity of error.

Thanks for all the work on epubcheck!

Sam

bitsgalore commented 4 years ago

Don't know if this helps, but you might want to have a look at how VeraPDF (a conformance checker for the PDF/A standards) handles this. They created validation rules, where each rule contains an explicit reference to the standard it applies to, as well as the clause in the standard on which the rule is based. Below is an example:

  <rule specification="ISO 19005-1:2005" clause="6.7.2" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
    <description>The document catalog dictionary of a conforming file shall contain the Metadata key.</description>
    <object>PDDocument</object>
    <test>metadata_size == 1</test>
    <check status="failed">
      <context>root/document[0]</context>
    </check>
  </rule>

The obvious advantage is that it establishes a direct link between the validator and the filespec. Perhaps it is possible to use something similar for EPUBCheck?
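To make the idea a bit more concrete, here is a minimal sketch of what such a link could look like on the EPUBCheck side: a lookup from message code to a specification reference. Everything here is an assumption for illustration. The `SpecReference` and `SpecReferenceIndex` types are hypothetical, this is not an existing EPUBCheck API, and the code `XXX-001` and clause `1.2.3` in the usage note are placeholders rather than real values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical types, for illustration only.
final class SpecReference {
    final String specification; // plays the role of specification="ISO 19005-1:2005"
    final String clause;        // plays the role of clause="6.7.2"

    SpecReference(String specification, String clause) {
        this.specification = specification;
        this.clause = clause;
    }
}

// Keeps the spec references in a table separate from the message codes.
final class SpecReferenceIndex {
    private final Map<String, SpecReference> byCode = new HashMap<>();

    void register(String messageCode, SpecReference reference) {
        byCode.put(messageCode, reference);
    }

    // Empty when no reference has been registered for the code.
    Optional<SpecReference> lookup(String messageCode) {
        return Optional.ofNullable(byCode.get(messageCode));
    }
}
```

A caller would register entries such as `index.register("XXX-001", new SpecReference("Some Spec", "1.2.3"))` and resolve them with `index.lookup("XXX-001")`. Keeping the references in a separate table, rather than encoding them in the codes themselves, would let the codes stay stable if the spec is reorganised.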

A possible argument against doing this is that it might complicate things if new versions of the filespec are organised differently than the current one, since that would break this link, and fixing this could turn into a major pain, especially if there are frequent updates to the spec. Also, looking at the evolution of EPUB thus far, I think changes to the format have been both more frequent and more radical than changes to the PDF/A profiles, so the situations for both formats may not be completely comparable. In any case this would require quite a bit of coordination between the writers of the filespec and the EPUBCheck developers.

It might also be a good idea to get in touch with the VeraPDF developers at the Open Preservation Foundation (OPF). One of the other tools they're maintaining is JHOVE, and they're currently working on a JHOVE EPUB module that wraps EPUBCheck. So they will probably be both interested in this and willing to help.

sci-phi commented 4 years ago

The VST system uses the short-codes from EPUBCheck to interpret or discard preflight messages:

        if (message.code.equalsIgnoreCase("RSC-005")) {
            if (message.message.contains("spine")) {
                // Reject spine-related errors
                return RESULT_FAIL;
            }
        }

GarthConboy commented 4 years ago

Yes, we use, and are dependent upon, these error codes. These form the basis of our ingestion whitelisting system. The existence of these codes and their immutability makes integration of each updated epubcheck version easy for Google Play. We would vote (strongly) for "stay the course."
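For readers unfamiliar with this kind of integration, the general pattern can be sketched as follows. This is not Google Play's actual system: the `CodeAllowlist` class and the placeholder codes in the usage note are hypothetical, and the sketch assumes a file is accepted only when every code reported for it is on a list of tolerated codes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical allowlist check, for illustration only.
final class CodeAllowlist {
    private final Set<String> tolerated;

    CodeAllowlist(String... toleratedCodes) {
        this.tolerated = new HashSet<>(Arrays.asList(toleratedCodes));
    }

    // True when every reported code is on the allowlist
    // (a file with no reported codes is always accepted).
    boolean accepts(List<String> reportedCodes) {
        return tolerated.containsAll(reportedCodes);
    }
}
```

For example, `new CodeAllowlist("XXX-001", "XXX-002")` would accept a file reporting only `XXX-001` and reject one that also reports `RSC-005`. A list like this only keeps working across EPUBCheck upgrades if the meaning of each code does not change, which is why immutability matters for this use case.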

karenhanson commented 4 years ago

I contributed the first iteration of the EPUB module for JHOVE mentioned above, and it's part of the current release candidate. It makes use of the severity level and the 3-letter prefix (PKG only) to assign Well-Formedness and Validity. The documentation explains how they are used. As it is a new module, I was anticipating some maintenance and wondered if I might need to refine how the message codes are interpreted. Will stay tuned!

vincent-gros commented 4 years ago

Hi @rdeltour,

We use error codes for automated analysis on multiple files. It will be harder without them. Refactoring based on specs could be a good idea.