openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

Incorrect processing of XHTML if XML declaration is missing? #904

Open RvanVeenendaal opened 4 months ago

RvanVeenendaal commented 4 months ago

In some of our XHTML 1.0 Transitional files, the XML declaration is missing. As a result, JHOVE reports HTML-HUL-16 ("Unrecognized or missing DOCTYPE declaration; validation continuing as HTML 3.2"). If I manually add the XML declaration, the document is processed as XHTML (by the XML module), and JHOVE then correctly finds, for example, an unclosed tag somewhere in the document.

According to the XHTML specifications, "An XML declaration is not required in all XML documents" (https://www.w3.org/TR/xhtml1/normative.html). For XHTML 1.1 the XML declaration is also a 'SHOULD', not a 'MUST'. JHOVE seems to expect that an XML declaration is always present.

Could this please be fixed, so that JHOVE correctly processes XHTML files without an XML declaration?
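To make the request concrete, here is a minimal sketch of the kind of check I have in mind: decide whether a file is XHTML from the public identifier in its DOCTYPE, whether or not an XML declaration precedes it. This is not JHOVE's code. The class name `XhtmlSniffer`, the method `looksLikeXhtml`, and the regular expression are hypothetical and only illustrate the idea.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JHOVE: sniffs XHTML from the DOCTYPE alone.
final class XhtmlSniffer {

    // Captures the public identifier of an "html" DOCTYPE, wherever it appears
    // in the inspected text (so an optional XML declaration before it is fine).
    private static final Pattern DOCTYPE = Pattern.compile(
            "<!DOCTYPE\\s+html\\s+PUBLIC\\s+\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the text contains a DOCTYPE whose public identifier
    // starts with the W3C XHTML prefix (e.g. XHTML 1.0 Transitional, XHTML 1.1).
    static boolean looksLikeXhtml(String head) {
        Matcher m = DOCTYPE.matcher(head);
        return m.find() && m.group(1).startsWith("-//W3C//DTD XHTML ");
    }

    public static void main(String[] args) {
        String doctype = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";
        String withDeclaration = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + doctype;

        // Both variants carry the same DOCTYPE, so both should be treated alike.
        System.out.println("without XML declaration: " + looksLikeXhtml(doctype));
        System.out.println("with XML declaration:    " + looksLikeXhtml(withDeclaration));
    }
}
```

With a check along these lines, both of the examples below would be routed to the same XHTML processing, since they differ only in the XML declaration.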

Example of problem (edit to see all markup):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

TITLE
TEXT

JHOVE 1.26.1 output (Dutch):

Documents
C:\Temp\example.htm
Module HTML-hul
  Release: 1.4.2
  Date: 22-apr-2022
RepInfo
  URI: C:\Temp\example.htm
  LastModified: Mon Mar 04 15:45:26 CET 2024
  Size: 534
  Format: HTML
  Status: Not well-formed
  Messages
    ErrorMessage: Onherkend of ontbrekende DOCTYPE declaratie; validatie wordt verder gezet als HTML 3.2
      ID: HTML-HUL-16
    InfoMessage: This HTML version is currently not supported, falling back to HTML 3.2
      ID: NO-ID
    ErrorMessage: Ongedefinieerd attribuut voor element
      ID: HTML-HUL-7
      SubMessage: Name = html, Attribute = xmlns, Line = 2, Column = 7
    ErrorMessage: De constructie met "/>" is onjuist, behalve in XHTML
      ID: NO-ID
      SubMessage: Name = meta, Line = 4, Column = 10
    ErrorMessage: De constructie met "/>" is onjuist, behalve in XHTML
      ID: NO-ID
      SubMessage: Name = link, Line = 6, Column = 10
  MimeType: text/html

Example with XML declaration added:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

TITLE
TEXT

JHOVE 1.26.1 output (Dutch):

Documents
C:\Temp\example_with_XML_declaration.htm
Module XML-hul
  Release: 1.5.2
  Date: 22-apr-2022
RepInfo
  URI: C:\Temp\example_with_XML_declaration.htm
  LastModified: Mon Mar 04 15:48:22 CET 2024
  Size: 574
  Format: XML
  Status: Not well-formed
  SignatureMatches
    XML-hul
  Messages
    ErrorMessage: SAXParseException
      ID: XML-HUL-1
      SubMessage: The element type "link" must be terminated by the matching end-tag "</link>". Line = 8, Column = 7.
  MimeType: text/xml
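For anyone trying to reproduce this outside JHOVE, the sketch below uses the JDK's standard `DocumentBuilder` to show that a plain XML parser does not need the XML declaration to detect an unclosed element. It is only an illustration under assumptions: the class name `WellFormedCheck`, the method `isWellFormed`, and the sample markup are hypothetical (they are not the stripped example above), and the entity resolver replaces the external DTD with an empty one to avoid network access, so named entities such as `&nbsp;` would not resolve here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JHOVE: well-formedness check with the JDK parser.
final class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false on a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption for this sketch: swap any external DTD for an empty one
            // so that parsing never fetches the W3C DTD over the network.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. a missing end tag
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String doctype = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n";
        // Illustrative markup with an unclosed <link> element.
        String broken = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>TITLE</title>"
                + "<link rel=\"stylesheet\" href=\"a.css\"></head><body><p>TEXT</p></body></html>";

        System.out.println("without XML declaration: " + isWellFormed(doctype + broken));
        System.out.println("with XML declaration:    "
                + isWellFormed("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + doctype + broken));
    }
}
```

Both calls should report the same result, which suggests the difference in JHOVE's output comes from module selection rather than from the XML parsing itself.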

carlwilson commented 3 months ago

Thanks for reporting this. We will try to reproduce the issue and get back to you if we have questions.