internetarchive / archive-hocr-tools

Efficient hOCR tooling

Epub: address `epubcheck` validation errors #19

Open scottbarnes opened 2 months ago

scottbarnes commented 2 months ago

This PR adds two commits to address two separate epubcheck validation errors.

The first relates to the media type (and HTML escaping), and the second relates to the table of contents.

For OPF-043, epubcheck took issue with the text/html media_type for the spine. Once this was changed, the book text also needed to be HTML-escaped; otherwise any HTML appearing in the book text might be rendered as markup.
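As an illustration of the escaping step, here is a minimal sketch using Python's standard `html.escape`. The helper name `escape_page_text` and the sample string are hypothetical and are not taken from the PR; the actual commit may escape the text differently.

```python
import html


def escape_page_text(text: str) -> str:
    """Escape &, <, > and quotes so OCR text is displayed literally.

    Hypothetical helper for illustration; not the code from this PR.
    """
    return html.escape(text, quote=True)


# OCR output that happens to contain markup-like characters.
sample = "<b>a & b</b>"
print(escape_page_text(sample))
```

Without a step like this, a reading system would be free to interpret the `<b>` in the sample as a real element rather than as text from the scanned page.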

The second commit adds the Internet Archive scanning notice to the table of contents, so the TOC is no longer empty.

ebooklib expects a TOC whenever epub.EpubNcx() and epub.EpubNav() are used, and hocr-to-epub uses both. If those aren't used, the corresponding files would need to be constructed manually, as they are required.

Validation prior to this PR:

❯ epubcheck ./test_output_no_toc_.epub 
Validating using EPUB version 3.3 rules.
ERROR(RSC-005): ./test_output_no_toc_.epub/EPUB/toc.ncx(12,12): Error while parsing file: element "navMap" incomplete; missing required element "navPoint"
ERROR(RSC-005): ./test_output_no_toc_.epub/EPUB/nav.xhtml(10,12): Error while parsing file: element "ol" incomplete; missing required element "li"

Check finished with errors
Messages: 0 fatals / 2 errors / 0 warnings / 0 infos

EPUBCheck completed

The toc.ncx file prior to this PR:

❯ cat unzipped_no_toc/EPUB/toc.ncx
<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta content="sim_english-illustrated-magazine_1884-12_2_15" name="dtb:uid"/>
    <meta content="0" name="dtb:depth"/>
    <meta content="0" name="dtb:totalPageCount"/>
    <meta content="0" name="dtb:maxPageNumber"/>
  </head>
  <docTitle>
    <text>The English Illustrated Magazine  1884-12: Vol 2 Iss 15</text>
  </docTitle>
  <navMap/>
</ncx>
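The empty `<navMap/>` above is what triggers the first RSC-005 error. A rough way to spot the same condition without running epubcheck is to count the `navPoint` children of `navMap`. This is only a sketch: `count_nav_points` is a hypothetical helper, it is not part of this PR or of epubcheck, and it checks far less than epubcheck does.

```python
import xml.etree.ElementTree as ET

# Namespace used by toc.ncx, as shown in the file above.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def count_nav_points(ncx_xml: str) -> int:
    """Return the number of top-level navPoint elements inside navMap."""
    root = ET.fromstring(ncx_xml)
    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is None:
        return 0
    return len(nav_map.findall("ncx:navPoint", NCX_NS))


# Trimmed-down stand-in for the toc.ncx shown above.
empty_ncx = (
    '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">'
    "<navMap/></ncx>"
)
if count_nav_points(empty_ncx) == 0:
    print("navMap has no navPoint; epubcheck would report RSC-005")
```

A count of zero corresponds to the "missing required element navPoint" message in the validation output.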

The nav.xhtml file prior to this PR:

❯ cat unzipped_no_toc/EPUB/nav.xhtml 
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
  <head>
    <title>The English Illustrated Magazine  1884-12: Vol 2 Iss 15</title>
  </head>
  <body>
    <nav epub:type="toc" id="id" role="doc-toc">
      <h2>The English Illustrated Magazine  1884-12: Vol 2 Iss 15</h2>
      <ol/>
    </nav>
  </body>
</html>

With the notice as the only TOC entry, validation passes:

❯ epubcheck ./test_output_with_toc.epub 
Validating using EPUB version 3.3 rules.
No errors or warnings detected.
Messages: 0 fatals / 0 errors / 0 warnings / 0 infos

EPUBCheck completed

The toc.ncx file after this PR:

❯ cat unzipped_with_toc/EPUB/toc.ncx 
<?xml version='1.0' encoding='utf-8'?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta content="sim_english-illustrated-magazine_1884-12_2_15" name="dtb:uid"/>
    <meta content="0" name="dtb:depth"/>
    <meta content="0" name="dtb:totalPageCount"/>
    <meta content="0" name="dtb:maxPageNumber"/>
  </head>
  <docTitle>
    <text>The English Illustrated Magazine  1884-12: Vol 2 Iss 15</text>
  </docTitle>
  <navMap>
    <navPoint id="chapter_0">
      <navLabel>
        <text>Notice</text>
      </navLabel>
      <content src="notice.html"/>
    </navPoint>
  </navMap>
</ncx>

The nav.xhtml file after this PR:

❯ cat unzipped_with_toc/EPUB/nav.xhtml 
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="en" xml:lang="en">
  <head>
    <title>The English Illustrated Magazine  1884-12: Vol 2 Iss 15</title>
  </head>
  <body>
    <nav epub:type="toc" id="id" role="doc-toc">
      <h2>The English Illustrated Magazine  1884-12: Vol 2 Iss 15</h2>
      <ol>
        <li>
          <a href="notice.html">Notice</a>
        </li>
      </ol>
    </nav>
  </body>
</html>