realpython / python-guide

Python best practices guidebook, written for humans.
https://docs.python-guide.org

generated epub contains one chapter twice #907

Open Troyciv opened 6 years ago

Troyciv commented 6 years ago

The epub version contains the chapter "lower level: virtualenv" two times.

edit: actually, it seems that only the TOC is somehow wrong. And there are two chapters that appear twice in the TOC: "Pyenv & Virtual Environments" and "Lower level: virtualenv"

Hasimir commented 6 years ago

This is an EPUB built with Sphinx, to be totally honest it would be a miracle if it validated.

There are no miracles here: a quick check of the latest EPUB version indicates it currently has at least 1,259 validation errors, and probably more that can't be found because other errors block the checking. Many of these will be repetitious.

EPUB generation support in Sphinx was added more as an afterthought, on the theory that since it's essentially HTML and CSS and that's being built anyway, then why not. I understand the sentiment, but the focus of Sphinx is still live websites for documentation and not EPUBs, which do have additional standards for self-contained documents (i.e. you shouldn't need network connectivity for it to be complete).

Sphinx does not build final EPUB files which meet those standards and introduces all sorts of errors along the way. Most likely the duplication here comes from the chapter being referenced twice, with the TOC builder dutifully replicating that. The chapter itself won't be there twice, because if it had been it would appear in the manifest twice, and an additional validation check on the content.opf file gave it the all clear.
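The manifest reasoning above can be checked mechanically. What follows is only a minimal sketch, not the validator used for the check mentioned here: it assumes the standard OPF namespace, and the function name `duplicate_manifest_hrefs` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard OPF package namespace (the sample below is hypothetical).
OPF_NS = "{http://www.idpf.org/2007/opf}"


def duplicate_manifest_hrefs(opf_bytes: bytes) -> list[str]:
    """Return every href listed more than once in the OPF manifest."""
    root = ET.fromstring(opf_bytes)
    hrefs = [
        item.get("href")
        for item in root.iter(f"{OPF_NS}item")
        if item.get("href") is not None
    ]
    return sorted(h for h, count in Counter(hrefs).items() if count > 1)


# Hypothetical manifest in which one chapter file is listed twice.
sample = (
    b'<package xmlns="http://www.idpf.org/2007/opf"><manifest>'
    b'<item id="a" href="a.xhtml" media-type="application/xhtml+xml"/>'
    b'<item id="b" href="b.xhtml" media-type="application/xhtml+xml"/>'
    b'<item id="c" href="a.xhtml" media-type="application/xhtml+xml"/>'
    b"</manifest></package>"
)
print(duplicate_manifest_hrefs(sample))  # ['a.xhtml']
```

An empty list for the real content.opf would be consistent with the chapter existing only once in the book and the duplication living purely in the TOC.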

Indeed the toc.ncx file alone contains 30 validation errors, all of the same type:

System ID: zip:file:/tmp/python-guide.epub!/toc.ncx
Main validation file: zip:file:/tmp/python-guide.epub!/toc.ncx
Scenario name: NCX
Document type: NCX
Engine name: Schematron 1.5
Severity: error
Description: different playOrder values for navPoint/navTarget/pageTarget that refer to same target
Start location: 295:46
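For anyone wanting to see which TOC entries trigger that error, here is a rough sketch that looks for the same condition the Schematron rule describes, restricted to navPoint elements. It assumes the standard NCX namespace; the function name `conflicting_play_orders` and the sample NCX are hypothetical, and this is not a substitute for a real validator.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard NCX namespace (the sample below is hypothetical).
NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"


def conflicting_play_orders(ncx_bytes: bytes) -> dict[str, list[str]]:
    """Map each target that navPoints reference with differing playOrder values."""
    root = ET.fromstring(ncx_bytes)
    orders: dict[str, set[str]] = defaultdict(set)
    for nav_point in root.iter(f"{NCX_NS}navPoint"):
        # The <content> element is a direct child of its navPoint.
        content = nav_point.find(f"{NCX_NS}content")
        if content is None or content.get("src") is None:
            continue
        orders[content.get("src")].add(nav_point.get("playOrder", ""))
    return {src: sorted(vals) for src, vals in orders.items() if len(vals) > 1}


# Hypothetical NCX in which one chapter appears twice in the TOC.
sample_ncx = (
    b'<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/"><navMap>'
    b'<navPoint id="n1" playOrder="1"><content src="virtualenv.xhtml"/></navPoint>'
    b'<navPoint id="n2" playOrder="2"><content src="virtualenv.xhtml"/></navPoint>'
    b'<navPoint id="n3" playOrder="3"><content src="other.xhtml"/></navPoint>'
    b"</navMap></ncx>"
)
print(conflicting_play_orders(sample_ncx))  # {'virtualenv.xhtml': ['1', '2']}
```

Against a real file, the bytes could be read with `zipfile.ZipFile("/tmp/python-guide.epub").read("toc.ncx")`, going by the path shown in the validator output.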

Another thing you shouldn't need to see in an EPUB file, even if it might be of use on a website, is a 404 error. Yet in this EPUB there is indeed a 404.xhtml file. Even ignoring that as an aspect of book design itself, whether digital or not, it still adds six more validation errors to the total because Sphinx doesn't produce real XHTML, and EPUB 3.0.1 (which is what it's claiming to be) is HTML5+XML (sometimes referred to as XHTML5).

Now wanting to fix the EPUB for this project is commendable, but anything done here would be like trying to patch a severed neck with a band-aid; messy and futile.

Sphinx is much like the other Python based EPUB builder, Calibre, in that it will usually produce something which can mostly be read in most ereader software, though it may be a bit ugly in places. The official Python documentation suffers from exactly the same thing in EPUB form and for exactly the same reasons, but it's still usable in a number of circumstances and so it's still provided.

On the other hand no one uses Sphinx to produce publishable commercial EPUBs in the publishing (including tech publishing) industry. Some might use it as a starting point, as many more do with Calibre, but in both cases there's a lot of post-production needed to beat things back into a validating form. That's what Smashwords does with Calibre; that's what starts the processes which they refer to as their meatgrinder (and even that has been known to miss things, even with their founder's own books).

I love Python, I really do, it helps me make a living; but none of my EPUB production goes anywhere near Python, Sphinx or Calibre. That's pretty much entirely due to the breadth and depth of the problems in producing validating output from both of those projects and I was most eager to check them when I started picking publishing solutions. This project probably doesn't need the perfectionist extreme I went for, except if building a version for sale; but that would be done differently and for similar reasons to why a press print ready PDF/X-3:2002 file produced for the O'Reilly edition of the Guide would not use the LaTeX PDF we can download from readthedocs.org.

In fact, in the last several years I have only seen one EPUB built with Sphinx which produced no errors at all and that was produced by and for the alot project (a frontend for notmuch mail). When I asked them how they managed it, the response was that they didn't know, they weren't familiar with Sphinx, had just adopted it to get their docs on the same site as everyone else's and it must've been beginners' luck as more often than not they tended to ignore the web.

As for fixing the Guide's EPUBs, there are only two viable solutions:

  1. Write something which would take a Sphinx generated EPUB, extract it, cut anything you knew would be generated and needed to be cut, tidy up the XHTML a bit and then rezip in the correct format.
  2. Work with Pocoo to fix Sphinx so it produces valid EPUBs in the first place without breaking existing documentation builds in other formats or reducing functionality.
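To give a sense of what the "rezip in the correct format" step of option one involves, here is a minimal sketch and nothing more. It assumes the EPUB has already been extracted to a directory; the function name `repack_epub` and its `skip` parameter are hypothetical. The one firm rule it encodes is that the `mimetype` entry must be the first file in the archive and stored uncompressed. Note that dropping a file such as 404.xhtml this way is not enough on its own: its entries in content.opf and the TOC would have to be removed too, which this sketch does not attempt.

```python
import zipfile
from pathlib import Path


def repack_epub(src_dir: str, out_path: str, skip: frozenset[str] = frozenset()) -> None:
    """Zip an extracted EPUB directory with mimetype first and uncompressed.

    out_path should live outside src_dir so the archive does not include itself.
    """
    src = Path(src_dir)
    with zipfile.ZipFile(out_path, "w") as zf:
        # The mimetype entry goes first and is stored without compression.
        zf.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src).as_posix()
            # Skip directories, the mimetype already written, and unwanted files.
            if path.is_dir() or rel == "mimetype" or rel in skip:
                continue
            zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)
```

A call might look like `repack_epub("extracted/", "python-guide-fixed.epub", skip=frozenset({"404.xhtml"}))`, with both paths being placeholders. The cutting and XHTML tidying in between is where nearly all the real work would be.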

Option one would be subject to change based on source material and not necessarily be portable to other EPUBs built the same way. Option two would take years and may or may not be of any real priority with the Sphinx developers since their focus is not and never has been publishing beyond the live documentation level, for which the HTML output is clearly good enough.

I'm certainly not opposed to anyone doing either of these things, but I do think anyone considering them should really understand just how big a job this really is.

I had a good, long look about three or four years ago and ultimately went very far from Python and Sphinx. I considered trying to fix Sphinx, but realised that I could either do what I originally wanted to do in writing and publishing with something else, or spend all my time fixing Sphinx instead. I opted for continuing to pursue what I'd originally planned, as I'd been wanting to do it for years (and still do).

So that's why, even though I've got a pretty good idea of the problems here in a broad sense, I'm not directing my efforts that way. Still, I wish anyone game enough to take a crack at it the best of luck.