standardebooks / tools

The Standard Ebooks toolset for producing our ebook files.

python 'clean' on Mac deletes target file #16

Closed: michaelbroe closed this issue 7 years ago

michaelbroe commented 7 years ago

On Mac OSX 10.11, I pulled the latest version of tools, installed all dependencies, and ran 'clean' on the raw Gutenberg .html file. The target body.xhtml file is deleted. Here is the output running in verbose mode:

$ clean -v .
Processing /Users/michael/Documents/StandardEbooks/thomas-love-peacock_nightmare-abbey ...-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
-:16: parser error : Unescaped '<' not allowed in attributes values
<meta name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webma
                                                                          ^
-:16: parser error : attributes construct error
<meta name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webma
                                                                          ^
-:16: parser error : Couldn't find end of Start Tag meta line 16
<meta name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webma
                                                                          ^
-:16: parser error : error parsing attribute name
a name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webmaster
                                                                               ^
-:16: parser error : attributes construct error
a name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webmaster
                                                                               ^
-:16: parser error : Couldn't find end of Start Tag webmaster line 16
a name="generator" content="Ebookmaker 0.4.0a5 by Marcello Perathoner <webmaster
                                                                               ^
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '<' not found

^
 OK
acabal commented 7 years ago

OK, this is probably because Gutenberg HTML files are not XHTML files, and xmllint expects XHTML. I've just committed c701154 so that clean will now stop with an error if xmllint returns an error, instead of destroying the file. Hopefully this fixes things!
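The guard added in that commit can be sketched roughly like this (a minimal sketch, not the actual commit; the `clean_file` function name and the exact error message are illustrative, not taken from the clean script):

```shell
# Minimal sketch of the guard: capture xmllint's output and exit status,
# and only overwrite the target file when canonicalization succeeded
# AND produced non-empty output.
clean_file() {
	file="$1"
	if output=$(xmllint --c14n "$file" 2>/dev/null) && [ -n "$output" ]; then
		printf '%s\n' "$output" > "$file"
	else
		echo "Error: $file must be in XHTML format" >&2
		return 1
	fi
}
```

The key point is that the original pipeline redirected xmllint's (empty) output straight over the source file, so any parse failure destroyed it; buffering the output and checking the exit status first makes the failure non-destructive.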

michaelbroe commented 7 years ago

So I looked into this a little further. The new version of clean on Mac no longer deletes the file; it simply leaves it untouched with the error "files must be in XHTML format...". But AFAIK the Gutenberg document I am working with is XHTML. It's at http://www.gutenberg.org/cache/epub/9909/pg9909.html, the doctype is <!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>, and running it through validator.w3.org yields "This document was successfully checked as XHTML 1.1!"

I also ran this file through the new version of clean on Ubuntu, and got the same error.

Further, after I delete the doctype line which was making xmllint hang (as pointed out in the clean script), it seems to canonicalize fine at the command line:

xmllint --c14n --timing pg9909.html > pg9909.canonical.html
Parsing took 15 ms
Saving took 7 ms
Freeing took 0 ms

Yet in the script it seems it must be throwing an error there?

Apologies if I am missing something obvious, still very new to this..!

Cheers, Michael

acabal commented 7 years ago

Perhaps xmllint requires an <?xml ... ?> declaration at the beginning of a file?

In either case it really wasn't meant to be run on raw Gutenberg files. You should use split-files to insert our own header/footer before running clean.

However if you can investigate this further and get a fix in so that it does work with raw Gutenberg files, that would be very helpful!

rspeed commented 7 years ago

This is a weird one, but I think I figured it out.

The W3C has configured its web servers to add an artificial delay when serving DTDs, and this is causing xmllint to time out when it attempts to fetch xhtml11.dtd and its numerous module files. I suspect the reason this issue isn't being encountered by @acabal is that Debian distros install local copies of common DTDs by default. Why Apple doesn't do the same is anyone's guess.

The workaround is to set up your own local copy of the DTD and create a catalog file that points to it. Once I did that, xmllint ran with no issues. Give this a try:

curl -O https://www.w3.org/TR/xhtml11/xhtml11.tgz
tar -xzf xhtml11.tgz
mkdir -p /usr/local/share/xhtml11/DTD
cp xhtml11-20101123/DTD/xhtml11-flat.dtd /usr/local/share/xhtml11/DTD/xhtml11.dtd
sudo xmlcatalog --create --noout /etc/xml/catalog
sudo xmlcatalog --noout --add public "-//W3C//DTD XHTML 1.1//EN" "file:///usr/local/share/xhtml11/DTD/xhtml11.dtd" /etc/xml/catalog

If you're wondering why I used xhtml11-flat.dtd, it's because the archive doesn't include any of the module files. *shrug*

acabal commented 7 years ago

So, let me see if I understand this. On Mac xmllint has to contact W3C servers to download the DTD, which times out. After it times out, it outputs an empty string, causing clean to exit with an error.

But this is only for raw Gutenberg files, correct? If you run clean on a finished SE repo, then things work as expected?

If that's correct, then maybe a better solution would be to see how PG files differ from SE files, and have clean try to detect and correct that before running. clean already strips the <!doctype> declaration from PG files, and at a glance they seem to be more or less the same from there...
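That strip-then-canonicalize step could be sketched like this (a hedged sketch, not the actual clean script; `strip_and_c14n` is a hypothetical helper, and it assumes the DOCTYPE sits on a single line, as it does in pg9909.html; --nonet additionally stops xmllint from ever reaching out to the W3C servers):

```shell
# Delete any single-line DOCTYPE, then canonicalize with --nonet so
# xmllint fails fast instead of trying to fetch remote DTDs.
# Usage: strip_and_c14n pg9909.html > pg9909.canonical.html
strip_and_c14n() {
	sed '/<!DOCTYPE/d' "$1" | xmllint --c14n --nonet -
}
```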

(BTW make sure you run git pull on the tools repo to make sure you have the latest version of all these tools, they're changing rapidly.)

rspeed commented 7 years ago

So, let me see if I understand this. On Mac xmllint has to contact W3C servers to download the DTD, which times out.

Correct.

After it times out, it outputs an empty string, causing clean to exit with an error.

Presumably. I was only testing with xmllint directly. In fact… now I'm not sure why @michaelbroe was encountering this error via clean since that should be removing the DOCTYPE before it gets run through xmllint. I'll take a look at that line in a bit to see if maybe it isn't working correctly in this instance.

But this is only for raw Gutenberg files, correct? If you run clean on a finished SE repo, then things work as expected?

I can't say for sure if it would affect all PG files, just anything that has a W3C DOCTYPE. Since that's not the case for any of the documents in the SE repos (presumably because they've already been run through clean) they wouldn't be affected.

acabal commented 7 years ago

If you're running a Mac, maybe you can try running it on a raw PG file to see if the issue even still exists? Though without the specific problem file from @michaelbroe we might not get very far.

rspeed commented 7 years ago

He provided a URL for the problematic file, which is what I've been using for testing. And indeed, I just gave it a run through clean and didn't encounter any errors.

@michaelbroe Before following my steps above, can you try running the original file through clean one more time?

acabal commented 7 years ago

OK, in that case I'll close this for now until @michaelbroe can get back to us. I'm assuming this was fixed in some earlier commit since it hasn't been reported since and we can no longer reproduce it.

michaelbroe commented 7 years ago

Hi,

I'm sure this can stay closed. Just for context, the only reason I got into this situation is that the work I chose is a very short novella, so I didn't think it would be useful to section the document into many files. Given that, I got rid of all the Gutenberg stuff but kept anything I thought was part of well-formed HTML structure, including the header markup. I didn't fully realize that new header information is supplied by the split-files command. It turns out the thing that trips up processing this file is some syntax in the <head> section that I didn't prune. When I remove that, it cleans correctly.

It seems to me the real issue is that if anyone chooses not to section their work in the future, they need to strip the header markup and find a way to add the XML wrappers usually supplied by split-files. I think it's unlikely a novice is going to run into this problem again.

Just to confirm the xmllint timing issues on my Mac:

$ xmllint --c14n --timing pg9909.html > pg9909.canonical.html
Parsing took 120127 ms
Saving took 7 ms
Freeing took 0 ms

# Removed doctype line

$ xmllint --c14n --timing pg9909.no_doctype.html > pg9909.no_doctype.canonical.html
Parsing took 5 ms
Saving took 4 ms
Freeing took 0 ms

This time-out was the original problem, I think. The latest versions of clean now successfully strip this line and canonicalize correctly.

Cheers! Michael