Closed michaelbroe closed 7 years ago
OK, this is probably because Gutenberg HTML files are not XHTML files, and xmllint expects XHTML. I've just committed c701154 so that `clean` will now stop with an error if xmllint returns an error, instead of destroying the file. Hopefully this fixes things!
So I looked into this a little further. The new version of `clean` on Mac no longer deletes the file; it simply leaves it untouched with the error `files must be in XHTML format`... But AFAIK the Gutenberg document I am working with is XHTML. It's at http://www.gutenberg.org/cache/epub/9909/pg9909.html, the doctype is `<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>`, and running it through validator.w3.org yields "This document was successfully checked as XHTML 1.1!"
I also ran this file through the new version of `clean` on Ubuntu, and got the same error.
Further, after I delete the doctype line which was making xmllint hang (as pointed out in the `clean` script), it seems to canonicalize fine at the command line:
$ xmllint --c14n --timing pg9909.html > pg9909.canonical.html
Parsing took 15 ms
Saving took 7 ms
Freeing took 0 ms
Yet in the script it seems it must be throwing an error there? Apologies if I am missing something obvious; I'm still very new to this!
Cheers, Michael
Perhaps `xmllint` requires an `<?xml ... ?>` declaration at the beginning of a file? In any case, it really wasn't meant to be run on raw Gutenberg files. You should use `split-files` to insert our own header/footer before running `clean`.
However if you can investigate this further and get a fix in so that it does work with raw Gutenberg files, that would be very helpful!
This is a weird one, but I think I figured it out.
The W3C has configured its web servers to add an artificial delay when serving DTDs, and this is causing `xmllint` to time out when it attempts to fetch `xhtml11.dtd` and its numerous module files. I suspect the reason this issue isn't being encountered by @acabal is that Debian distros install local copies of common DTDs by default. Why Apple doesn't do the same is anyone's guess.
The workaround is to set up your own local copy and create a catalog file to point to it. Once I did that, `xmllint` runs with no issues. Give this a try:
curl -O https://www.w3.org/TR/xhtml11/xhtml11.tgz
tar -xzf xhtml11.tgz
sudo mkdir -p /usr/local/share/xhtml11/DTD
sudo cp xhtml11-20101123/DTD/xhtml11-flat.dtd /usr/local/share/xhtml11/DTD/xhtml11.dtd
sudo xmlcatalog --create --noout /etc/xml/catalog
sudo xmlcatalog --noout --add public "-//W3C//DTD XHTML 1.1//EN" "file:///usr/local/share/xhtml11/DTD/xhtml11.dtd" /etc/xml/catalog
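For reference, after the two `xmlcatalog` commands above, the resulting `/etc/xml/catalog` should look roughly like this (a sketch of the OASIS XML catalog format; exact whitespace and the DOCTYPE header may differ slightly between libxml2 versions):

```xml
<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
  "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Maps the XHTML 1.1 public identifier to the local flat DTD -->
  <public publicId="-//W3C//DTD XHTML 1.1//EN"
          uri="file:///usr/local/share/xhtml11/DTD/xhtml11.dtd"/>
</catalog>
```

With this in place, any libxml2-based tool (including `xmllint`) resolves the XHTML 1.1 public ID locally instead of hitting the W3C servers.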
If you're wondering why I used `xhtml11-flat.dtd`, it's because the archive doesn't include any of the module files. *shrug*
So, let me see if I understand this. On Mac, `xmllint` has to contact W3C servers to download the DTD, which times out. After it times out, it outputs an empty string, causing `clean` to exit with an error.
But this is only for raw Gutenberg files, correct? If you run `clean` on a finished SE repo, then things work as expected?
If that's correct, then maybe a better solution would be to see how PG files differ from SE files, and have `clean` try to detect and correct that before running. `clean` already strips the `<!doctype>` declaration from PG files, and at a glance they seem to be more or less the same from there...
(BTW, make sure you run `git pull` on the tools repo so you have the latest version of all these tools; they're changing rapidly.)
> So, let me see if I understand this. On Mac xmllint has to contact W3C servers to download the DTD, which times out.
Correct.
> After it times out, it outputs an empty string, causing `clean` to exit with an error.
Presumably. I was only testing with `xmllint` directly. In fact… now I'm not sure why @michaelbroe was encountering this error via `clean`, since that should be removing the DOCTYPE before it gets run through `xmllint`. I'll take a look at that line in a bit to see if maybe it isn't working correctly in this instance.
> But this is only for raw Gutenberg files, correct? If you run `clean` on a finished SE repo, then things work as expected?
I can't say for sure if it would affect all PG files, just anything that has a W3C DOCTYPE. Since that's not the case for any of the documents in the SE repos (presumably because they've already been run through `clean`), they wouldn't be affected.
If you're running a Mac, maybe you can try running it on a raw PG file to see if the issue even still exists? Though without the specific problem file from @michaelbroe, we might not get very far.
He provided a URL for the problematic file, which is what I've been using for testing. And indeed, I just gave it a run through `clean` and didn't encounter any errors.
@michaelbroe Before following my steps above, can you try running the original file through `clean` one more time?
OK, in that case I'll close this for now until @michaelbroe can get back to us. I'm assuming this was fixed in an earlier commit, since it hasn't been reported again and we can no longer reproduce it.
Hi,
I'm sure this can stay closed. Just for context, the only reason I got into this situation is that the work I chose is a very short novella, so I didn't think it would be useful to section the document into many files. Given that, I got rid of all the Gutenberg stuff, but kept anything I thought was part of well-formed HTML structure, including the header markup. I didn't fully realize that new header information is supplied by the `split-files` command. It turns out the thing that trips up processing this file is some syntax in the `<head>` section that I didn't prune. When I remove that, it cleans correctly.
It seems to me the real issue is that if anyone attempts not to section their work in future, they should strip the header markup and find a way to add the XML wrappers usually supplied by `split-files`. I think it's unlikely a novice is going to run into this problem again.
Just to confirm the xmllint timing issues on my Mac:
$ xmllint --c14n --timing pg9909.html > pg9909.canonical.html
Parsing took 120127 ms
Saving took 7 ms
Freeing took 0 ms
# Removed doctype line
$ xmllint --c14n --timing pg9909.no_doctype.html > pg9909.no_doctype.canonical.html
Parsing took 5 ms
Saving took 4 ms
Freeing took 0 ms
This time-out was the original problem, I think. The latest versions of `clean` now successfully strip this line and canonicalize correctly.
Cheers! Michael
On Mac OS X 10.11, I pulled the latest version of tools, installed all dependencies, and ran `clean` on the raw Gutenberg .html file. The target `body.xhtml` file is deleted. Here is the output running in verbose mode: