google-code-export / fanficdownloader

Automatically exported from code.google.com/p/fanficdownloader

EPUB output is invalid: XHTML has nested <p> elements. #9

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Generate EPUB output, e.g., "python downloader.py 
http://www.fanfiction.net/s/5782108/1/ epub".
2. Test it with epubcheck or http://threepress.org/document/epub-validate/. The 
most common error will likely look something like this:
"ERROR: Five_Times_Ashlyn_Sniped_at_Boone.epub/OEBPS/MS4gdGhlbWUgMzogaG9wbGVzcw==.xhtml(18): 
element "p" from namespace "http://www.w3.org/1999/xhtml" not allowed in this context".
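
For a quick local pre-check before running epubcheck, something like the following might help. It is only a rough sketch in Python 3 using the standard library, not a replacement for a real validator: it flags a single kind of error ('p' directly inside 'p'), and the function name files_with_nested_p and the in-memory demo archive are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Qualified name of the XHTML 'p' element, as ElementTree spells it.
XHTML_P = "{http://www.w3.org/1999/xhtml}p"


def files_with_nested_p(epub_file):
    """Return names of .xhtml members in which a 'p' directly contains a 'p'."""
    offenders = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if not name.endswith(".xhtml"):
                continue
            root = ET.fromstring(archive.read(name))
            if any(child.tag == XHTML_P
                   for parent in root.iter(XHTML_P)
                   for child in parent):
                offenders.append(name)
    return offenders


def make_demo_epub():
    """Build a tiny in-memory archive (hypothetical content, not a full EPUB)."""
    bad = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
           '<p><p>x</p></p></body></html>')
    good = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
            '<p>x</p></body></html>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("OEBPS/bad.xhtml", bad)
        archive.writestr("OEBPS/good.xhtml", good)
    buf.seek(0)
    return buf


offenders = files_with_nested_p(make_demo_epub())
print(offenders)
```

To check a real file, pass its path instead of the demo archive, e.g. files_with_nested_p("story.epub") (the filename here is a placeholder).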

What is the expected output? What do you see instead?
The EPUB output is invalid. While it may work on some devices, it may fail on 
others.

What version of the product are you using? On what operating system?
I'm using current tip (26:54fc9b30ced5) on Python 2.6.4 (Ubuntu 9.10), plus the 
patches from issue 6, issue 7 and issue 8 (which don't affect this issue).

Please provide any additional information below.
The use of BeautifulSoup to clean the HTML has the side effect of causing some 
tags to nest. For instance, using FanFiction.net, the body of each chapter is 
contained in a 'div' element, which itself contains a series of 'p' elements. 
However, when this outermost 'div' element is renamed to a 'p' element, it 
invalidates the syntax, because 'p' elements cannot nest directly inside each 
other.
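
To illustrate the renaming problem, here is a minimal sketch. It uses Python 3's standard xml.etree.ElementTree rather than the project's BeautifulSoup code, and the sample markup and the helper name nested_p_count are made up for illustration.

```python
import xml.etree.ElementTree as ET


def nested_p_count(root):
    """Count 'p' elements whose direct parent is also a 'p'."""
    return sum(1
               for parent in root.iter("p")
               for child in parent
               if child.tag == "p")


# A chapter body shaped like the one described above: a 'div' wrapping 'p's.
chapter = ET.fromstring("<div><p>One.</p><p>Two.</p></div>")

before = nested_p_count(chapter)  # the 'div' wrapper is valid

chapter.tag = "p"                 # rename the outermost 'div' to 'p'
after = nested_p_count(chapter)   # both inner 'p's now sit inside a 'p'

print(before, after)
```

Leaving the wrapper as a 'div' (or unwrapping it) avoids the problem, since 'p' elements are allowed inside 'div'.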

Additionally, BeautifulStoneSoup, unlike BeautifulSoup, does not know that 'hr' 
and 'br' tags are self-closing (i.e., they shouldn't contain anything). It 
therefore extends each such tag (each 'hr', say) until the start of the next 
'hr' tag. Later, when 'hr' elements are converted to 'p' elements, we get 
nested 'p' elements, and therefore invalid XHTML, which causes EPUB validation 
to fail. Using BeautifulSoup instead causes these tags to auto-close, which 
removes that source of nested 'p' elements.
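
The effect can be reproduced with a toy model. The sketch below is NOT BeautifulSoup's or BeautifulStoneSoup's actual algorithm; it is a small tree builder on top of Python 3's standard html.parser, written only to show how a parser that is unaware of self-closing tags ends up putting content inside 'hr'. The class ToyTreeBuilder, the helper hr_child_counts, and the rule that a tag never nests inside itself are all assumptions of this sketch.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Toy tree builder for illustration; not BeautifulSoup's real logic."""

    def __init__(self, void_tags=()):
        super().__init__()
        self.void_tags = set(void_tags)   # tags known to be self-closing
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]          # chain of currently open nodes

    def _close_through(self, tag):
        # Close the innermost open `tag` and everything opened inside it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                return

    def handle_starttag(self, tag, attrs):
        # Assumption of this model: a tag never nests inside itself, so a
        # new <hr> implicitly closes a still-open <hr>.
        self._close_through(tag)
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.void_tags:
            self.stack.append(node)       # stays open until something closes it

    def handle_endtag(self, tag):
        self._close_through(tag)

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data.strip())


def hr_child_counts(markup, void_tags=()):
    """Parse `markup` and return the number of children of each <hr>."""
    builder = ToyTreeBuilder(void_tags)
    builder.feed(markup)
    builder.close()
    counts = []
    pending = [builder.root]
    while pending:
        node = pending.pop(0)
        for child in node["children"]:
            if isinstance(child, dict):
                if child["tag"] == "hr":
                    counts.append(len(child["children"]))
                pending.append(child)
    return counts


markup = "<div><hr><p>a</p><hr><p>b</p></div>"
unaware = hr_child_counts(markup)                # each 'hr' swallows the next 'p'
aware = hr_child_counts(markup, {"hr", "br"})    # each 'hr' stays empty
print(unaware, aware)
```

In the "unaware" case, renaming each 'hr' to 'p' would put a 'p' inside a 'p'; in the "aware" case the 'hr' elements are empty, so the rename cannot create nesting.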

On the FanFiction.net examples that I tested, at least, changing the parser 
used from BeautifulStoneSoup to BeautifulSoup and commenting out the code that 
changed 'br', 'hr' and 'div' elements to 'p' elements led to valid markup. Is 
there a reason that those elements were being changed? (There's a note in issue 
3 asking the same thing.)

The attached patch makes these changes (it also makes "allPs" actually refer 
only to 'p' elements); as I haven't run it against a full test suite, I can't 
guarantee that it doesn't break something else. I don't know why one would 
want to (for instance) get rid of all 'hr' elements, but if there's a reason, 
you might want to take this patch with a grain of salt.

Original issue reported on code.google.com by adam.buc...@gmail.com on 16 Sep 2010 at 8:15

GoogleCodeExporter commented 9 years ago
Using BeautifulSoup instead of BeautifulStoneSoup does improve epubcheck 
compliance, but there are still stories that violate the strict nesting rules 
even with BeautifulSoup. Plus, none of the readers I've tested have objected to 
the BeautifulStoneSoup output. So, without a stronger reason to, we're leery of 
swapping the HTML parsers.

The allPs loop has been replaced with an allTags loop that checks several 
different things.

Again, thank you for your help, and I apologize for not taking advantage of 
your patches.

Original comment by retiefj...@gmail.com on 16 Oct 2010 at 2:20