nprapps / copydoc

Like copytext, but for docs
MIT License

Copydoc does not handle horizontal rules correctly #11

Open jjelosua opened 8 years ago

jjelosua commented 8 years ago

Input:

<hr><p style="padding:0;margin:0;color:#000000;font-size:11pt;font-family:&quot;Arial&quot;;line-height:1.15;height:11pt;text-align:left"><span></span></p><p style="padding:0;margin:0;color:#000000;font-size:11pt;font-family:&quot;Arial&quot;;line-height:1.15;text-align:left"><span>-------&gt; DO NOT WRITE BELOW THE LINE &lt;-------</span></p>

Output:

<hr> <p> -------> DO NOT WRITE BELOW THE LINE <------- </p> </hr>
jjelosua commented 7 years ago

It looks like this is a BeautifulSoup4 problem:

Diagnostic running on Beautiful Soup 4.4.1
Python version 2.7.11 (default, Jan 22 2016, 08:28:37) [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
I noticed that html5lib is not installed. Installing it may help.
lxml is not installed or couldn't be imported.

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
 </head>
 <body style="background-color:#ffffff;padding:72pt 72pt 72pt 72pt;max-width:468pt">
  <hr>
   <p style='font-size:11pt;padding:0;font-family:"Arial";margin:0;color:#000000'>
    <span style="font-style:italic">
     NPR: eh-test-link the seventh with
    </span>
    <span style="font-style:italic;color:#1155cc;text-decoration:underline">
     <a href="https://www.google.com/url?q=http://www.npr.org/&amp;sa=D&amp;ust=1474862874391000&amp;usg=AFQjCNFqt0rLmuWX1Yt0VH_bsnt0UJmITg" style="color:inherit;text-decoration:inherit">
      link
     </a>
    </span>
    <span style="font-style:italic">
     and other things
    </span>
   </p>
  </hr>
 </body>
</html>
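
In the html.parser output above, `<hr>` is left open, so the paragraph that follows ends up nested inside it and a closing `</hr>` is emitted. As a rough illustration of the expected behaviour, here is a minimal sketch that uses only the standard library's `html.parser.HTMLParser` (Python 3, not the Python 2.7 / Beautiful Soup setup from the diagnostic) and treats void elements such as `hr` as having no children. `VOID_ELEMENTS`, `VoidAwareOutline` and `outline` are hypothetical names chosen for this example; they are not part of copydoc or Beautiful Soup.

```python
from html.parser import HTMLParser

# Elements that HTML defines as void: they never have children or a closing tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class VoidAwareOutline(HTMLParser):
    """Record the nesting depth of each tag without nesting inside void elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0
        self.outline = []  # list of (depth, label) tuples

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        # Only non-void elements can contain what follows.
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore stray closers such as </hr>; void elements were never opened.
        if tag in VOID_ELEMENTS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.outline.append((self.depth, "text: " + text))


def outline(markup):
    """Return (depth, label) pairs describing the structure of `markup`."""
    parser = VoidAwareOutline()
    parser.feed(markup)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    sample = "<hr><p><span>-------&gt; DO NOT WRITE BELOW THE LINE &lt;-------</span></p>"
    for depth, label in outline(sample):
        print("  " * depth + label)
```

With the input from this issue (styles omitted), `hr` and `p` both come out at depth 0, meaning the paragraph is a sibling of the rule rather than its child. Whether copydoc should fix this itself or rely on a different parser (the diagnostic suggests html5lib may help) is left open here.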