mailgun / talon

Apache License 2.0

Invalid HTML from extract_quotations. #154

Open nixypanda opened 6 years ago

nixypanda commented 6 years ago

Hi I was testing talon with some inputs and the following input: <div dir="ltr">ha ha ha <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px"> lol</span></div><div class="gmail_extra"><br><div class="gmail_quote">2017-09-04 18:08 GMT+05:30 Sherub Thakur <span dir="ltr">&lt;<a href="mailto:sherub.thakur@kayako.com" target="_blank">sherub.thakur@kayako.com</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">эй чувак, как ты </span>😁<span style="font-size:12.8px"> lol</span><br></div></blockquote></div><br></div>

does not lead to the following output. <html><head></head><body><div dir="ltr">ha ha ha <span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">&#x44D;&#x439; &#x447;&#x443;&#x432;&#x430;&#x43A;, &#x43A;&#x430;&#x43A; &#x442;&#x44B; </span>&#x1F601;<span style="font-size:12.8px">&#194;&#160;lol</span></div><div class="gmail_extra"><br><br></div></body></html>

but leads to this output <html><head></head><body><div dir="ltr">ha ha ha&#194;&#160;<span style="color:rgb(33,33,33);font-size:29px;white-space:pre-wrap">&#209;&#65533;&#286;&#185; &#209;&#8225;&#209;&#402;&#286;&#178;&#286;&#176;&#286;&#186;, &#286;&#186;&#286;&#176;&#286;&#186; &#209;&#8218;&#209;&#8249; </span>&#287;&#376;&#732;&#65533;<span style="font-size:12.8px">&#194;&#160;lol</span></div><div class="gmail_extra"><br><br></div></body></html>

That looks wrong. Is there something I am doing wrong here?

This is how I am using it: ohtml = quotations.extract_from_html(html).encode('utf-8')
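The character codes in the garbled output are consistent with the UTF-8 bytes of the input having been decoded with a single-byte Windows code page. The sketch below reproduces the same entity sequence for the Cyrillic text and the emoji. The choice of cp1254 is an inference from the observed codes (for example &#286; and &#287;), not something confirmed from talon's source, and the helper names misdecode and as_entities are hypothetical.

```python
def misdecode(text, wrong_codec="cp1254"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec.

    Bytes that the codec leaves undefined become U+FFFD (65533).
    The default codec is an assumption inferred from the observed output.
    """
    return text.encode("utf-8").decode(wrong_codec, errors="replace")


def as_entities(text):
    """Render every non-ASCII character as a decimal character reference."""
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in text)


# Same Cyrillic phrase and emoji as in the input above.
print(as_entities(misdecode("эй чувак")))
print(as_entities(misdecode("😁")))
```

If the printed references match the ones in the bad output, the problem is most likely an encoding guess made somewhere before or during parsing, rather than the final .encode('utf-8') call.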

nixypanda commented 6 years ago

https://github.com/mailgun/talon/pull/156 seems to have solved the issue. I am unsure whether there is a setting in lxml's html5parser that can achieve the same effect.