mailgun / talon

Apache License 2.0

Weird HTML entities in extract_from_html #146

Open hodak opened 7 years ago

hodak commented 7 years ago

Hi, I have a problem: talon returns strange HTML entities in the text when I use extract_from_html.

File I used to reproduce it

Here I use the Polish ł character:

quotations.extract_from_html('Napisał(a):\n<blockquote><span>x</span></blockquote>')

and I get this response:

<html><head></head><body>Napisa&#197;&#8218;(a):
</body></html>

These entities map to:

&#197;  => Å
&#8218; => ‚
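
One possible explanation, which nothing in this thread confirms: the UTF-8 bytes of ł (0xC5 0x82) are being decoded with a single-byte Windows-1252-style codec somewhere along the way, which yields exactly these two characters. The sketch below reproduces the entity pair with the standard library only. `misdecoded_entities` is a hypothetical helper written for illustration, not part of talon, and the choice of `cp1252` as the wrong codec is an assumption.

```python
# Hypothetical helper, not part of talon: simulates text that was encoded
# as UTF-8 and then decoded with the wrong single-byte codec.
def misdecoded_entities(text, wrong_encoding="cp1252"):
    """Return text with its mis-decoded non-ASCII characters as numeric entities."""
    # Assumption: the bytes are UTF-8 but get read as Windows-1252.
    garbled = text.encode("utf-8").decode(wrong_encoding)
    # Render every non-ASCII character as a decimal HTML entity.
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in garbled)


print(misdecoded_entities("Napisał(a):"))  # prints Napisa&#197;&#8218;(a):
```

If this is what happens, the bug is in how the input bytes are interpreted, not in the entity serialization itself.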

Even stranger, when I replace x with ł inside the blockquote, it responds with:

<html><head></head><body>Napisa&#322;(a):
</body></html>

where &#322; is indeed the entity for the ł character, which is what I would expect, so the text would display correctly on a website.

janwirth commented 2 years ago

I have the same issue. How did you solve it?

janwirth commented 2 years ago

I fixed it by encoding the string to bytes before passing it to talon, after reading this Stack Overflow post.

quotations.extract_from(email_message.html.encode("iso-8859-1"), 'text/html')

The output went from

<html><head></head><body><div dir="ltr">Yes, I got your email.&#194;&#160;<br></div><br></body></html>

to

<html><head></head><body><div dir="ltr">Yes, I got your email.&#160;<br></div><br></body></html>

The culprit &#194; is now gone.
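
A plausible reading of why this works, stated as an assumption and not something the thread confirms: the HTML string had been decoded as ISO-8859-1 although its bytes were UTF-8, so the non-breaking space (UTF-8 bytes 0xC2 0xA0) showed up as two characters, Â (194) and the non-breaking space itself (160). Encoding with ISO-8859-1 restores the original bytes. The sketch below shows that round trip using only the standard library. `repair_latin1_mojibake` is a hypothetical helper, not a talon function, and unlike the call above it decodes the bytes back to a string instead of handing bytes to talon.

```python
# Hypothetical helper, not part of talon: undoes a UTF-8 -> ISO-8859-1
# mis-decoding when the text can be round-tripped, otherwise leaves it alone.
def repair_latin1_mojibake(text):
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mis-decoded this way (e.g. it contains characters
        # such as ł that ISO-8859-1 cannot represent), so return it unchanged.
        return text


garbled = "Yes, I got your email.\u00c2\u00a0"  # "Â" followed by a non-breaking space
repaired = repair_latin1_mojibake(garbled)
print(len(garbled) - len(repaired))  # one character (the stray "Â") is removed
```

Note that a bare `.encode("iso-8859-1")` raises `UnicodeEncodeError` on a correctly decoded string containing characters outside ISO-8859-1, such as the ł from the original report, so this workaround only applies when the input really was mis-decoded.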