remram44 / internetpoints

Gives internet points to mailing-list repliers!
Apache License 2.0

Convert HTML to text #4

Closed remram44 closed 10 years ago

remram44 commented 10 years ago

Some emails might be HTML; we need to convert those to a readable text version.

remram44 commented 10 years ago

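BeautifulSoup is surprisingly bad at this: get_text() puts the same separator between every text node, so inline tags like <i> and block tags like <p> get treated alike. Any ideas?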

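from bs4 import BeautifulSoup

html = '<p>T<i>e</i>st <b>haha</b></p><p>Other\nline</p>'

# Name the parser explicitly so bs4 does not warn or pick a different one per machine.
soup = BeautifulSoup(html, 'html.parser')

# No separator: inline tags are fine, but the two paragraphs run together.
soup.get_text()
# 'Test hahaOther\nline'
# Space separator: the paragraphs are split, but so is "Test".
soup.get_text(' ')
# 'T e st  haha Other\nline'
# Newline separator: every text node lands on its own line.
soup.get_text('\n')
# 'T\ne\nst \nhaha\nOther\nline'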
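
For comparison, here is a rough standard-library sketch of what "readable" would mean for the sample above: break only at block-level tags and leave inline tags alone. The names BLOCK_TAGS, BlockAwareText and html_to_text are made up for illustration, and the tag list is an assumption rather than a complete one; it also does nothing special for script/style contents, links or lists.

```python
from html.parser import HTMLParser

# Assumption: a small hand-picked set of block-level tags; real HTML has more.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockAwareText(HTMLParser):
    """Collect text, starting a new paragraph only at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = [[]]

    def _break(self):
        # Open a new paragraph unless the current one is still empty.
        if self.paragraphs[-1]:
            self.paragraphs.append([])

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_data(self, data):
        # Inline tags such as <i> or <b> never reach _break, so their text
        # is glued to the surrounding text with no separator.
        self.paragraphs[-1].append(data)


def html_to_text(html):
    parser = BlockAwareText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs (including source newlines) inside each paragraph.
    paragraphs = (" ".join("".join(chunks).split()) for chunks in parser.paragraphs)
    return "\n\n".join(p for p in paragraphs if p)
```

On the sample html this should give 'Test haha\n\nOther line', which is roughly the shape of output wanted here, minus any markup for emphasis.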

remram44 commented 10 years ago

Aaron Swartz's html2text seems close enough.

from html2text import HTML2Text
HTML2Text().handle(html)
# 'T_e_st **haha**\n\nOther line\n\n'
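
A possible way to wire this into message handling, sketched with the standard-library email package: prefer a text/plain part when one exists and only convert when the body is HTML. This assumes messages arrive as raw strings; readable_body is a hypothetical helper, not something from this repository, and the converter is passed in as a callable so any HTML-to-text function can be plugged in.

```python
from email import message_from_string, policy


def readable_body(raw_message, html_to_text):
    """Return a text body, converting HTML only when no text/plain part exists.

    html_to_text is any callable taking an HTML string and returning text
    (hypothetical parameter name, chosen for this sketch).
    """
    msg = message_from_string(raw_message, policy=policy.default)
    # Ask for text/plain first and fall back to text/html.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    content = part.get_content()
    if part.get_content_type() == "text/html":
        return html_to_text(content)
    return content
```

With html2text installed, the handle method shown above could be passed directly, e.g. readable_body(raw, HTML2Text().handle).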