open-data / hnap

Script that converted hand-built HNAP XML samples to JSON-CKAN import samples

XML source is confused about text encoding. #1

Open wardi opened 9 years ago

wardi commented 9 years ago

e.g. in the XML:

<gco:CharacterString>TC - Canada&#226;&#8364;&#8482;s
    National Highway System</gco:CharacterString>

I used this sort of thing to clean it up:

from HTMLParser import HTMLParser
unescape = HTMLParser().unescape
confused = '''TC - Canada&#226;&#8364;&#8482;s
    National Highway System'''
print ' '.join(p.strip() for p in unescape(confused).encode('cp1252').decode('utf8').split(u'\n'))
TC - Canada’s National Highway System
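
The snippet above is Python 2 (`HTMLParser().unescape`, the `print` statement). A rough Python 3 equivalent of the same idea might look like the sketch below: unescape the numeric character references, then undo the cp1252-for-UTF-8 mix-up by re-encoding as cp1252 and decoding as UTF-8. The function name `fix_mojibake` and the fall-back to the unescaped text when the round trip fails are assumptions added for illustration, not part of the original script.

```python
import html


def fix_mojibake(text):
    """Repair UTF-8 text that was misread as cp1252 and then XML-escaped.

    Hypothetical helper: the name and the fallback behaviour are
    illustrative assumptions, not taken from the hnap repository.
    """
    # Turn numeric references such as &#226;&#8364;&#8482; into characters.
    unescaped = html.unescape(text)
    try:
        # Recover the original bytes, then decode them as UTF-8.
        repaired = unescaped.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Assumption: text that does not survive the round trip was
        # not double-encoded, so leave it as it is.
        repaired = unescaped
    # Join the wrapped lines with single spaces, as in the snippet above.
    return " ".join(part.strip() for part in repaired.split("\n"))


confused = """TC - Canada&#226;&#8364;&#8482;s
    National Highway System"""
print(fix_mojibake(confused))  # TC - Canada’s National Highway System
```

The `try`/`except` matters if the helper is run over every string in a file: correctly encoded text containing characters such as `é` would otherwise raise `UnicodeDecodeError` during the round trip.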