opengovernment / opengovernment-local

Legislative information from local governments in the United States
MIT License

pa-philadelphia María Quiñones-Sánchez page contains unicode chars #9

Closed rchekaluk closed 11 years ago

rchekaluk commented 11 years ago

Error when Billy scrapes this page:

02:05:18 INFO scrapelib: GET - http://philadelphiacitycouncil.net/council-members/councilwoman-maria-d-quinones-sanchez-7th-district/councilwoman-maria-d-quinones-sanchez-contact/
Traceback (most recent call last):
  File "/u/apps/virtualenvs/billy/src/billy/billy/ext/ansistrm.py", line 56, in emit
    stream.write(message)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 45: ordinal not in range(128)
Logged from file __init__.py, line 177
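
The failure can be reproduced in isolation. This is a minimal sketch in Python 3 (the u'\xf1' in the traceback suggests Billy was running on Python 2), assuming the log stream was encoding with the ASCII codec. The names `ascii_stream` and `try_write` are illustrative and are not Billy code:

```python
import io

# An in-memory byte buffer wrapped as an ASCII-only text stream. This stands in
# for a log stream whose encoding is ASCII (an assumption about the server).
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

# Part of the council member's name; \u00f1 is the n-with-tilde from the traceback.
message = "Qui\u00f1ones-S\u00e1nchez"


def try_write(stream, text):
    """Write text to stream; return None on success or the encode error raised."""
    try:
        stream.write(text)
        stream.flush()
        return None
    except UnicodeEncodeError as exc:
        return exc


error = try_write(ascii_stream, message)
if error is not None:
    # Report which codec failed and which character it could not encode.
    print(error.encoding, repr(error.object[error.start]), error.start)
```

The error object carries the codec name and the offending position, which is the same information shown in the traceback above.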

These are the obvious non-ASCII characters I can see on the page; indeed, the first one is Unicode code point U+00F1 (ñ).

I'm not certain, but I wonder whether Billy is constrained to the ASCII character set:

billy$ find . -type f -print | xargs grep ascii | egrep -v git
./billy/importers/bills.py:            r.encode('ascii', 'replace') for r in remaining]))
./billy/importers/committees.py:            logger.debug("No matches for %s" % member['name'].encode('ascii',
./billy/web/api/emitters.py:                           ensure_ascii=False)
./billy/web/api/emitters.py:        return obj.encode("ascii", "replace")
./billy/scrape/bills.py:        return filename.encode('ascii', 'replace')
./billy/scrape/legislators.py:        return filename.encode('ascii', 'replace')
./billy/scrape/legislators.py:        return filename.encode('ascii', 'replace')
./billy/utils/fulltext.py:        text = text.encode('ascii', 'ignore')
./billy/utils/fulltext.py:        text = text.decode('utf8', 'ignore').encode('ascii', 'ignore')
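
For context, the grep hits above all pass an explicit error handler ('replace' or 'ignore'), and those calls do not raise; the file named in the traceback, ansistrm.py, is not among them. A short Python 3 sketch of how the handlers behave on a name like this one (the variable names are illustrative):

```python
name = "Qui\u00f1ones-S\u00e1nchez"

# 'replace' substitutes a question mark for each character ASCII cannot represent.
replaced = name.encode("ascii", "replace")

# 'ignore' silently drops those characters.
ignored = name.encode("ascii", "ignore")

# UTF-8 can represent every character, so the round trip is lossless.
utf8_bytes = name.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")

# The default handler ('strict') is the one that raises UnicodeEncodeError.
try:
    name.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, ignored, round_tripped, strict_failed)
```

Both ASCII handlers lose information, so they hide the crash rather than preserve the name; encoding as UTF-8 keeps it intact.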

rchekaluk commented 11 years ago

Fixed in billy: https://github.com/opengovernment/billy/commit/8fd1500ec842840950606b28d1e2b8d74abcbaec

Reference: http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20/9942885#9942885
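
The commit itself is not reproduced here. As a rough illustration of the general approach in the linked answer (encode as UTF-8 instead of relying on an ASCII default), here is a Python 3 sketch that gives the logging handler a UTF-8 stream. The logger name and message are made up for the example and this is not the actual Billy change:

```python
import io
import logging

# Byte buffer standing in for the real log destination (assumption for the demo).
raw = io.BytesIO()

# Wrap it as a text stream that encodes with UTF-8 rather than ASCII.
utf8_stream = io.TextIOWrapper(raw, encoding="utf-8")

handler = logging.StreamHandler(utf8_stream)

logger = logging.getLogger("unicode_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# Logging a name with non-ASCII characters now succeeds.
logger.info("GET - contact page for %s", "Qui\u00f1ones-S\u00e1nchez")
handler.flush()

logged = raw.getvalue().decode("utf-8")
print(logged.strip())
```

With a UTF-8 stream the handler writes the name unchanged, so no error handler such as 'replace' or 'ignore' is needed.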