sc3 / cookcountyjail

A Django app that tracks the population of Cook County Jail over time and summarizes trends.
http://cookcountyjail.recoveredfactory.net/api/1.0/?format=json

Scraper failed with ASCII encoding error message #440

Closed nwinklareth closed 10 years ago

nwinklareth commented 10 years ago

Relevant part of logfile:

    DEBUG: 2014-06-06 08:30:09.353721 - Inmate: Updated inmate 2014-0604301
    Traceback (most recent call last):
      File "/home/ubuntu/.virtualenvs/cookcountyjail/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
        result = self._run(*self.args, **self.kwargs)
      File "/home/ubuntu/apps/cookcountyjail/scraper/concurrent_base.py", line 50, in _process_commands
        func(args)
      File "/home/ubuntu/apps/cookcountyjail/scraper/inmates.py", line 25, in _create_update_inmate
        self.__raw_inmate_data.add(args['inmate_details'])
      File "/home/ubuntu/apps/cookcountyjail/scraper/raw_inmate_data.py", line 56, in add
        self.__build_file_writer.writerow(inmate_info)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 115: ordinal not in range(128)
    <Greenlet at 0x32e8eb0: <bound method Inmates._process_commands of <scraper.inmates.Inmates object at 0x33de4d0>>> failed with UnicodeEncodeError

The character u'\xa0' is the non-breaking space, the value given to the HTML entity `&nbsp;`.

The Python 2 csv module does not support Unicode input. It implicitly encodes unicode values with the default ASCII codec, so every character must have an ordinal value between 0 and 127.
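The failure can be reproduced without the scraper. This is a minimal sketch: the helper name `can_encode_ascii` and the sample field value are made up for illustration and are not taken from the project. It performs explicitly the ASCII encode that the traceback shows failing inside `writerow`.

```python
# -*- coding: utf-8 -*-
NBSP = u'\xa0'  # non-breaking space, the character &nbsp; decodes to


def can_encode_ascii(text):
    """Return True if every character in text fits in ASCII (ordinals 0-127)."""
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical field value containing a non-breaking space.
sample = u'Room' + NBSP + u'101'

print(can_encode_ascii(u'Room 101'))  # ordinary space
print(can_encode_ascii(sample))       # non-breaking space
```

Any field for which this check fails would trigger the same UnicodeEncodeError when it reaches the writer.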

nwinklareth commented 10 years ago

The simplest fix is to have InmateDetails replace u'\xa0' with a space, which is what will be done. However, a better long-term solution should be found.
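A rough sketch of that replacement, assuming the inmate details reach the writer as a list of values. The function names `normalize_spaces` and `normalize_row` are hypothetical and do not come from InmateDetails.

```python
# -*- coding: utf-8 -*-
NBSP = u'\xa0'
TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def normalize_spaces(value):
    """Replace non-breaking spaces with ordinary spaces in text values.

    Non-text values (numbers, None) are returned unchanged.
    """
    if isinstance(value, TEXT_TYPE):
        return value.replace(NBSP, u' ')
    return value


def normalize_row(row):
    """Apply normalize_spaces to every field of a row before it is written."""
    return [normalize_spaces(value) for value in row]
```

This handles only the one character seen in the log. Any other non-ASCII character in the scraped data would still raise the same error, which is why it is a short-term fix.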

bepetersn commented 10 years ago

That's part of the court location. It needs to be normalized to a space character.

bepetersn commented 10 years ago

We could use unicodecsv (https://github.com/jdunck/python-unicodecsv), a drop-in replacement for the CSV module with Unicode support.
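If adding a dependency is not wanted, a manual alternative along the same lines is to encode every text field to UTF-8 before it reaches the Python 2 csv writer, so the implicit ASCII encode never happens. This is only a sketch: `encode_row` is a hypothetical helper, not part of the project or of unicodecsv, and it is relevant to Python 2 only.

```python
# -*- coding: utf-8 -*-
TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def encode_row(row, encoding='utf-8'):
    """Encode text fields to bytes; leave non-text fields unchanged."""
    return [
        value.encode(encoding) if isinstance(value, TEXT_TYPE) else value
        for value in row
    ]
```

Under this approach the writer call would become something like `writer.writerow(encode_row(inmate_info))`, and anything reading the file back would need to decode it as UTF-8.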

nwinklareth commented 10 years ago

Perhaps using that csv package is the long-term fix. However, that character, u'\xa0', has plagued us in a number of places, so for now I am going to convert it to an ordinary space in InmateDetails.

bepetersn commented 10 years ago

Closing for now.