sfsinger19103 / election_anomaly

See more current version at https://github.com/ElectionDataAnalysis/election_data_analysis/

Candidate.txt encoding issue #54

Open ericmtsai opened 4 years ago

ericmtsai commented 4 years ago

Ugly fail on encoding error in the Candidate.txt file; it should fail gracefully and allow the user to enter the encoding. We should also assume iso-8859-1, which, unlike utf-8, can decode the single-byte Spanish characters in these files.

Checking Candidate.txt
File Candidate.txt has just been created.
Enter information in the file, then hit return to continue.
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1130, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1254, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1269, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1459, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 13: invalid start byte
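
The byte named in the traceback can be reproduced in isolation. This is a minimal sketch using only Python's built-in codecs. The cp1252 comparison is an assumption about where the byte might have come from; the traceback does not confirm the file's real encoding.

```python
# The byte reported in the traceback above.
raw = b"\x96"

# utf-8: 0x96 is a continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_result = "decoded"
except UnicodeDecodeError as err:
    utf8_result = err.reason  # "invalid start byte"

# iso-8859-1 maps every byte to a code point, so it never raises;
# 0x96 becomes the C1 control character U+0096.
latin1_char = raw.decode("iso-8859-1")

# cp1252 (assumed here as a possible source encoding) maps 0x96 to an en dash.
cp1252_char = raw.decode("cp1252")

print(utf8_result, hex(ord(latin1_char)), hex(ord(cp1252_char)))
```

Note that iso-8859-1 decoding this byte without an error does not by itself mean the result is the character the file's author intended.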
ericmtsai commented 4 years ago

Capture this as part of the error checking? Not sure exactly what step this is in the process.

sfsinger19103 commented 4 years ago

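Yes, error-checking should include a check of the encoding. Use:

import pandas as pd
try:
    df = pd.read_csv('Candidate.txt', sep='\t')
except UnicodeDecodeError:
    # Reads raise UnicodeDecodeError (see traceback); retrying with iso-8859-1 is one option.
    df = pd.read_csv('Candidate.txt', sep='\t', encoding='iso-8859-1')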
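
The "fail gracefully and allow user to enter encoding" behavior requested in this issue could look roughly like the sketch below. It uses the standard-library `csv` module instead of pandas so that it runs without dependencies; the same try/except loop would wrap `pd.read_csv`. The function name `read_tab_delimited` and its `default_encoding` and `ask` parameters are hypothetical and not taken from the project's code.

```python
import csv


def read_tab_delimited(path, default_encoding="iso-8859-1", ask=input):
    """Read a tab-separated file, asking for another encoding on failure.

    Hypothetical helper for illustration. Returns (rows, encoding_used).
    `ask` is the prompt function, injectable so the loop can be tested.
    """
    encoding = default_encoding
    while True:
        try:
            with open(path, newline="", encoding=encoding) as f:
                # list() forces the whole file to be decoded inside the try.
                return list(csv.reader(f, delimiter="\t")), encoding
        except UnicodeDecodeError as err:
            print(f"Could not read {path} as {encoding}: {err.reason}")
        except LookupError:
            # open() raises LookupError for a codec name Python does not know.
            print(f"Unknown encoding: {encoding}")
        encoding = ask("Enter the file's encoding (blank to give up): ").strip()
        if not encoding:
            raise ValueError(f"No usable encoding given for {path}")
```

Called as `rows, used = read_tab_delimited("Candidate.txt")`, this would replace the raw traceback with a short message and a prompt. It only catches the errors that are raised; a wrong encoding that decodes without error is not detected by this loop.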
sfsinger19103 commented 4 years ago

Since we now support English and Spanish letters with the current default encoding, iso-8859-1, my thinking is that this will be fine for the beta release. I plan to push #54 off at least until I have the other issues addressed.

Stephanie 11:40 AM Sounds reasonable. I’m curious: is the difficulty in the interactive piece allowing the user to specify the encoding? If not, where is the difficulty?

eric 11:43 AM ... the larger issue is that these encoding errors can either fail silently OR throw an error. If they fail silently, the text gets written to the database as junk letters/characters. So it will be challenging to detect both possibilities. Additionally, from what I've read, it's challenging to detect the actual file encoding at run time. So this all led me to put it aside for now.
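
The two failure modes described above can be shown on a single name. In this sketch, `decode_with_fallback` is a hypothetical helper, and trying strict utf-8 before falling back to iso-8859-1 is an assumed heuristic and not something the thread settled on. It relies on the fact that iso-8859-1 accepts any byte sequence while utf-8 rejects most non-utf-8 input.

```python
def decode_with_fallback(raw, fallback="iso-8859-1"):
    """Return (text, encoding_used), trying strict utf-8 before the fallback.

    Hypothetical helper: a heuristic, not real encoding detection.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


name = "N\u00fa\u00f1ez"  # "Núñez"
utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("iso-8859-1")

# Silent failure: utf-8 bytes read as iso-8859-1 decode without error,
# but each accented letter turns into two junk characters.
silent_junk = utf8_bytes.decode("iso-8859-1")

# Loud failure: iso-8859-1 bytes read as utf-8 raise UnicodeDecodeError.
try:
    latin1_bytes.decode("utf-8")
    loud_failure = False
except UnicodeDecodeError:
    loud_failure = True

# Trying utf-8 first recovers the name from either byte string.
from_utf8 = decode_with_fallback(utf8_bytes)
from_latin1 = decode_with_fallback(latin1_bytes)
print(silent_junk, loud_failure, from_utf8, from_latin1)
```

The ordering matters: with iso-8859-1 tried first, nothing ever raises, so a utf-8 file would always take the silent path.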

Stephanie 11:46 AM If the junk characters are consistent, that might not constitute failure. I can live with a system that reports something like “138,392 votes for Maria Nu!@#$@ez in Congressional District 5”, since a user with local knowledge would likely be able to parse the answer.

sfsinger19103 commented 4 years ago

Since the code now assumes iso-8859-1, there is no need to address this further for the Beta Release.