neuniversity / ALY6140

How to solve 'utf-8' error with read_csv function? #33

Open aanranran opened 5 years ago

aanranran commented 5 years ago

I cannot load one of my datasets in Jupyter Notebook. When I call pandas' "read_csv", it fails with the following error:


UnicodeDecodeError Traceback (most recent call last)

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 6: invalid continuation byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError Traceback (most recent call last)

in ()
----> 1 MisleInjury.txt = pd.read_csv('MisleInjury.txt',header=None,sep='\t')

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676 skip_blank_lines=skip_blank_lines)
    677
--> 678 return _read(filepath_or_buffer, kwds)
    679
    680 parser_f.__name__ = name

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    444
    445 try:
--> 446 data = parser.read(nrows)
    447 finally:
    448 parser.close()

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1034 raise ValueError('skipfooter not supported for iteration')
   1035
-> 1036 ret = self._engine.read(nrows)
   1037
   1038 # May alter columns / col_dict

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1846 def read(self, nrows=None):
   1847 try:
-> 1848 data = self._reader.read(nrows)
   1849 except StopIteration:
   1850 if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 6: invalid continuation byte

After doing some research, I changed my code as shown below, but it still doesn't work. Please check the following code and error:

MisleInjury.txt = pd.read_csv('MisleInjury.txt',encoding = "utf-8")

---------------------------------------------------------------------------
ParserError Traceback (most recent call last)

in ()
----> 1 MisleInjury.txt = pd.read_csv('MisleInjury.txt',encoding = "utf-8")

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676 skip_blank_lines=skip_blank_lines)
    677
--> 678 return _read(filepath_or_buffer, kwds)
    679
    680 parser_f.__name__ = name

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    444
    445 try:
--> 446 data = parser.read(nrows)
    447 finally:
    448 parser.close()

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1034 raise ValueError('skipfooter not supported for iteration')
   1035
-> 1036 ret = self._engine.read(nrows)
   1037
   1038 # May alter columns / col_dict

/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1846 def read(self, nrows=None):
   1847 try:
-> 1848 data = self._reader.read(nrows)
   1849 except StopIteration:
   1850 if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

How can I solve this error? I have attached the original dataset to the email. Could you please help me with this? Thank you very much!

ThatkidfromA commented 5 years ago

Hi,

I do not fully understand the problem, but you should check the pandas documentation and its examples. In your code above, MisleInjury.txt = pd.read_csv('MisleInjury.txt',encoding = "utf-8"), you are reading a .txt file with the pd.read_csv function. I think you should use open() instead.

I hope the article below can enlighten you. https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

NEUHU commented 5 years ago

I think the problem you encountered is similar to the one Sailu posted previously. UTF-8 stands for Unicode Transformation Format, 8-bit. You get this error because the file contains bytes that cannot be interpreted with the default encoding, UTF-8. Try setting the "encoding" argument of read_csv and see if this helps.
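
To illustrate why a different encoding can help, here is a minimal sketch using only the Python standard library. The byte string is made up for illustration (it is not taken from MisleInjury.txt); it assumes the file was saved as ISO-8859-1 (Latin-1), where the byte 0xd1 is the letter Ñ.

```python
# Made-up bytes for illustration: "CATALUÑA" stored as ISO-8859-1 (Latin-1),
# where Ñ is the single byte 0xd1. This is an assumption, not the real file.
raw = b"CATALU\xd1A"

try:
    raw.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xd1 announces a two-byte sequence, but the next byte ("A")
    # is not a valid continuation byte, so decoding fails here.
    error_position = exc.start
    print(exc)

# Latin-1 maps every single byte to a character, so the same bytes decode.
text = raw.decode("iso-8859-1")
print(text)
```

Latin-1 never raises a decode error because all 256 byte values are defined, so a successful read does not prove the file really is Latin-1. It is worth checking a few accented values in the result.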

aanranran commented 5 years ago

Thanks everyone for your replies. I have solved my problem by changing my code as following:

MisleInjury = pd.read_csv('MisleInjury.txt', encoding="ISO-8859-1", error_bad_lines=False, sep='\t', header=None)

This not only solved both the utf-8 error and the tokenizing error, but also split the text into columns. I hope my example is useful to other classmates who run into the same situation.
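
For anyone who wants to reproduce both errors without the original file, the following is a small sketch on made-up data. The rows are hypothetical stand-ins for MisleInjury.txt: tab-separated, Latin-1 encoded, with a comma inside a field on the third line.

```python
import io

import pandas as pd

# Hypothetical stand-in for MisleInjury.txt (not the real data):
# tab-separated, ISO-8859-1 encoded, with a comma in a field on line 3.
raw = (
    b"1\tCATALU\xd1A\tminor\n"
    b"2\tMADRID\tserious\n"
    b"3\tLE\xd3N, NORTH\tminor\n"
)

# With the default sep=",", lines 1 and 2 look like a single field each,
# while line 3 contains a comma and looks like two fields.
try:
    pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
    tokenizing_failed = False
except pd.errors.ParserError as exc:
    tokenizing_failed = True
    print(exc)

# Passing the real separator and the file's encoding fixes both problems.
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1", sep="\t", header=None)
print(df)
```

Once sep='\t' matches the file, skipping bad lines may no longer be needed. Note also that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; on newer versions the equivalent is on_bad_lines='skip'.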

pr24 commented 5 years ago

df = pd.read_csv("datalink", encoding='latin-1')