scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License

Unicode error in pytest for Hindi, Spanish and Russian on Windows. #66

Status: Closed (AmPhIbIaN26 closed 2 years ago)

AmPhIbIaN26 commented 3 years ago

To reproduce this, run pytest on Windows. The traceback below is taken from test_language_hi.py:

tests\test_language_hi.py:42 (test_parse_number_till_hundred)
def test_parse_number_till_hundred():
>       _test_files(HUNDREDS_DIRECTORY, LANG)

test_language_hi.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
__init__.py:25: in _test_files
    for row in csv_reader:
~\Python\Python39\lib\csv.py:110: in __next__
    self.fieldnames
~\Python\Python39\lib\csv.py:97: in fieldnames
    self._fieldnames = next(self.reader)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <encodings.cp1252.IncrementalDecoder object at 0x000001743C3E54F0>
input = b'number,text\n0,\xe0\xa4\xb6\xe0\xa5\x82\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\n1,\xe0\xa4\x8f\xe0\xa4\x95\n2,\xe0\xa4\...x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xac\xe0\xa5\x87\n100,\xe0\xa4\x8f\xe0\xa4\x95 \xe0\xa4\xb8\xe0\xa5\x8c'
final = False

    def decode(self, input, final=False):
>       return codecs.charmap_decode(input,self.errors,decoding_table)[0]
E       UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 25: character maps to <undefined>

~\Python\Python39\lib\encodings\cp1252.py:23: UnicodeDecodeError
FAILED               [100%]
tests\test_language_hi.py:46 (test_parse_number_permutations)
def test_parse_number_permutations():
>       _test_files(PERMUTATION_DIRECTORY, LANG)

test_language_hi.py:48: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
__init__.py:25: in _test_files
    for row in csv_reader:
~\Python\Python39\lib\csv.py:110: in __next__
    self.fieldnames
~\Python\Python39\lib\csv.py:97: in fieldnames
    self._fieldnames = next(self.reader)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <encodings.cp1252.IncrementalDecoder object at 0x000001743C477700>
input = b'number,text\r\n1234,\xe0\xa4\x8f\xe0\xa4\x95 \xe0\xa4\xb9\xe0\xa4\x9c\xe0\xa4\xbe\xe0\xa4\xb0 \xe0\xa4\xa6\xe0\xa5\x...0\xa4\xb8\xe0\xa5\x8c \xe0\xa4\xaa\xe0\xa5\x88\xe0\xa4\x82\xe0\xa4\xa4\xe0\xa4\xbe\xe0\xa4\xb2\xe0\xa5\x80\xe0\xa4\xb8'
final = False

    def decode(self, input, final=False):
>       return codecs.charmap_decode(input,self.errors,decoding_table)[0]
E       UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 20: character maps to <undefined>

~\Python\Python39\lib\encodings\cp1252.py:23: UnicodeDecodeError
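
The first failure above can probably be reproduced on any platform without the test suite by decoding the start of the same byte string with cp1252 directly. This is only a minimal sketch: the bytes are copied from the `input` value in the first traceback (header plus the first data row), and the variable names are made up for illustration.

```python
# Header and first data row, copied from the first traceback above.
raw = b'number,text\n0,\xe0\xa4\xb6\xe0\xa5\x82\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\n'

# The bytes are valid UTF-8, so this decodes cleanly.
decoded_utf8 = raw.decode('utf8')

# cp1252 has no character assigned to byte 0x8d, so decoding fails there.
try:
    raw.decode('cp1252')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

if failure is not None:
    # Report the offending byte and its offset, as in the traceback.
    print(hex(raw[failure.start]), failure.start)
```
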

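This only happens on Windows, not Linux: without an explicit encoding, open() falls back to the locale's default encoding, which is cp1252 here (see the traceback) rather than UTF-8. It can be solved by passing encoding='utf8' to the open() call on line 23 of __init__.py.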
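
A rough sketch of what the suggested fix might look like. The actual body of `_test_files` is not shown in this issue, so `read_test_rows` is a hypothetical stand-in for the part of it that opens the CSV file; the traceback only confirms that the rows are read through `csv.DictReader`.

```python
import csv


def read_test_rows(csv_path):
    """Hypothetical helper: read one test CSV file as a list of dict rows.

    Passing encoding explicitly makes the result independent of the
    platform's locale encoding (cp1252 on the Windows machine above).
    """
    # newline='' is what the csv module documentation recommends for files
    # handed to csv readers.
    with open(csv_path, encoding='utf8', newline='') as csv_file:
        return list(csv.DictReader(csv_file))
```

With the encoding fixed in the call, the same test data should decode identically on Windows and Linux, for example `read_test_rows('some_file.csv')` where the path is whatever CSV the test directory contains.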