sinaahmadi / klpt

The Kurdish Language Processing Toolkit
https://sinaahmadi.github.io/klpt/

Specify encoding as utf-8 explicitly when opening files #15

Closed mhmd-azeez closed 2 years ago

mhmd-azeez commented 2 years ago

Background:

I am trying to use KLPT from .NET! I am using https://github.com/pythonnet/pythonnet to do that and it works wonderfully. Look!

[screenshot: KLPT being called from .NET via pythonnet]

The problem I have is that whenever the Python code wants to read from a file, it encounters an encoding error:

''charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>'

My understanding is this might be because Python's default file encoding on Windows is the locale code page (cp1252) rather than UTF-8, while the data files are in UTF-8.

Fix:

The fix is very simple: since the data files are all UTF-8, we need to pass 'utf-8' as the encoding every time we open one.
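A minimal sketch of the change (the file path, helper name, and data are illustrative, not KLPT's actual code):

```python
import json
import os
import tempfile

def load_json_utf8(path):
    # Pinning the encoding makes the read independent of the platform's
    # locale encoding (cp1252 on many Windows setups, which cannot
    # decode arbitrary UTF-8 bytes such as 0x90).
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Round-trip a file containing Kurdish text to show that the explicit
# encoding handles non-ASCII content on any platform.
path = os.path.join(tempfile.mkdtemp(), "map.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"lemma": "پاڵاوتن"}, f, ensure_ascii=False)

print(load_json_utf8(path)["lemma"])
```

Without the `encoding="utf-8"` argument, the same read would raise a `UnicodeDecodeError` on a cp1252 locale, exactly as in the traceback above.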

sinaahmadi commented 2 years ago

Thanks for the pull request, Muhammad. The code looks fine to me, but I am wondering whether you ran the unit tests. Did they all pass?

mhmd-azeez commented 2 years ago

Hahaha, I am embarrassed, but I had forgotten to run the tests. On Linux, test_stem fails:

mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python tests/test_preprocess.py

Command 'python' not found, did you mean:

  command 'python3' from deb python3
  command 'python' from deb python-is-python3

mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_preprocess.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.040s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.006s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_stem.py
F
======================================================================
FAIL: test_analyze (__main__.TestStem)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_stem.py", line 41, in test_analyze
    self.assertEqual(stemmer.lemmatize(test_case), self.test_cases["lemmatize"][dialect][script][test_case])
AssertionError: Lists differ: ['پاڵاوتن', 'پاڵافتن'] != ['پاڵافتن', 'پاڵاوتن']

First differing element 0:
'پاڵاوتن'
'پاڵافتن'

- ['پاڵاوتن', 'پاڵافتن']
+ ['پاڵافتن', 'پاڵاوتن']

----------------------------------------------------------------------
Ran 1 test in 0.080s

FAILED (failures=1)
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.184s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.020s

OK

On Windows, multiple tests fail:

Mo in D:\code\klpt on master λ python .\tests\test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.003s

OK
Mo in D:\code\klpt on master λ python .\tests\test_preprocess.py
EEEE
======================================================================
ERROR: test_normalizer (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\code\klpt\tests\test_preprocess.py", line 28, in test_normalizer
    prep = Preprocess(dialect, script)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
    self.preprocess_map = json.load(preprocess_file)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>

======================================================================
ERROR: test_standardizer (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\code\klpt\tests\test_preprocess.py", line 37, in test_standardizer
    prep = Preprocess(dialect, script)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
    self.preprocess_map = json.load(preprocess_file)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>

======================================================================
ERROR: test_stopwords (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\code\klpt\tests\test_preprocess.py", line 53, in test_stopwords
    prep = Preprocess(dialect, script)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
    self.preprocess_map = json.load(preprocess_file)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>

======================================================================
ERROR: test_unify_numerals (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\code\klpt\tests\test_preprocess.py", line 45, in test_unify_numerals
    prep = Preprocess("Sorani", "Latin", numeral)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
    self.preprocess_map = json.load(preprocess_file)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>

----------------------------------------------------------------------
Ran 4 tests in 0.008s

FAILED (errors=4)
Mo in D:\code\klpt on master λ python .\tests\test_stem.py
F
======================================================================
FAIL: test_analyze (__main__.TestStem)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\code\klpt\tests\test_stem.py", line 41, in test_analyze
    self.assertEqual(stemmer.lemmatize(test_case), self.test_cases["lemmatize"][dialect][script][test_case])
AssertionError: Lists differ: ['پاڵاوتن', 'پاڵافتن'] != ['پاڵافتن', 'پاڵاوتن']

First differing element 0:
'پاڵاوتن'
'پاڵافتن'

- ['پاڵاوتن', 'پاڵافتن']
+ ['پاڵافتن', 'پاڵاوتن']

----------------------------------------------------------------------
Ran 1 test in 0.100s

FAILED (failures=1)
Mo in D:\code\klpt on master λ python .\tests\test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.139s

OK
Mo in D:\code\klpt on master λ python .\tests\test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
E
======================================================================
ERROR: test_transliterator (__main__.TestTransliterator)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\code\klpt\tests\test_transliterate.py", line 37, in test_transliterator
    wergor = Transliterate("Sorani", source_script, target_script, numeral=numeral)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\transliterate.py", line 113, in __init__
    self.prep = Preprocess("Sorani", "Latin", numeral="Latin")
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
    self.preprocess_map = json.load(preprocess_file)
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>

----------------------------------------------------------------------
Ran 1 test in 0.003s

FAILED (errors=1)

I will investigate to see what's going on. I'll also probably send another PR that adds a GitHub Action to run the tests on every push, so this won't be necessary in the future.

mhmd-azeez commented 2 years ago

Okay, so I took a look at the test that fails on Linux, and it's just the order of the results that differs. Correct me if I am wrong, but we don't care about the order of the lemmas, right? So I changed the unit tests to use assertCountEqual instead of assertEqual, and all tests now pass on Linux:

mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_preprocess.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.025s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_stem.py
.
----------------------------------------------------------------------
Ran 1 test in 1.229s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.159s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.013s

OK
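For reference, the assertCountEqual swap looks roughly like this (the test class and data here are illustrative, not the actual test_stem.py):

```python
import unittest

class TestLemmatizeOrder(unittest.TestCase):
    def test_order_insensitive(self):
        expected = ["پاڵافتن", "پاڵاوتن"]
        actual = ["پاڵاوتن", "پاڵافتن"]  # same lemmas, different order
        # assertEqual would fail here because the list order differs;
        # assertCountEqual compares elements regardless of order
        # (each element must appear the same number of times).
        self.assertCountEqual(actual, expected)

if __name__ == "__main__":
    unittest.main()
```

This makes the test robust to the platform-dependent ordering seen in the Linux failure above.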

I'll investigate Windows next

mhmd-azeez commented 2 years ago

So after much frustration and debugging, I re-read the Windows errors and realized they were coming from the pip-installed package (which doesn't specify the encoding), not from the source checkout 😅. After I uninstalled the pip package, everything started working again. This is what I get for not using virtual environments, hahaha. It all seems to be working for me, and I think we are good to go 👍:

Running on linux:

mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_configuration.py 
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_preprocess.py 
....
----------------------------------------------------------------------
Ran 4 tests in 0.027s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_stem.py 
.
----------------------------------------------------------------------
Ran 1 test in 1.362s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.180s

OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.013s

OK

Running on Windows:

Mo in D:\code\klpt on master ● λ python .\tests\test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
Mo in D:\code\klpt on master ● λ python .\tests\test_preprocess.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.019s

OK
Mo in D:\code\klpt on master ● λ python .\tests\test_stem.py
.
----------------------------------------------------------------------
Ran 1 test in 1.230s

OK
Mo in D:\code\klpt on master ● λ python .\tests\test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.151s

OK
Mo in D:\code\klpt on master ● λ python .\tests\test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.008s

OK
mhmd-azeez commented 2 years ago

Btw, the test action is ready (you can see the results here: https://github.com/DevelopersTree/klpt/actions/runs/1673406456). I will send you another PR once you merge this one. That way, you won't have to worry about running the tests manually again. Later on, we can also think about automatically publishing to PyPI on every push to master, but that's a topic for another day.
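A workflow of the kind described might look like this (the file name, Python version, and steps are assumptions for illustration, not the contents of the actual PR):

```yaml
# .github/workflows/test.yml (hypothetical sketch)
name: tests
on: [push]
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # Run on both platforms, since this thread hit OS-specific failures.
        os: [ubuntu-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      - run: pip install .
      - run: python -m unittest discover -s tests
```

The Windows entry in the matrix is the important part: it would have caught the cp1252 decoding errors from this issue automatically.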