Closed: mhmd-azeez closed this pull request 2 years ago
Thanks for the pull request, Muhammad. Although the code looks fine to me, I am wondering if you ran the unit tests. Did they all pass?
Hahaha, I am embarrassed, but I had forgotten to run the tests. On Linux, test_stem fails:
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python tests/test_preprocess.py
Command 'python' not found, did you mean:
command 'python3' from deb python3
command 'python' from deb python-is-python3
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_preprocess.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.040s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.006s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_stem.py
F
======================================================================
FAIL: test_analyze (__main__.TestStem)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_stem.py", line 41, in test_analyze
self.assertEqual(stemmer.lemmatize(test_case), self.test_cases["lemmatize"][dialect][script][test_case])
AssertionError: Lists differ: ['پاڵاوتن', 'پاڵافتن'] != ['پاڵافتن', 'پاڵاوتن']
First differing element 0:
'پاڵاوتن'
'پاڵافتن'
- ['پاڵاوتن', 'پاڵافتن']
+ ['پاڵافتن', 'پاڵاوتن']
----------------------------------------------------------------------
Ran 1 test in 0.080s
FAILED (failures=1)
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.184s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.020s
OK
On Windows multiple tests fail:
Mo in D:\code\klpt on master λ python .\tests\test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.003s
OK
Mo in D:\code\klpt on master λ python .\tests\test_preprocess.py
EEEE
======================================================================
ERROR: test_normalizer (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\code\klpt\tests\test_preprocess.py", line 28, in test_normalizer
prep = Preprocess(dialect, script)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
self.preprocess_map = json.load(preprocess_file)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>
======================================================================
ERROR: test_standardizer (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\code\klpt\tests\test_preprocess.py", line 37, in test_standardizer
prep = Preprocess(dialect, script)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
self.preprocess_map = json.load(preprocess_file)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>
======================================================================
ERROR: test_stopwords (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\code\klpt\tests\test_preprocess.py", line 53, in test_stopwords
prep = Preprocess(dialect, script)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
self.preprocess_map = json.load(preprocess_file)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>
======================================================================
ERROR: test_unify_numerals (__main__.TestPreprocess)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\code\klpt\tests\test_preprocess.py", line 45, in test_unify_numerals
prep = Preprocess("Sorani", "Latin", numeral)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
self.preprocess_map = json.load(preprocess_file)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>
----------------------------------------------------------------------
Ran 4 tests in 0.008s
FAILED (errors=4)
Mo in D:\code\klpt on master λ python .\tests\test_stem.py
F
======================================================================
FAIL: test_analyze (__main__.TestStem)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\code\klpt\tests\test_stem.py", line 41, in test_analyze
self.assertEqual(stemmer.lemmatize(test_case), self.test_cases["lemmatize"][dialect][script][test_case])
AssertionError: Lists differ: ['پاڵاوتن', 'پاڵافتن'] != ['پاڵافتن', 'پاڵاوتن']
First differing element 0:
'پاڵاوتن'
'پاڵافتن'
- ['پاڵاوتن', 'پاڵافتن']
+ ['پاڵافتن', 'پاڵاوتن']
----------------------------------------------------------------------
Ran 1 test in 0.100s
FAILED (failures=1)
Mo in D:\code\klpt on master λ python .\tests\test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.139s
OK
Mo in D:\code\klpt on master λ python .\tests\test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
E
======================================================================
ERROR: test_transliterator (__main__.TestTransliterator)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\code\klpt\tests\test_transliterate.py", line 37, in test_transliterator
wergor = Transliterate("Sorani", source_script, target_script, numeral=numeral)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\transliterate.py", line 113, in __init__
self.prep = Preprocess("Sorani", "Latin", numeral="Latin")
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\site-packages\klpt\preprocess.py", line 69, in __init__
self.preprocess_map = json.load(preprocess_file)
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Users\Mo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 344: character maps to <undefined>
----------------------------------------------------------------------
Ran 1 test in 0.003s
FAILED (errors=1)
I will investigate to see what's going on. I'll also probably send another PR to add a GitHub Action that runs the tests on every push, so this won't be necessary in the future.
Okay, so I took a look at the test that fails on Linux, and it's just the order of the results that differs. Correct me if I'm wrong, but we don't care about the order of the lemmas, right? So I changed the unit tests to use assertCountEqual instead, and all tests pass on Linux:
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_preprocess.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.025s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_stem.py
.
----------------------------------------------------------------------
Ran 1 test in 1.229s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.159s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.013s
OK
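To illustrate the change (a minimal sketch, not the exact code from test_stem.py), this is the difference between the two assertions:

```python
import unittest

# assertEqual on lists is order-sensitive; assertCountEqual only checks
# that both sequences contain the same elements (with multiplicity).
# This standalone TestCase instance is just for demonstration.
check = unittest.TestCase()

lemmas = ["پاڵاوتن", "پاڵافتن"]    # what the stemmer returned
expected = ["پاڵافتن", "پاڵاوتن"]  # what the test data lists

# This is what used to fail: same lemmas, different order.
# check.assertEqual(lemmas, expected)   # would raise AssertionError

# The order-insensitive comparison the tests use now:
check.assertCountEqual(lemmas, expected)
print("ok")
```

Note that assertCountEqual still catches missing or duplicated lemmas; it only ignores ordering.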
I'll investigate Windows next
So after much frustration and debugging, I re-read the Windows errors and realized they were coming from the pip-installed package (which doesn't specify the encoding), not from the source code 😅. After I uninstalled the pip package, everything started working again. This is what I get for not using virtual environments hahaha. It all seems to be working for me and I think we are good to go 👍:
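A quick way to catch this kind of mix-up in the future is to check where a module is actually loaded from (a generic sketch; I use json here just to illustrate, substitute klpt):

```python
import importlib

# __file__ tells you which copy of a module Python imported.
# If it points into site-packages rather than your working tree,
# you are testing the installed package, not your local changes.
mod = importlib.import_module("json")  # substitute "klpt" here
print(mod.__file__)
```

Running this from the repo root before the tests would have shown the site-packages path immediately.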
Running on linux:
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_preprocess.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.027s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_stem.py
.
----------------------------------------------------------------------
Ran 1 test in 1.362s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.180s
OK
mo@DESKTOP-GFTK0J9:/mnt/d/code/klpt$ python3 tests/test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.013s
OK
Running on Windows:
Mo in D:\code\klpt on master ● λ python .\tests\test_configuration.py
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
Mo in D:\code\klpt on master ● λ python .\tests\test_preprocess.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.019s
OK
Mo in D:\code\klpt on master ● λ python .\tests\test_stem.py
.
----------------------------------------------------------------------
Ran 1 test in 1.230s
OK
Mo in D:\code\klpt on master ● λ python .\tests\test_tokenize.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.151s
OK
Mo in D:\code\klpt on master ● λ python .\tests\test_transliterate.py
arabic_to_latin Arabic
arabic_to_latin Farsi
arabic_to_latin Latin
latin_to_arabic Latin
.
----------------------------------------------------------------------
Ran 1 test in 0.008s
OK
Btw, the test action is ready (you can see the results here: https://github.com/DevelopersTree/klpt/actions/runs/1673406456). I will send you another PR once you merge this one. That way, you won't have to worry about running the tests again. Later on we can also think about automatically publishing to PyPI when you push to master, but that's a topic for another day.
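For the record, the workflow in that PR is along these lines (an illustrative sketch only; the file name, action versions, and test command here are assumptions, and the actual PR may differ):

```yaml
# .github/workflows/test.yml (hypothetical sketch)
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # Run on both platforms, since this issue only showed up on Windows
        os: [ubuntu-latest, windows-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - run: pip install -e .
      - run: python -m unittest discover tests
```

Running the matrix on both Linux and Windows is the important part, since the encoding bug above was Windows-only.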
Background:
I am trying to use KLPT from .NET! I am using https://github.com/pythonnet/pythonnet to do that and it works wonderfully. Look!
The problem I have is that whenever the Python code wants to read from a file, it encounters an encoding problem:
My understanding is that this might be because the default encoding in .NET is UTF-16 while the files are in UTF-8.
Fix:
The fix is very simple: since the data files are all UTF-8, every time we open a file, we need to pass 'utf-8' explicitly as the encoding.