jsvine / markovify

A simple, extensible Markov chain generator.
MIT License
3.3k stars 349 forks

Test fails straight out of the box #126

Open JGCoelho opened 4 years ago

JGCoelho commented 4 years ago

I cloned the repository and tried running the unittest test.test_itertext. This test doesn't require setting up the sherlock model: it reads the text files that ship with the package and builds the models inside the test, so I didn't supply any input of my own. The error I keep getting is this:

(base) C:\Users\JGC\Desktop\Trabalhos\Python\markovify>python -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 25, in test_from_json_without_retaining 
    original_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 37, in test_from_mult_files_without_retaining
    models.append(markovify.Text(f, retain_original=False))
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 18, in test_without_retaining
    senate_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

----------------------------------------------------------------------
Ran 4 tests in 0.725s

FAILED (errors=3)

Running a conda 3.7.6 environment on Windows 10.

jsvine commented 4 years ago

Thanks for flagging @JGCoelho. Judging by the error messages, this seems to be an issue with character encoding — possibly tied to Windows and/or Anaconda, but it's hard to tell. If you run the tests with a standard Python installation, instead of Anaconda, do you get the same problem? And can anyone else replicate these errors?
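One quick way to narrow this down is to print the encoding Python falls back to when a text file is opened without an explicit `encoding` argument. This is only a diagnostic sketch, not part of markovify; the variable name is made up for illustration.

```python
import locale
import sys

# On Python 3.7/3.8, open() without an explicit encoding uses the value
# returned by locale.getpreferredencoding(False).
default_text_encoding = locale.getpreferredencoding(False)

print(sys.platform, default_text_encoding)
```

If this reports cp1252 on the affected machine, the `encodings\cp1252.py` frame in the tracebacks is explained by the platform default rather than by Anaconda.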

JGCoelho commented 4 years ago

I cloned it again and ran the unittest with the default Python 3.8.2 install. Same errors:

C:\Users\JGC\Desktop>git clone https://github.com/jsvine/markovify.git
Cloning into 'markovify'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 834 (delta 16), reused 10 (delta 2), pack-reused 802
Receiving objects: 100% (834/834), 461.29 KiB | 1.43 MiB/s, done.
Resolving deltas: 100% (495/495), done.

C:\Users\JGC\Desktop>cd markovify

C:\Users\JGC\Desktop\markovify>py --version
Python 3.8.2

C:\Users\JGC\Desktop\markovify>py -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 24, in test_from_json_without_retaining
    original_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 36, in test_from_mult_files_without_retaining
    models.append(markovify.Text(f, retain_original=False))
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 17, in test_without_retaining
    senate_model = markovify.Text(f, retain_original=False)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
    parsed = parsed_sentences or self.generate_corpus(input_text)
  File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
    for line in text:
  File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>

----------------------------------------------------------------------
Ran 4 tests in 0.515s

FAILED (errors=3)

Maybe a problem with codecs? Opening sherlock.txt and senate-bills.txt, I could see that they were encoded as UTF-8 without BOM. I converted them to UTF-8 with BOM and got the same error. Converting them to ANSI and UCS-2 didn't help either.

JGCoelho commented 4 years ago

Also, the byte 0x9d appears to be the last byte of 'RIGHT DOUBLE QUOTATION MARK' (U+201D), whose UTF-8 encoding is E2 80 9D.
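The relationship between U+201D and the byte 0x9d can be checked directly in Python. The snippet below is a small sketch using only built-in string methods; the variable names are illustrative.

```python
quote = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

# In UTF-8 the character takes three bytes, and the last one is 0x9d.
utf8_bytes = quote.encode("utf-8")

# In cp1252 (Windows-1252) the same character is the single byte 0x94.
cp1252_bytes = quote.encode("cp1252")

print(utf8_bytes)    # b'\xe2\x80\x9d'
print(cp1252_bytes)  # b'\x94'
```

So a 0x9d byte in the traceback is consistent with a UTF-8 file being read with the cp1252 codec.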

Sylv-Lej commented 4 years ago

0x9d is unmapped in Windows-1252, according to Wikipedia.
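Putting the pieces together, the failure can be reproduced on any platform, and it goes away once the file's encoding is passed explicitly. This is a minimal sketch under assumptions: the sample text and temporary file are hypothetical stand-ins for the bundled text files, and cp1252 is requested explicitly to simulate the Windows default. It does not show how the markovify tests actually open their files.

```python
import os
import tempfile

# Hypothetical stand-in for the bundled corpus: one line containing U+201D.
sample = "He said \u201chello\u201d.\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Simulate the Windows default by requesting cp1252 explicitly.
    # Iterating over the handle mirrors the "for line in text" loop
    # shown in the tracebacks.
    try:
        with open(path, encoding="cp1252") as f:
            for line in f:
                pass
        failed = False
        message = ""
    except UnicodeDecodeError as exc:
        failed = True
        message = str(exc)

    # Naming the file's real encoding makes the read platform-independent.
    with open(path, encoding="utf-8") as f:
        lines = list(f)

print(failed, message)
print(lines)
```

If the tests open the text files without an `encoding` argument, adding `encoding="utf-8"` to those `open()` calls would be one plausible fix.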