UnicodeEncodeError: 'charmap' codec can't encode character. character maps to <undefined>

minimaxir / gpt-2-simple

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

UnicodeEncodeError: 'charmap' codec can't encode character. character maps to <undefined> #213

Open ahmadalli opened 4 years ago

ahmadalli commented 4 years ago

I get this error when gpt2.finetune tries to generate samples. Dataset loading works fine (that was the problem in #9).

This is the complete error text:

    Traceback (most recent call last):
      File ".\persian.py", line 39, in <module>
        save_every=500
      File "E:\ai\GPT-2\envs\lib\site-packages\gpt_2_simple\gpt_2.py", line 331, in finetune
        generate_samples()
      File "E:\ai\GPT-2\envs\lib\site-packages\gpt_2_simple\gpt_2.py", line 306, in generate_samples
        fp.write('\n'.join(all_text))
      File "E:\ai\GPT-2\envs\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u0631' in position 28: character maps to <undefined>
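
For anyone trying to reproduce this outside of gpt-2-simple, here is a minimal sketch of what the last two traceback frames are doing. It only assumes that the output file ended up using cp1252, as the `cp1252.py` frame suggests; the `can_encode` helper is a made-up name for illustration and is not part of the package.

```python
# '\u0631' is ARABIC LETTER REH, the character named in the traceback.
text = "\u0631"


def can_encode(s, encoding):
    """Return True if every character of s can be encoded with `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Same exception type that fp.write() raises in the traceback.
        return False


cp1252_ok = can_encode(text, "cp1252")  # Windows "charmap" codec
utf8_ok = can_encode(text, "utf-8")

print("cp1252:", cp1252_ok)
print("utf-8:", utf8_ok)
```

cp1252 is a single-byte Western European code page with no slot for Arabic-script letters, so the write fails regardless of how the input dataset was encoded.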

ahmadalli commented 4 years ago

The files are encoded in UTF-8 with LF line endings.

syn-chromatic commented 4 years ago

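It seems to fail when it tries to save the generated samples to a file. The file is opened without an explicit encoding, so Python falls back to the system default, which is cp1252 here according to the traceback.

Specifying the encoding in the open() call at line 303 of gpt_2.py seems to have fixed the issue: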

    with open(
            os.path.join(SAMPLE_DIR, run_name,
                         'samples-{}').format(counter), 'w', encoding='utf8', errors='ignore') as fp:
        fp.write('\n'.join(all_text))
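
A standalone sketch of the same idea, for checking the behaviour without touching the installed package. The `write_samples` helper, the sample strings, and the temporary directory are all assumptions made up for this example; only the `open(..., 'w', encoding=...)` pattern comes from the snippet above.

```python
import os
import tempfile

# Hypothetical sample text containing Persian characters.
all_text = ["\u0633\u0644\u0627\u0645", "sample"]


def write_samples(path, lines):
    """Write lines joined by newlines, always as UTF-8."""
    # An explicit encoding makes the result independent of the OS locale.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("\n".join(lines))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples-{}".format(1))
    write_samples(path, all_text)
    # Read back with the same encoding to confirm nothing was lost.
    with open(path, encoding="utf-8") as fp:
        round_trip = fp.read()

print(round_trip == "\n".join(all_text))
```

This sketch leaves out `errors='ignore'`. With UTF-8 that option only matters for code points that cannot be encoded at all, such as lone surrogates, and it drops them silently; leaving it out means such a problem raises an error instead.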

axfelix commented 3 years ago

Thanks @Syn08 -- this fix should be merged upstream if possible!

FlashlightET commented 1 year ago

i still had to manually edit the code to fix this, was never merged

ahmadalli commented 1 year ago

> i still had to manually edit the code to fix this, was never merged

wasn't it fixed on #290?

Flightkick commented 1 year ago

> > i still had to manually edit the code to fix this, was never merged
>
> wasn't it fixed on #290?

It was merged into master but never released. The latest release, 0.8.1, does not include the fix.
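
If editing the installed package is not an option, one possible workaround is Python's UTF-8 mode (Python 3.7+, enabled with `python -X utf8` or the `PYTHONUTF8=1` environment variable), which makes `open()` default to UTF-8 when no encoding is given. Whether that is enough for gpt-2-simple 0.8.1 is an assumption that has not been checked in this thread. The sketch below only reports what the current interpreter would use; both function names are made up for illustration.

```python
import locale
import sys


def default_text_encoding():
    """Encoding that open() uses when no `encoding` argument is passed."""
    return locale.getpreferredencoding(False)


def utf8_mode_enabled():
    """True if the interpreter was started in UTF-8 mode."""
    return bool(sys.flags.utf8_mode)


print("default encoding:", default_text_encoding())
print("UTF-8 mode:", utf8_mode_enabled())
```

If the first line prints something like `cp1252` and the second prints `False`, any unpatched `open(..., 'w')` call can hit the error reported above.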

Technerder commented 1 year ago

I've run into this issue a few times. I saw the related code while trying to track down the problem, but never thought to check whether the fix had been released.