sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow
MIT License

Does not work with other languages #113

Open ShuvenduRoy opened 6 years ago

ShuvenduRoy commented 6 years ago

What do I have to change to support UTF-8, so that I can train it on other languages?

john-parton commented 6 years ago

It actually should work with utf-8 if you're using the latest version.

What are your versions:

Thanks.

lowtronik commented 6 years ago

Actually, the sample script outputs my Greek text as raw UTF-8 bytes: " \xcf\xce\x83\, \xb1\xb9\ .........."

ShuvenduRoy commented 6 years ago

@lowtronik That is the hex-escaped byte format. Just decode it: result.decode("utf-8", "replace")
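
A minimal sketch of that decode step, assuming Python 3 and that the sampled output is a bytes object. The name decode_sample and the Greek sample bytes are hypothetical and only there for illustration; they are not taken from the repository.

```python
def decode_sample(raw: bytes) -> str:
    """Decode UTF-8 bytes into text.

    With errors="replace", any invalid or incomplete byte sequence becomes
    U+FFFD instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", "replace")


# Hypothetical stand-in for the sampler output: the letters alpha, beta, gamma
# encoded as UTF-8 (two bytes per letter).
raw = b"\xce\xb1\xce\xb2\xce\xb3"

text = decode_sample(raw)

# Dropping the last byte leaves an incomplete sequence at the end, which
# "replace" turns into a single replacement character.
truncated = decode_sample(raw[:-1])

# ascii() keeps the printout safe even on a terminal that is not UTF-8.
print(ascii(text))
print(ascii(truncated))
```

The "replace" handler matters here because a character-level model can emit byte sequences that are not valid UTF-8, and a strict decode would raise on those.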

lowtronik commented 6 years ago

@ShuvenduBikash I just deleted .encode('utf-8') and it works

foocp commented 6 years ago

I have the same problem. It generates raw text like this:

\xc3\xa8p

However, if I follow your suggestion and delete .encode('utf-8'), it fails with this error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u201c' in position 444: ordinal not in range(128)
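
For reference, this kind of error usually means a str with non-ASCII characters is being written to a stream whose encoding is ASCII, which can happen with stdout depending on the locale. The sketch below reproduces it in isolation and shows one possible workaround, assuming Python 3. The helper write_with_encoding and the sample text are hypothetical and not part of the repository.

```python
import io


def write_with_encoding(s: str, encoding: str) -> bytes:
    """Write s to an in-memory text stream that uses the given encoding."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding)
    stream.write(s)
    stream.flush()
    return buf.getvalue()


# Hypothetical sample containing U+201C and U+201D (curly quotes),
# the same kind of character named in the error message.
text = "\u201cquoted\u201d"

# An ASCII stream cannot represent the curly quotes.
try:
    write_with_encoding(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A UTF-8 stream encodes the same text without error.
utf8_bytes = write_with_encoding(text, "utf-8")

print(ascii_failed)
print(utf8_bytes)

# Assumption: for a real script, comparable options are setting the
# environment variable PYTHONIOENCODING=utf-8 before running it, or calling
# sys.stdout.reconfigure(encoding="utf-8") on Python 3.7 or later.
```

If the cause is the stdout encoding, changing that encoding should let the script print the decoded text directly, without the .encode('utf-8') call.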