pender / chatbot-rnn

A toy chatbot powered by deep learning and trained on data from Reddit
MIT License
899 stars 370 forks

Issue with train.py - chatset errors. #44

Open ghost opened 6 years ago

ghost commented 6 years ago

Any thoughts? I am using Windows.

```
Preprocessing file 2/6 (reddit-parse/output\output 1.bz2)...
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 49, in main
    train(args)
  File "train.py", line 55, in train
    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
  File "D:\bot\utils.py", line 39, in __init__
    self._preprocess(self.input_files[i], self.tensor_file_template.format(i))
  File "D:\bot\utils.py", line 107, in _preprocess
    data = file_reference.read()
  File "D:\python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 23267: character maps to <undefined>
```
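For reference, a minimal sketch of what is going wrong: on Windows, `open()` defaults to the locale encoding (cp1252 here), and byte `0x9d` occurs inside multi-byte UTF-8 sequences but has no mapping in cp1252.

```python
# Byte 0x9d occurs inside the UTF-8 encoding of characters such as the
# right curly quote (U+201D), but cp1252 leaves 0x9d unmapped.
data = "right quote: \u201d".encode("utf-8")   # ends in bytes \xe2\x80\x9d

try:
    data.decode("cp1252")                      # what the default read does
except UnicodeDecodeError as exc:
    print("cp1252 failed on byte", hex(data[exc.start]))  # prints 0x9d

print(data.decode("utf-8"))                    # the matching codec succeeds
```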

sashasmirnova commented 6 years ago

hi, I'm having the same problem when I'm running train.py on new data.

neofob commented 6 years ago

This might not be the right solution, but here is a patch for it: https://github.com/neofob/chatbot-rnn/commit/1f56cb941b834c5bc95c8f40fe58ce08277f4d10

zhou-daniel-dz commented 6 years ago

Yes, @neofob's patch changes the encoding that utils.py uses to read the training sets, but it must also match the encoding you used to write the training data (i.e. if your training files are encoded as UTF-8, they should be read as UTF-8).
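A minimal sketch of that symmetric read/write, assuming a bz2-compressed text file (the temp-file path is just for illustration):

```python
import bz2
import os
import tempfile

text = "smart quote \u201d and accent \u00e9"
path = os.path.join(tempfile.gettempdir(), "sample.bz2")

# Write and read with the SAME encoding; mismatching them (e.g. writing
# UTF-8 and reading cp1252) reproduces the UnicodeDecodeError above.
with bz2.open(path, mode="wt", encoding="utf-8") as f:
    f.write(text)

with bz2.open(path, mode="rt", encoding="utf-8") as f:
    assert f.read() == text
```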

Although this allows training to proceed, I'm not sure whether the char-rnn works with UTF-8 at all, since I just get gibberish back from a model trained this way. (https://github.com/karpathy/char-rnn/pull/113)

geroale commented 6 years ago

Any news? Same problem here.

The @neofob patch doesn't work for me; I suspect the errors="ignore" / errors="replace" parameter of bz2.open isn't taking effect.

I am using the same Reddit dataset from @pender (https://github.com/pender/chatbot-rnn).

zhou-daniel-dz commented 6 years ago

You just need to make sure the data you're training on is encoded in ANSI (cp1252 on Western-locale Windows).

If your parser must read and write in a different encoding, re-save the output text file as ANSI and it should be usable. Certain characters cannot be mapped, of course, but the proportion of such characters seems too small to make a difference.
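A minimal sketch of that re-save step, assuming "ANSI" means cp1252 (the file paths are hypothetical; a real run would use your parser's output file):

```python
import os
import tempfile

tmp = tempfile.gettempdir()
src_path = os.path.join(tmp, "input.txt")       # hypothetical UTF-8 output
dst_path = os.path.join(tmp, "input_ansi.txt")  # re-saved ANSI copy

# Create a UTF-8 sample with one character cp1252 cannot represent.
with open(src_path, "w", encoding="utf-8") as f:
    f.write("price: 5\u20ac, check: \u2713")    # euro maps to cp1252, check mark does not

# Re-save as cp1252, silently dropping the few unmappable characters.
with open(src_path, encoding="utf-8") as src:
    text = src.read()
with open(dst_path, "w", encoding="cp1252", errors="ignore") as dst:
    dst.write(text)

with open(dst_path, encoding="cp1252") as f:
    print(f.read())   # the check mark has been dropped
```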

remotejob commented 5 years ago

@neofob @zhou-daniel-dz I'm trying to figure out how to make char-rnn work with UTF-8, but this simple patch in utils.py:

```python
if input_file.endswith(".bz2"):
    file_reference = bz2.open(input_file, mode='rt', encoding="utf-8", errors="replace")
elif input_file.endswith(".txt"):
    file_reference = io.open(input_file, mode='rt', encoding="utf-8", errors="replace")
```

doesn't work for me. Probably it's not enough?

breadbrowser commented 2 years ago

No, it's just a bad or wrong format

breadbrowser commented 2 years ago

of the bz2 or txt file, or a file renamed from .zst.