tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

'TypeError: can't concat str to bytes' after last commits #182

Open daniel-kukiela opened 6 years ago

daniel-kukiela commented 6 years ago

Hi, after commit 58abf51aaf637962b0a5342afcd480af5cda7227 I'm unable to run training. It fails with this error:

...
  File "D:\seq2seq_test3\nmt\nmt\train.py", line 449, in _sample_decode
    utils.print_out(b"    src: " + src_data[decode_id])
TypeError: can't concat str to bytes

So... probably fix doesn't work? :)

Regards, Daniel

smsrikanthreddy commented 6 years ago

Thanks @daniel-kukiela, your temporary fix worked.

daniel-kukiela commented 6 years ago

So we have another encoding-related issue that you were attempting to fix.

The attempt to fix the original issue broke the trainer, as I described above. I got that (encoding-related) issue wrong the first time @sentdex showed it to me. The issue we are facing here is that some characters can't be encoded with the current stdout encoding (this is true for Python versions < 3.6 on Windows, and maybe for some installations on other OSes).

The stdout console encoding on Windows changed with Python 3.6 (https://www.python.org/dev/peps/pep-0528/). Here is the issue with Python 3.5 on Windows 10: https://i.gyazo.com/907abdea295477595fa97bd0e56f220d.png

So I think a better way to fix the original issue is to change:

  out_s = s.encode("utf-8")
  if not isinstance(out_s, str):
    out_s = out_s.decode("utf-8")

(lines 64-66 in utils/misc_utils.py, function name: print_out) to:

  out_s = s.encode(sys.stdout.encoding, "backslashreplace")
  if not isinstance(out_s, str):
    out_s = out_s.decode(sys.stdout.encoding, "backslashreplace")

and stop assuming utf-8 as the stdout encoding. That will also ensure that every string can be printed: characters the console encoding can't represent are written as backslash escapes instead of raising an error.
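To illustrate what the "backslashreplace" error handler does, here is a minimal sketch. The helper name `to_printable` and the fallback to UTF-8 when no encoding is reported are assumptions made for this example; they are not part of the nmt code.

```python
import sys


def to_printable(s, encoding=None):
    """Return a str that can be written to a stream that uses `encoding`.

    Hypothetical helper for illustration only, not the nmt implementation.
    """
    # Assumption: fall back to UTF-8 if the stream reports no encoding.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # Characters the target encoding cannot represent are replaced with
    # backslash escapes (\xNN, \uNNNN) instead of raising UnicodeEncodeError.
    out_bytes = s.encode(encoding, "backslashreplace")
    # The resulting bytes are valid in `encoding`, so decoding cannot fail.
    return out_bytes.decode(encoding)


# With an ASCII-only target, the non-ASCII characters become visible escapes.
print(to_printable("caf\u00e9", "ascii"))
print(to_printable("z\u0142oty", "ascii"))
```

The round trip through bytes is what makes the output safe: the returned string contains only characters the chosen encoding can represent, so a later `print` to a stream with that encoding should not fail.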

Also, to fix the issue caused by the last commit (the attempt to fix the original issue), change:

  utils.print_out(b"    src: " + src_data[decode_id])
  utils.print_out(b"    ref: " + tgt_data[decode_id])

(lines 449-450 in train.py, function name: _sample_decode) to:

  utils.print_out(b"    src: " + src_data[decode_id].encode("utf-8"))
  utils.print_out(b"    ref: " + tgt_data[decode_id].encode("utf-8"))
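The error from the traceback can be reproduced in isolation. This is a small sketch that assumes, as the error message implies, that the elements of src_data are str objects under Python 3; the sample sentence is made up.

```python
prefix = b"    src: "
sentence = "ein Beispiel"  # made-up stand-in for src_data[decode_id]

# Python 3 never mixes bytes and str implicitly, so this raises TypeError.
try:
    prefix + sentence
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Encoding the str first gives two bytes operands, which concatenate fine.
line = prefix + sentence.encode("utf-8")
print(line)
```

The exact wording of the TypeError message differs between Python versions, so the sketch only prints it rather than comparing it to a fixed string.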

Regards, Daniel

oahziur commented 6 years ago

@daniel-kukiela Thanks for the comments. I reverted the previous attempt.

I think we may only need to apply this one change from your suggestion at head to fix the issue:

(lines 64-66 in utils/misc_utils.py, function name: print_out)

  out_s = s.encode(sys.stdout.encoding, "backslashreplace")
  if not isinstance(out_s, str):
    out_s = out_s.decode(sys.stdout.encoding, "backslashreplace")

Is this bug Windows-specific? Have you tried it on other OSes?

daniel-kukiela commented 6 years ago

We experienced that issue only on Windows. The bug will occur on any supported OS and Python version where Python uses a non-UTF-8 console encoding. I'm not sure whether that is the case for any combination other than Windows + Python 3.5 (the console encoding was changed to UTF-8 in Python 3.6 for Windows: https://www.python.org/dev/peps/pep-0528/).
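For anyone who wants to check whether their setup is affected, a quick sketch like the following shows the encoding Python uses for stdout and whether a given character can be encoded with it. The helper `can_encode` is a hypothetical name used only for this example.

```python
import sys


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The encoding Python uses when writing to stdout (may be None if redirected).
print("stdout encoding:", sys.stdout.encoding)

sample = "\u0142"  # a non-ASCII character, written as an escape to keep the source ASCII
print("ascii:", can_encode(sample, "ascii"))
print("utf-8:", can_encode(sample, "utf-8"))
if sys.stdout.encoding:
    print("stdout:", can_encode(sample, sys.stdout.encoding))
```

If the last line prints False, printing that character directly would raise UnicodeEncodeError on that console, which is the situation described above.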

And yes - that one change will be sufficient now.

Regards, Daniel