Opened by kaihuchen 7 years ago (status: Open)
Upon further investigation, I found that the error came from line 164 of decode_text.py:
fetches["features.source_tokens"] = np.char.decode(
fetches["features.source_tokens"].astype("S"), 'utf-8')
If I change the statement to the following:
fetches["features.source_tokens"] = np.char.decode(
b''.join(fetches["features.source_tokens"][:-1]), 'utf-8')
then the problem goes away, and the following statement:
print( fetches["features.source_tokens"] )
also displays the correct Unicode string from the test dataset. However, I also found that the predicted output (i.e., fetches["predicted_tokens"]) contains nothing but a bunch of b'UNK' tokens, even though the training process appears to converge to a small loss of 0.01, and the test data is in fact a subset of the original training data (used here just to test this problem).
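For reference, here is a minimal sketch (my own, not code from seq2seq) of why the per-element decode fails on byte-split multibyte input while joining the bytes first succeeds. The trailing SEQUENCE_END element mirrors the assumption behind the [:-1] slice in the fix above:

```python
import numpy as np

# "中" is three bytes in UTF-8; a byte-level split leaves one byte per token,
# with an end-of-sequence marker as the last element (an assumption here).
tokens = np.array([b'\xe4', b'\xb8', b'\xad', b'SEQUENCE_END'], dtype=object)

# Decoding element-wise fails: a lone UTF-8 continuation byte is not valid
# UTF-8 on its own.
try:
    np.char.decode(tokens.astype("S"), 'utf-8')
except UnicodeDecodeError as err:
    print("per-element decode fails:", err)

# Joining the byte tokens (dropping the end marker) and then decoding the
# whole sequence recovers the original character.
print(b''.join(tokens[:-1]).decode('utf-8'))  # prints 中
```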
I wonder if anybody has successfully used seq2seq to train a character-level NMT model on a Unicode dataset. The evidence above suggests that the prediction phase couldn't have worked, and that the training part is also suspect. It is of course possible that I have made a stupid mistake somewhere, in which case any advice would be much appreciated.
To answer my own question about why training with character-level multibyte UTF-8 data results in a useless model: I believe the training phase fails because of a bug in TensorFlow. seq2seq.data.split_tokens_decoder.py calls tf.string_split, and per the TensorFlow API docs: "If delimiter is an empty string, each element of the source is split into individual strings, each containing one byte. (This includes splitting multibyte sequences of UTF-8.)" That is the wrong thing to do for character-level data. I also found a TensorFlow pull request from just a few days ago that seems to confirm my reasoning.
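A pure-Python sketch (not TensorFlow code) of the behavior the API docs describe, and of why it would yield nothing but UNK: splitting on the empty delimiter produces one token per byte, so the tokens the model sees never match a character-level vocabulary. The vocabulary below is a made-up example:

```python
# Emulate tf.string_split(..., delimiter="") on UTF-8 input: it splits on
# bytes, not on characters.
text = "你好"
byte_tokens = [bytes([b]) for b in text.encode("utf-8")]
print(len(text), "characters ->", len(byte_tokens), "byte tokens")  # 2 -> 6

# What a character-level model actually needs:
char_tokens = list(text)  # ['你', '好']

# Against a character-level vocabulary (hypothetical), none of the byte
# tokens match, so every lookup falls back to the unknown marker.
vocab = {"你": 0, "好": 1}
looked_up = [tok if tok in vocab else "UNK" for tok in byte_tokens]
print(looked_up)  # ['UNK', 'UNK', 'UNK', 'UNK', 'UNK', 'UNK']
```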
I am seeing a problem somewhat similar to https://github.com/google/seq2seq/issues/170 but slightly different. In my case:
My observations:
Does anybody have insight into how to deal with this problem?
Here is my training script:
Here is my prediction script:
Here is the trace from running the prediction script: