Closed: makwadajp closed this issue 3 years ago.
Hey, thanks for raising this issue. This project is close to 5 years old, and my research has since moved in a different direction. If I were to redo this project, I would not use buckets at all, since G2P sequences are fairly short. I would suggest adapting this code to remove the dependence on buckets altogether.
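For readers landing here later, here is a minimal sketch of what bucket-free batching could look like. `GO_ID`, `EOS_ID`, and `PAD_ID` mirror the identifiers in `data_utils.py`; the function itself and everything else are hypothetical, not code from this repository:

```python
# Hypothetical sketch of bucket-free batching: pad every pair to the
# longest sequence in the batch instead of to a fixed bucket size.
# GO_ID / EOS_ID / PAD_ID stand in for the data_utils constants.
PAD_ID, GO_ID, EOS_ID = 0, 1, 2

def get_batch_without_buckets(pairs):
    """pairs: list of (encoder_input, decoder_input) token-id lists."""
    max_len_source = max(len(src) for src, _ in pairs)
    # +2 leaves room for the GO and EOS symbols around each target.
    max_len_target = max(len(tgt) for _, tgt in pairs) + 2
    encoder_inputs, decoder_inputs = [], []
    for src, tgt in pairs:
        encoder_inputs.append(src + [PAD_ID] * (max_len_source - len(src)))
        padded_tgt = [GO_ID] + tgt + [EOS_ID]
        decoder_inputs.append(
            padded_tgt + [PAD_ID] * (max_len_target - len(padded_tgt)))
    return encoder_inputs, decoder_inputs
```

Padding to the longest sequence in each batch makes the bucket tables unnecessary and sidesteps the off-by-one discussed below.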
Thank you for replying. It seems you updated the code with attention roughly 2 years ago, but I see, and I understand. Thank you again.
Although the CMUDict setup doesn't raise an exception, I tried it with another dataset and I believe there is a bug in the `get_batch(self, data, bucket_id=None)` method of `seq2seq_model.py`. Specifically, I believe there is a case where `decoder_pad_size` becomes negative when `self.isTraining` is false, at the following line:

```python
decoder_pad_size = max_len_target - (len(decoder_input) + 1)
```

When `decoder_pad_size` is negative, a ValueError is raised. I believe this computation is the culprit; it should be written with an extra `+ 1`.
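To make the failure concrete, here is a minimal, hypothetical reconstruction of the padding arithmetic (identifiers follow the names above; this is not the verbatim repository code):

```python
# Hypothetical reconstruction of the decode-time padding arithmetic;
# GO_ID / EOS_ID / PAD_ID stand in for the data_utils constants.
GO_ID, EOS_ID, PAD_ID = 1, 2, 0

decoder_input = list(range(35))   # longest target in the batch reported below
max_len_target = 36               # as reported below

decoder_pad_size = max_len_target - (len(decoder_input) + 1)  # 0 here; negative
                                                              # once len + 1 > 36
row = [GO_ID] + decoder_input + [EOS_ID] + [PAD_ID] * decoder_pad_size
print(len(row))                   # 37 == max_len_target + 1, one more than 36
```

Every row built this way is `max_len_target + 1` elements long, so anything sized with plain `max_len_target` comes up one element short; that matches the 256 x 36 vs. 256 x 37 shapes reported below.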
For your information, here are the conditions under which the ValueError occurs: with `FLAGS.batch_size = 256` and a longest target sequence of 35 tokens, the original code creates `decoder_inputs` with a shape of 256 (`FLAGS.batch_size`) x 36, when it should be 256 x 37, i.e. `[data_utils.GO_ID] + decoder_input + [data_utils.EOS_ID] + [data_utils.PAD_ID] * 0`.

As extra information, my G2P dataset is not English and has a maximum input sequence length of 65 and a maximum output sequence length of 97. While the above fix of `+ 1` seems to do the trick (no more ValueError), should I be concerned about other parameters, e.g. `_buckets = [(35, 35)]` in `data_utils.py`? I read your comment regarding the buckets, but the link you mention is broken: http://goo.gl/d8ybpl.
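For anyone checking their own dataset against the bucket sizes, here is a rough, hypothetical helper; the function name and the exact `+ 2` of headroom are my assumptions, based on the GO/EOS layout discussed above, not code from `data_utils.py`:

```python
# Hypothetical helper for sizing a single bucket from the data; the +2 of
# headroom assumes the target row must hold GO and EOS, as in the shapes above.
def bucket_for(pairs):
    """pairs: list of (source_token_ids, target_token_ids)."""
    max_source = max(len(src) for src, _ in pairs)
    max_target = max(len(tgt) for _, tgt in pairs)
    return (max_source, max_target + 2)

# For the dataset described here (max input 65, max output 97) this would
# suggest something like _buckets = [(65, 99)] rather than [(35, 35)].
```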