harryjhnam closed this issue 3 years ago.
IIRC the answers are fixed-length strings (30 characters?), including padding characters. I.e., the answer to "What is 2 + 2?" is "4·····························" where "·" is a special padding character, and the model has to predict the padding characters in addition to the answer. (Note that in practice, predicting padding is trivial for the model once it has predicted at least one padding character, and does not impact performance.)
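For concreteness, the target construction works roughly like this (a minimal sketch only; the constant names, the 30-character length, and the padding symbol are illustrative, not the exact values used):

```python
ANSWER_LEN = 30   # illustrative fixed answer length (exact value may differ)
PAD = "\x00"      # stand-in for the special padding character

def pad_answer(answer: str) -> str:
    """Right-pad an answer string to the fixed length with the padding character."""
    assert len(answer) <= ANSWER_LEN
    return answer + PAD * (ANSWER_LEN - len(answer))

# "What is 2 + 2?" -> "4" followed by 29 padding characters;
# the model is trained to predict all 30 positions, padding included.
target = pad_answer("4")
```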
Sorry, unable to release training code at the moment.
Thank you for your reply! It helped me a lot.
I'm trying to reproduce the baseline performance of the Attentional LSTM in the paper.
Even though I use the same training hyper-parameters as the paper, I can't get performance similar to the baseline. I suspect the problem with my implementation is how I handle the ignored (padding) character. The paper says 96 characters are used, including one special token.
What I'm wondering is: if there's no end-of-sequence token, should the prediction from the model also be padded with the ignored character? In that case the loss function shouldn't ignore those tokens, and I'm worried that the ignored character would then have too much impact on training.
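For concreteness, here is a minimal sketch of the two options I'm weighing (this is my own PyTorch code, not from the paper; the shapes and the padding index are just placeholders):

```python
import torch
import torch.nn.functional as F

PAD_ID = 95                                # placeholder index for the ignored/padding character
logits = torch.randn(8, 30, 96)            # (batch, answer_length, 96-character vocabulary)
targets = torch.randint(0, 96, (8, 30))    # fixed-length padded answer character ids

# Option A (what I currently do): mask the padding character out of the loss.
loss_masked = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=PAD_ID)

# Option B: train on every position, padding characters included.
loss_full = F.cross_entropy(logits.transpose(1, 2), targets)
```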
Also, if it's possible, could you share the training code for the baseline results? It would be a great help in reproducing the paper's results. :) Thanks!