salesforce / decaNLP

The Natural Language Decathlon: A Multitask Challenge for NLP

Question about the paper: Why is answer representation needed for training? #38

Closed: howardyclo closed this issue 5 years ago

howardyclo commented 5 years ago

Hi, I am confused about the model's inputs at training time versus test time.

It makes sense to me that the output distribution is the answer prediction, and that we train by minimizing the negative log-likelihood between the output distribution and the ground-truth answer. What confuses me is that the answer is also fed into the model to compute the answer representation mentioned in the paper. Why is this? Aren't the answers unknown at test time?

By the way, there's a typo in your paper. In equation (10), the ℝ^n for p_c should be ℝ^l. Right?

Thanks.

bmccann commented 5 years ago

It should definitely be the dimension of the context for p_c, not the dimension of the answer. Thanks for pointing that out!
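For reference, in the paper's notation (where l is the context length and n the answer length, as the thread above implies), the corrected typing of equation (10) would read:

```latex
% Corrected equation (10) typing: p_c is a distribution over the
% l context tokens, not the n answer tokens.
p_c \in \mathbb{R}^{l}
```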

Re: training with the answer, forgive me if I cover things you already know. We're doing something called teacher forcing: at every time step during training, we feed in the correct token from the previous time step rather than the predicted token. This trains the model more efficiently under the NLL/XE (negative log-likelihood / cross-entropy) setup, so we take the answer as input in order to use the correct tokens instead of the predicted ones.

For the part of the decoder that is a transformer, this also lets us decode much faster during training. Without the recurrence of an RNN, we can use the ground-truth answer to compute all decoder outputs (for all time steps) in parallel, as long as we mask out the future correctly (you'll notice a causal flag in my transformer code; that's what it's for).
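To make that concrete, here is a minimal PyTorch sketch of teacher forcing plus a causal mask. This is not the decaNLP code itself; the module, shapes, and BOS token id are made up for illustration:

```python
import torch
import torch.nn as nn

def causal_mask(size):
    # Lower-triangular bool matrix: position t may attend to positions <= t.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

class TinyDecoder(nn.Module):
    # Illustrative single-layer self-attention decoder, not decaNLP's MQAN.
    def __init__(self, vocab_size, d_model=64, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, answer_tokens):
        # answer_tokens: (batch, time) ground-truth tokens (teacher forcing).
        x = self.embed(answer_tokens)
        t = x.size(1)
        # For bool masks, PyTorch treats True as "not allowed to attend",
        # so invert: only positions strictly in the future are masked.
        mask = ~causal_mask(t)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.proj(out)  # logits for every time step, in one pass

# One training step: shift the gold answer right so the input at step t
# is the correct token from step t-1 (teacher forcing).
vocab = 100
model = TinyDecoder(vocab)
gold = torch.randint(1, vocab, (2, 7))          # (batch, time) gold answers
bos = torch.zeros(2, 1, dtype=torch.long)       # hypothetical BOS id 0
inputs = torch.cat([bos, gold[:, :-1]], dim=1)  # teacher-forced inputs
logits = model(inputs)                          # all time steps in parallel
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), gold.reshape(-1))
loss.backward()
```

At test time, by contrast, you would feed the model's own previous prediction back in one step at a time, which is why the answer is only needed as an input during training.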

If I misunderstood your question or this doesn't seem like what we're actually doing in the code, please let me know (and feel free to reopen).