salesforce / decaNLP

The Natural Language Decathlon: A Multitask Challenge for NLP

Question about the paper: Why is answer representation needed for training? #38

Closed: howardyclo closed this issue 5 years ago

howardyclo commented 5 years ago

Hi, I am confused about the model's inputs at training time versus test time.

It makes sense to me that the output distribution is the answer prediction, and that we train by minimizing the negative log-likelihood between the output distribution and the ground-truth answer. What confuses me is that the answer is also fed into the model to compute the answer representation mentioned in the paper. Why is this? Aren't the answers unknown at test time?

By the way, there's a typo in your paper. In equation (10), the ℝ^n for p_c should be ℝ^l. Right?

Thanks.

bmccann commented 5 years ago

It should definitely be the dimension of the context for p_c, not the dimension of the answer. Thanks for pointing that out!
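For reference, in the paper's notation (where l is the context length and n the answer length, as the thread above implies), the corrected typing of equation (10) would read:

```latex
% Corrected equation (10) typing: p_c is a distribution over the
% l context tokens, not the n answer tokens.
p_c \in \mathbb{R}^{l}
```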

Re: training with the answer, forgive me if I cover things you already know. We're doing something called teacher forcing: at every time step during training, we feed in the correct token from the previous time step rather than the predicted token. This trains the model more efficiently under the NLL/XE (negative log-likelihood / cross-entropy) setup, so we take the answer as input in order to use the correct tokens instead of the predicted ones.

For the part of the decoder that is a transformer, this also lets us decode much faster during training. Without the recurrence of an RNN, we can use the ground-truth answer to compute all decoder outputs (for all time steps) in parallel, as long as we mask out the future correctly (you'll notice a causal flag in my transformer code; that's what it's for).
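To make that concrete, here is a minimal PyTorch sketch of teacher forcing plus a causal mask. This is not the decaNLP code itself; the module, shapes, and BOS token id are made up for illustration:

```python
import torch
import torch.nn as nn

def causal_mask(size):
    # Lower-triangular bool matrix: position t may attend to positions <= t.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

class TinyDecoder(nn.Module):
    # Illustrative single-layer self-attention decoder, not decaNLP's MQAN.
    def __init__(self, vocab_size, d_model=64, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, answer_tokens):
        # answer_tokens: (batch, time) ground-truth tokens (teacher forcing).
        x = self.embed(answer_tokens)
        t = x.size(1)
        # For bool masks, PyTorch treats True as "not allowed to attend",
        # so invert: only positions strictly in the future are masked.
        mask = ~causal_mask(t)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.proj(out)  # logits for every time step, in one pass

# One training step: shift the gold answer right so the input at step t
# is the correct token from step t-1 (teacher forcing).
vocab = 100
model = TinyDecoder(vocab)
gold = torch.randint(1, vocab, (2, 7))          # (batch, time) gold answers
bos = torch.zeros(2, 1, dtype=torch.long)       # hypothetical BOS id 0
inputs = torch.cat([bos, gold[:, :-1]], dim=1)  # teacher-forced inputs
logits = model(inputs)                          # all time steps in parallel
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), gold.reshape(-1))
loss.backward()
```

At test time, by contrast, you would feed the model's own previous prediction back in one step at a time, which is why the answer is only needed as an input during training.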

If I misunderstood your question or this doesn't seem like what we're actually doing in the code, please let me know (and feel free to reopen).