Is the Decoder like the Transformer Decoder, or just a layer?

microsoft / DeBERTa

The implementation of DeBERTa

MIT License

Closed by hscspring 4 years ago

hscspring commented 4 years ago

As the title says, I'm not sure whether we need to mask future tokens, the way the Transformer does in its decoder.
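To make the question concrete, here is a minimal plain-Python sketch of the two attention patterns being compared. The helper names `causal_mask` and `full_mask` are hypothetical and do not come from the DeBERTa code. The sketch only illustrates what "masking future tokens" means; it does not show which pattern DeBERTa's decoder uses.

```python
def causal_mask(seq_len):
    """Transformer-decoder style mask (hypothetical helper).

    Entry [i][j] is True when position i may attend to position j,
    which is only the case for j <= i (no access to future tokens).
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def full_mask(seq_len):
    """Bidirectional mask (hypothetical helper).

    Every position may attend to every other position.
    """
    return [[True] * seq_len for _ in range(seq_len)]


def show(mask):
    """Print a mask as rows of 1 (allowed) and 0 (blocked)."""
    for row in mask:
        print(" ".join("1" if allowed else "0" for allowed in row))


if __name__ == "__main__":
    print("causal:")
    show(causal_mask(4))
    print("full:")
    show(full_mask(4))
```

With the causal mask, row `i` has ones only up to column `i`, so each token sees itself and earlier tokens. With the full mask, every row is all ones.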

I couldn't find an answer in the paper or the code. Does anyone know? Thanks.