In the NMT paper (pp. 13-14), the model architecture is defined. In the Decoder section (A.2.2), the probability of a target word is defined with respect to the deep output $t_i$, computed with a single maxout hidden layer:

$$p(y_i \mid s_i, y_{i-1}, c_i) \propto \exp\left(y_i^\top W_o t_i\right)$$
$$t_i = \left[\max\{\tilde{t}_{i,2j-1},\ \tilde{t}_{i,2j}\}\right]_{j=1,\dots,l}^\top$$
$$\tilde{t}_i = U_o s_{i-1} + V_o E y_{i-1} + C_o c_i$$

But if I look at the source code where the graph is built, it calls the `_build_decoder` function, where the decoder section is constructed, and I could not find the implementation of these equations there.

Can someone clarify how those equations are included in the nmt implementation?
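For reference, here is a minimal NumPy sketch of what the paper's deep-output computation describes, assuming the standard reading of A.2.2 (all names, shapes, and the `maxout_deep_output` helper are illustrative, not taken from the nmt codebase): $\tilde{t}_i$ is a linear combination of the previous decoder state, the previous target embedding, and the context vector; the maxout takes pairwise maxima over consecutive components (halving the dimension); and the result is projected to vocabulary logits and softmaxed.

```python
import numpy as np

def maxout_deep_output(s_prev, Ey_prev, c, U_o, V_o, C_o, W_o):
    """Sketch of the deep output with a single maxout hidden layer (A.2.2).

    s_prev  : previous decoder hidden state s_{i-1}
    Ey_prev : embedding of the previous target word, E y_{i-1}
    c       : context vector c_i from the attention mechanism
    U_o, V_o, C_o : projection matrices, each with 2l output rows
    W_o     : output matrix mapping the l-dim maxout result to vocab logits
    """
    # t~_i = U_o s_{i-1} + V_o E y_{i-1} + C_o c_i   (dimension 2l)
    t_tilde = U_o @ s_prev + V_o @ Ey_prev + C_o @ c
    # maxout: max over consecutive pairs of components -> dimension l
    t = t_tilde.reshape(-1, 2).max(axis=1)
    # vocabulary logits, then softmax to get p(y_i | s_i, y_{i-1}, c_i)
    logits = W_o @ t
    exp_logits = np.exp(logits - logits.max())  # stabilized softmax
    return exp_logits / exp_logits.sum()
```

Note that in the actual nmt code the output projection is folded into a single dense layer over the decoder output rather than written out term by term like this, which is why the equations are hard to spot in `_build_decoder`.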