In a typical Marian transformer model, what are the naming conventions for the layers?
We're trying to figure this out with more clarity, but here are some assumptions we're making.
1. `special:model.yml` is what keeps the Marian YAML config.
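(A minimal sketch of how we read that back out, assuming the model was exported as a `.npz` file and that the config is stored under that key as a byte array; `model.npz` is a placeholder path:)

```python
import numpy as np

params = np.load("model.npz")  # placeholder path to the Marian model file
# Assumption: the YAML config text is stored as a small byte/int8 array under this key.
config_yaml = params["special:model.yml"].tobytes().decode("utf-8").rstrip("\x00")
print(config_yaml)
```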
2. `encoder_Wemb` and `decoder_Wemb` are what hold the encoder's and decoder's vocab lookup tables. If we use a sentencepiece model, we can fetch the input matrix roughly as in the sketch below.
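(A rough sketch of that lookup, assuming a `.npz` model and a source-side sentencepiece model; `model.npz` and `source.spm` are placeholder paths:)

```python
import numpy as np
import sentencepiece as spm

params = np.load("model.npz")                              # placeholder Marian model file
sp = spm.SentencePieceProcessor(model_file="source.spm")   # placeholder sentencepiece model

token_ids = sp.encode("Hello world", out_type=int)
emb = params["encoder_Wemb"]        # (vocab_size, emb_dim) lookup table
sentence_matrix = emb[token_ids]    # one embedding row per subword token
print(sentence_matrix.shape)        # (num_tokens, emb_dim)
```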
3. Now the rest of the layers get a little fuzzy. Though the naming makes sense, we'd like to confirm whether we're understanding them correctly.
3a. Please do correct me if I'm getting it wrong: `encoder_l*` refers to the layers on the encoder side of the architecture, and `decoder_l*` refers to the layers on the decoder side.
3b. There are 6 layers, so the numerical value after `*_l\d_*` is the layer number, from 1 to 6 (indexing starts from 1).
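(Just to sanity-check that numbering assumption, a rough sketch that groups the parameter names in the `.npz` by layer prefix; `model.npz` is a placeholder path:)

```python
import re
from collections import defaultdict

import numpy as np

params = np.load("model.npz")  # placeholder Marian model file
by_layer = defaultdict(list)
for name in params.files:
    m = re.match(r"(encoder|decoder)_l(\d+)_", name)
    if m:
        by_layer[(m.group(1), int(m.group(2)))].append(name)

for (side, layer), names in sorted(by_layer.items()):
    print(side, layer, len(names), "parameters")
```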
3c. And in the decoder there's a `decoder_ff_logit_out_b` layer that takes the output from one of the `decoder_l6_*` layers, and feeding it through `decoder_ff_logit_out_b` produces the indices that map to the vocabulary. Is that correct?
3d. The `*coder_l\d_self_*` prefix refers to the self-attention layers for both the encoder and decoder.
3e. And the first layer that takes in the sentence matrix embeddings (e.g. the output of point (2a) above) is the `encoder_l1_self_*` layers.
3f. In attention, the QKV computation comes first, and the weights from `*coder_l\d_self_Wq`, `*coder_l\d_self_Wk`, `*coder_l\d_self_Wv` and the biases from `*coder_l\d_self_bq`, `*coder_l\d_self_bk`, `*coder_l\d_self_bv` represent the QKV memories of the self-attention layer. So the output of the sentence matrix embedding is fed to these 6 layers: `encoder_l1_self_Wq`, `encoder_l1_self_Wk`, `encoder_l1_self_Wv`, `encoder_l1_self_bq`, `encoder_l1_self_bk`, `encoder_l1_self_bv`.
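(A rough single-head numpy sketch of how we picture that QKV step, with made-up shapes; the real model is multi-head, so this is only illustrative:)

```python
import numpy as np

def self_attention(x, Wq, bq, Wk, bk, Wv, bv):
    """Single-head scaled dot-product self-attention over a sentence matrix x."""
    Q = x @ Wq + bq   # queries
    K = x @ Wk + bk   # keys
    V = x @ Wv + bv   # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V

# made-up shapes: 5 tokens, 512-dim embeddings
x = np.random.randn(5, 512)
Wq = Wk = Wv = np.random.randn(512, 512)
bq = bk = bv = np.zeros(512)
print(self_attention(x, Wq, bq, Wk, bk, Wv, bv).shape)  # (5, 512)
```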
Cut-away question here: what do the `*coder_l\d_self_Wo` and `*coder_l\d_self_bo` parameters represent then? Are they the output projection after the QKV computation? So after the QKV (self-attention) computation of the first layer, does the result get fed into the `encoder_l1_self_Wo` and `encoder_l1_self_bo` layers?
What about these layers:
`*coder_l\d_self_Wo_ln_bias_pre`
`*coder_l\d_self_Wo_ln_scale_pre`
How are they used? Are they applied after the QKV computation, before feeding into `encoder_l1_self_Wo` and `encoder_l1_self_bo`?
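(Here is the wiring we're guessing at: the `_ln_scale_pre`/`_ln_bias_pre` pair is a layer norm applied to the block's input before the QKV computation, and `Wo`/`bo` is the output projection followed by a residual connection. This is purely our assumption, sketched below with made-up shapes:)

```python
import numpy as np

def layer_norm(x, scale, bias, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps) * scale + bias

def self_attention_block(x, p, prefix="encoder_l1_self"):
    """Guessed wiring: pre-layer-norm -> QKV self-attention -> Wo/bo projection -> residual."""
    h = layer_norm(x, p[f"{prefix}_Wo_ln_scale_pre"], p[f"{prefix}_Wo_ln_bias_pre"])
    Q = h @ p[f"{prefix}_Wq"] + p[f"{prefix}_bq"]
    K = h @ p[f"{prefix}_Wk"] + p[f"{prefix}_bk"]
    V = h @ p[f"{prefix}_Wv"] + p[f"{prefix}_bv"]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = (w @ V) @ p[f"{prefix}_Wo"] + p[f"{prefix}_bo"]  # output projection
    return x + out                                          # residual connection

# made-up 4-token, 8-dim demo with random parameters
rng = np.random.default_rng(0)
p = {f"encoder_l1_self_{n}": rng.standard_normal((8, 8)) for n in ("Wq", "Wk", "Wv", "Wo")}
p.update({f"encoder_l1_self_{n}": np.zeros(8) for n in ("bq", "bk", "bv", "bo")})
p.update({"encoder_l1_self_Wo_ln_scale_pre": np.ones(8),
          "encoder_l1_self_Wo_ln_bias_pre": np.zeros(8)})
print(self_attention_block(rng.standard_normal((4, 8)), p).shape)  # (4, 8)
```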
3g. After getting the outputs from `*coder_l\d_self_Wo` and `*coder_l\d_self_bo`, they're passed to the `*coder_l\d_ffn_*` feedforward layers. E.g. the outputs of `encoder_l1_self_Wo` and `encoder_l1_self_bo` would be passed to `encoder_l1_ffn_W1` and `encoder_l1_ffn_b1`, and subsequently `encoder_l1_ffn_W2` and `encoder_l1_ffn_b2`.
But how about these layers:
`*coder_l\d_ffn_ffn_ln_bias_pre`
`*coder_l\d_ffn_ffn_ln_scale_pre`
How are they used? Are they applied after `*coder_l\d_self_Wo` and `*coder_l\d_self_bo` and before the `*coder_l\d_ffn_*` layers?
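(Same guess for the feedforward block: a pre-layer-norm, then `W1`/`b1`, an activation, `W2`/`b2`, and a residual. The ReLU and the wiring are assumptions on our part:)

```python
import numpy as np

def ffn_block(x, p, prefix="encoder_l1_ffn"):
    """Guessed wiring: pre-layer-norm -> W1/b1 -> ReLU -> W2/b2 -> residual."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    h = (x - mean) / (std + 1e-6) * p[f"{prefix}_ffn_ln_scale_pre"] + p[f"{prefix}_ffn_ln_bias_pre"]
    h = np.maximum(0.0, h @ p[f"{prefix}_W1"] + p[f"{prefix}_b1"])  # assumed ReLU activation
    h = h @ p[f"{prefix}_W2"] + p[f"{prefix}_b2"]
    return x + h  # residual connection

# made-up 4-token, 8-dim demo with random parameters
rng = np.random.default_rng(0)
p = {"encoder_l1_ffn_W1": rng.standard_normal((8, 32)), "encoder_l1_ffn_b1": np.zeros(32),
     "encoder_l1_ffn_W2": rng.standard_normal((32, 8)), "encoder_l1_ffn_b2": np.zeros(8),
     "encoder_l1_ffn_ffn_ln_scale_pre": np.ones(8), "encoder_l1_ffn_ffn_ln_bias_pre": np.zeros(8)}
print(ffn_block(rng.standard_normal((4, 8)), p).shape)  # (4, 8)
```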
3h. At the end of the `encoder_l*` layers (specifically `encoder_l1_ffn_W2` and `encoder_l1_ffn_b2`), the output is passed to the `decoder_l2_*` QKV layers. Is this correct?
Another cut-away: in that case, are the `decoder_l1_*` layers the QKV and FFN computation over the decoder tokens?
In that case, if it's the start of the sentence, what gets fed into `decoder_Wemb` and eventually into the `decoder_l1_*` layers? Is there a `<s>` padded at the empty state?
3i. After propagating through the decoder layers from `decoder_l2_*` to `decoder_l6_*`, the output of the final `decoder_l6_ffn_W2` and `decoder_l6_ffn_b2` gets fed into `decoder_ff_logit_out_b`. Then it does an argmax and outputs the predicted vocabulary index at the current state. Is my understanding of that correct too?
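(A rough sketch of how we picture that output step. We're assuming tied embeddings, i.e. that `decoder_Wemb` doubles as the output projection and `decoder_ff_logit_out_b` is the output bias; if the model has a separate `decoder_ff_logit_out_W`, that would be used instead:)

```python
import numpy as np

def predict_next_token(decoder_state, params):
    """Guess: project the final decoder state onto the vocabulary and take the argmax."""
    W_out = params["decoder_Wemb"]            # (vocab_size, emb_dim), reused -- assumption
    b_out = params["decoder_ff_logit_out_b"]  # output bias over the vocabulary
    logits = decoder_state @ W_out.T + b_out
    return int(np.argmax(logits))             # predicted vocabulary index for this step

# made-up demo: 1000-word vocab, 8-dim states
rng = np.random.default_rng(0)
p = {"decoder_Wemb": rng.standard_normal((1000, 8)),
     "decoder_ff_logit_out_b": np.zeros(1000)}
print(predict_next_token(rng.standard_normal(8), p))
```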
4. And what about the `*coder_l\d_context_*` layers? Are those only used at decoding time, for attention?
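(Our guess is that the `context` parameters are the cross-attention over the encoder output: queries come from the decoder states, keys and values from the encoder output. A rough single-head sketch under that assumption:)

```python
import numpy as np

def context_attention(decoder_states, encoder_output, p, prefix="decoder_l1_context"):
    """Guessed wiring: queries from the decoder, keys/values from the encoder output."""
    Q = decoder_states @ p[f"{prefix}_Wq"] + p[f"{prefix}_bq"]
    K = encoder_output @ p[f"{prefix}_Wk"] + p[f"{prefix}_bk"]
    V = encoder_output @ p[f"{prefix}_Wv"] + p[f"{prefix}_bv"]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over source positions
    return (w @ V) @ p[f"{prefix}_Wo"] + p[f"{prefix}_bo"]

# made-up demo: 3 target tokens, 5 source tokens, 8-dim states
rng = np.random.default_rng(0)
p = {f"decoder_l1_context_{n}": rng.standard_normal((8, 8)) for n in ("Wq", "Wk", "Wv", "Wo")}
p.update({f"decoder_l1_context_{n}": np.zeros(8) for n in ("bq", "bk", "bv", "bo")})
print(context_attention(rng.standard_normal((3, 8)), rng.standard_normal((5, 8)), p).shape)  # (3, 8)
```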