marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

[Question] What do the layer names mean? #516

Closed alvations closed 4 years ago

alvations commented 4 years ago

In a typical Marian transformer model, what are the naming conventions for the layers?

We're trying to figure this out with more clarity, but here are some assumptions we're making.

1. special:model.yml is what holds the Marian YAML config:

import yaml
import numpy as np

marian_model = np.load('model.npz')
# The config is stored as a byte array; drop the trailing terminator byte before parsing.
marian_config = yaml.safe_load(bytes(marian_model['special:model.yml']).decode('ascii')[:-1])

[out]:

{'bert-train-type-embeddings': True, 'bert-type-vocab-size': 2, 'dec-cell': 'gru', 'dec-cell-base-depth': 2, 'dec-cell-high-depth': 1, 'dec-depth': 6, 'dim-emb': 1024, 'dim-rnn': 1024, 'dim-vocabs': [32000, 32000], 'enc-cell': 'gru', 'enc-cell-depth': 1, 'enc-depth': 6, 'enc-type': 'bidirectional', 'input-types': [], 'layer-normalization': False, 'right-left': False, 'skip': False, 'tied-embeddings': True, 'tied-embeddings-all': False, 'tied-embeddings-src': False, 'transformer-aan-activation': 'swish', 'transformer-aan-depth': 2, 'transformer-aan-nogate': False, 'transformer-decoder-autoreg': 'self-attention', 'transformer-dim-aan': 2048, 'transformer-dim-ffn': 4096, 'transformer-ffn-activation': 'swish', 'transformer-ffn-depth': 2, 'transformer-guided-alignment-layer': 'last', 'transformer-heads': 8, 'transformer-no-projection': False, 'transformer-postprocess': 'da', 'transformer-postprocess-emb': 'd', 'transformer-preprocess': 'n', 'transformer-tied-layers': [], 'transformer-train-position-embeddings': False, 'type': 'transformer', 'ulr': False, 'ulr-dim-emb': 0, 'ulr-trainable-transformation': False, 'version': 'v1.7.8 63e1cfe4 2019-02-11 21:04:00 -0800'}

2. encoder_Wemb and decoder_Wemb are what hold the encoder's and decoder's vocabulary lookup tables.

If we use a SentencePiece model, we can fetch the input embedding matrix like this:

import numpy as np
import sentencepiece as spm

class SentencePieceModel(spm.SentencePieceProcessor):
    def __init__(self, filename):
        super().__init__()
        self.Load(filename)

    def tokenize(self, text):
        return self.EncodeAsPieces(text)

    def tokenize_as_id(self, text):
        return [self.PieceToId(tok) for tok in self.tokenize(text)]

marian_model = np.load('model.npz')
print(marian_model['encoder_Wemb'].shape) # Looks like it's the right shape.

src_vocab = SentencePieceModel('vocab.src.spm')
text = "桂 正和(1962年12月10日 - )は日本の男性漫画家。プロダクション名は STUDIO K2R。福井県生まれの千葉県育ち。阿佐ヶ谷美術専門学校中退。血液型はA型。2015年より嵯峨美術大学客員教授。"
print(src_vocab.tokenize_as_id(text))

# Sentence matrix
print(marian_model['encoder_Wemb'][np.array(src_vocab.tokenize_as_id(text))])

[out]:

(32000, 1024)

[2, 9030, 4131, 586, 9, 9707, 31, 30, 136, 58, 73, 53, 39, 2, 8, 11, 25530, 22084, 3, 14203, 2629, 2, 30428, 339, 31, 92, 3, 12306, 421, 15969, 25289, 18366, 3, 9952, 3523, 600, 890, 7907, 25036, 25790, 3, 11700, 11, 59, 187, 3, 5242, 30, 277, 27446, 7907, 557, 2650, 1732, 2429, 3]

(56, 1024)

[[ 0.00128347  0.00823173 -0.02790011 ... -0.03825213 -0.02486344
  -0.01369067]
 [-0.00026568  0.00450178  0.01777439 ... -0.01954525  0.03472571
   0.01101604]
 [-0.06159153  0.02491889  0.00045542 ... -0.04704078  0.03137614
   0.00711994]
 ...
 [-0.02930407  0.01101725 -0.01065689 ... -0.04611297 -0.02822422
  -0.04514204]
 [-0.05477448  0.02869316  0.01451852 ... -0.00833258 -0.05361627
  -0.02943963]
 [-0.0120432   0.00310046 -0.01707869 ... -0.03847058 -0.05063346
  -0.04491809]]

3. Now the rest of the layers get a little fuzzy.

Though the naming makes sense, we'd like to confirm whether we're understanding it correctly.

print(marian_model.files)

[out]:


['decoder_Wemb',
 'decoder_ff_logit_out_b',
 'decoder_l1_context_Wk',
 'decoder_l1_context_Wo',
 'decoder_l1_context_Wo_ln_bias_pre',
 'decoder_l1_context_Wo_ln_scale_pre',
 'decoder_l1_context_Wq',
 'decoder_l1_context_Wv',
 'decoder_l1_context_bk',
 'decoder_l1_context_bo',
 'decoder_l1_context_bq',
 'decoder_l1_context_bv',
 'decoder_l1_ffn_W1',
 'decoder_l1_ffn_W2',
 'decoder_l1_ffn_b1',
 'decoder_l1_ffn_b2',
 'decoder_l1_ffn_ffn_ln_bias_pre',
 'decoder_l1_ffn_ffn_ln_scale_pre',
 'decoder_l1_self_Wk',
 'decoder_l1_self_Wo',
 'decoder_l1_self_Wo_ln_bias_pre',
 'decoder_l1_self_Wo_ln_scale_pre',
 'decoder_l1_self_Wq',
 'decoder_l1_self_Wv',
 'decoder_l1_self_bk',
 'decoder_l1_self_bo',
 'decoder_l1_self_bq',
 'decoder_l1_self_bv',
 'decoder_l2_context_Wk',
 'decoder_l2_context_Wo',
 'decoder_l2_context_Wo_ln_bias_pre',
 'decoder_l2_context_Wo_ln_scale_pre',
 'decoder_l2_context_Wq',
 'decoder_l2_context_Wv',
 'decoder_l2_context_bk',
 'decoder_l2_context_bo',
 'decoder_l2_context_bq',
 'decoder_l2_context_bv',
 'decoder_l2_ffn_W1',
 'decoder_l2_ffn_W2',
 'decoder_l2_ffn_b1',
 'decoder_l2_ffn_b2',
 'decoder_l2_ffn_ffn_ln_bias_pre',
 'decoder_l2_ffn_ffn_ln_scale_pre',
 'decoder_l2_self_Wk',
 'decoder_l2_self_Wo',
 'decoder_l2_self_Wo_ln_bias_pre',
 'decoder_l2_self_Wo_ln_scale_pre',
 'decoder_l2_self_Wq',
 'decoder_l2_self_Wv',
 'decoder_l2_self_bk',
 'decoder_l2_self_bo',
 'decoder_l2_self_bq',
 'decoder_l2_self_bv',
 'decoder_l3_context_Wk',
 'decoder_l3_context_Wo',
 'decoder_l3_context_Wo_ln_bias_pre',
 'decoder_l3_context_Wo_ln_scale_pre',
 'decoder_l3_context_Wq',
 'decoder_l3_context_Wv',
 'decoder_l3_context_bk',
 'decoder_l3_context_bo',
 'decoder_l3_context_bq',
 'decoder_l3_context_bv',
 'decoder_l3_ffn_W1',
 'decoder_l3_ffn_W2',
 'decoder_l3_ffn_b1',
 'decoder_l3_ffn_b2',
 'decoder_l3_ffn_ffn_ln_bias_pre',
 'decoder_l3_ffn_ffn_ln_scale_pre',
 'decoder_l3_self_Wk',
 'decoder_l3_self_Wo',
 'decoder_l3_self_Wo_ln_bias_pre',
 'decoder_l3_self_Wo_ln_scale_pre',
 'decoder_l3_self_Wq',
 'decoder_l3_self_Wv',
 'decoder_l3_self_bk',
 'decoder_l3_self_bo',
 'decoder_l3_self_bq',
 'decoder_l3_self_bv',
 'decoder_l4_context_Wk',
 'decoder_l4_context_Wo',
 'decoder_l4_context_Wo_ln_bias_pre',
 'decoder_l4_context_Wo_ln_scale_pre',
 'decoder_l4_context_Wq',
 'decoder_l4_context_Wv',
 'decoder_l4_context_bk',
 'decoder_l4_context_bo',
 'decoder_l4_context_bq',
 'decoder_l4_context_bv',
 'decoder_l4_ffn_W1',
 'decoder_l4_ffn_W2',
 'decoder_l4_ffn_b1',
 'decoder_l4_ffn_b2',
 'decoder_l4_ffn_ffn_ln_bias_pre',
 'decoder_l4_ffn_ffn_ln_scale_pre',
 'decoder_l4_self_Wk',
 'decoder_l4_self_Wo',
 'decoder_l4_self_Wo_ln_bias_pre',
 'decoder_l4_self_Wo_ln_scale_pre',
 'decoder_l4_self_Wq',
 'decoder_l4_self_Wv',
 'decoder_l4_self_bk',
 'decoder_l4_self_bo',
 'decoder_l4_self_bq',
 'decoder_l4_self_bv',
 'decoder_l5_context_Wk',
 'decoder_l5_context_Wo',
 'decoder_l5_context_Wo_ln_bias_pre',
 'decoder_l5_context_Wo_ln_scale_pre',
 'decoder_l5_context_Wq',
 'decoder_l5_context_Wv',
 'decoder_l5_context_bk',
 'decoder_l5_context_bo',
 'decoder_l5_context_bq',
 'decoder_l5_context_bv',
 'decoder_l5_ffn_W1',
 'decoder_l5_ffn_W2',
 'decoder_l5_ffn_b1',
 'decoder_l5_ffn_b2',
 'decoder_l5_ffn_ffn_ln_bias_pre',
 'decoder_l5_ffn_ffn_ln_scale_pre',
 'decoder_l5_self_Wk',
 'decoder_l5_self_Wo',
 'decoder_l5_self_Wo_ln_bias_pre',
 'decoder_l5_self_Wo_ln_scale_pre',
 'decoder_l5_self_Wq',
 'decoder_l5_self_Wv',
 'decoder_l5_self_bk',
 'decoder_l5_self_bo',
 'decoder_l5_self_bq',
 'decoder_l5_self_bv',
 'decoder_l6_context_Wk',
 'decoder_l6_context_Wo',
 'decoder_l6_context_Wo_ln_bias_pre',
 'decoder_l6_context_Wo_ln_scale_pre',
 'decoder_l6_context_Wq',
 'decoder_l6_context_Wv',
 'decoder_l6_context_bk',
 'decoder_l6_context_bo',
 'decoder_l6_context_bq',
 'decoder_l6_context_bv',
 'decoder_l6_ffn_W1',
 'decoder_l6_ffn_W2',
 'decoder_l6_ffn_b1',
 'decoder_l6_ffn_b2',
 'decoder_l6_ffn_ffn_ln_bias_pre',
 'decoder_l6_ffn_ffn_ln_scale_pre',
 'decoder_l6_self_Wk',
 'decoder_l6_self_Wo',
 'decoder_l6_self_Wo_ln_bias_pre',
 'decoder_l6_self_Wo_ln_scale_pre',
 'decoder_l6_self_Wq',
 'decoder_l6_self_Wv',
 'decoder_l6_self_bk',
 'decoder_l6_self_bo',
 'decoder_l6_self_bq',
 'decoder_l6_self_bv',
 'encoder_Wemb',
 'encoder_l1_ffn_W1',
 'encoder_l1_ffn_W2',
 'encoder_l1_ffn_b1',
 'encoder_l1_ffn_b2',
 'encoder_l1_ffn_ffn_ln_bias_pre',
 'encoder_l1_ffn_ffn_ln_scale_pre',
 'encoder_l1_self_Wk',
 'encoder_l1_self_Wo',
 'encoder_l1_self_Wo_ln_bias_pre',
 'encoder_l1_self_Wo_ln_scale_pre',
 'encoder_l1_self_Wq',
 'encoder_l1_self_Wv',
 'encoder_l1_self_bk',
 'encoder_l1_self_bo',
 'encoder_l1_self_bq',
 'encoder_l1_self_bv',
 'encoder_l2_ffn_W1',
 'encoder_l2_ffn_W2',
 'encoder_l2_ffn_b1',
 'encoder_l2_ffn_b2',
 'encoder_l2_ffn_ffn_ln_bias_pre',
 'encoder_l2_ffn_ffn_ln_scale_pre',
 'encoder_l2_self_Wk',
 'encoder_l2_self_Wo',
 'encoder_l2_self_Wo_ln_bias_pre',
 'encoder_l2_self_Wo_ln_scale_pre',
 'encoder_l2_self_Wq',
 'encoder_l2_self_Wv',
 'encoder_l2_self_bk',
 'encoder_l2_self_bo',
 'encoder_l2_self_bq',
 'encoder_l2_self_bv',
 'encoder_l3_ffn_W1',
 'encoder_l3_ffn_W2',
 'encoder_l3_ffn_b1',
 'encoder_l3_ffn_b2',
 'encoder_l3_ffn_ffn_ln_bias_pre',
 'encoder_l3_ffn_ffn_ln_scale_pre',
 'encoder_l3_self_Wk',
 'encoder_l3_self_Wo',
 'encoder_l3_self_Wo_ln_bias_pre',
 'encoder_l3_self_Wo_ln_scale_pre',
 'encoder_l3_self_Wq',
 'encoder_l3_self_Wv',
 'encoder_l3_self_bk',
 'encoder_l3_self_bo',
 'encoder_l3_self_bq',
 'encoder_l3_self_bv',
 'encoder_l4_ffn_W1',
 'encoder_l4_ffn_W2',
 'encoder_l4_ffn_b1',
 'encoder_l4_ffn_b2',
 'encoder_l4_ffn_ffn_ln_bias_pre',
 'encoder_l4_ffn_ffn_ln_scale_pre',
 'encoder_l4_self_Wk',
 'encoder_l4_self_Wo',
 'encoder_l4_self_Wo_ln_bias_pre',
 'encoder_l4_self_Wo_ln_scale_pre',
 'encoder_l4_self_Wq',
 'encoder_l4_self_Wv',
 'encoder_l4_self_bk',
 'encoder_l4_self_bo',
 'encoder_l4_self_bq',
 'encoder_l4_self_bv',
 'encoder_l5_ffn_W1',
 'encoder_l5_ffn_W2',
 'encoder_l5_ffn_b1',
 'encoder_l5_ffn_b2',
 'encoder_l5_ffn_ffn_ln_bias_pre',
 'encoder_l5_ffn_ffn_ln_scale_pre',
 'encoder_l5_self_Wk',
 'encoder_l5_self_Wo',
 'encoder_l5_self_Wo_ln_bias_pre',
 'encoder_l5_self_Wo_ln_scale_pre',
 'encoder_l5_self_Wq',
 'encoder_l5_self_Wv',
 'encoder_l5_self_bk',
 'encoder_l5_self_bo',
 'encoder_l5_self_bq',
 'encoder_l5_self_bv',
 'encoder_l6_ffn_W1',
 'encoder_l6_ffn_W2',
 'encoder_l6_ffn_b1',
 'encoder_l6_ffn_b2',
 'encoder_l6_ffn_ffn_ln_bias_pre',
 'encoder_l6_ffn_ffn_ln_scale_pre',
 'encoder_l6_self_Wk',
 'encoder_l6_self_Wo',
 'encoder_l6_self_Wo_ln_bias_pre',
 'encoder_l6_self_Wo_ln_scale_pre',
 'encoder_l6_self_Wq',
 'encoder_l6_self_Wv',
 'encoder_l6_self_bk',
 'encoder_l6_self_bo',
 'encoder_l6_self_bq',
 'encoder_l6_self_bv',
 'special:model.yml']

3a. Please do correct me if I'm getting it wrong: encoder_l* refers to the layers on the encoder side of the architecture, and decoder_l* refers to the layers on the decoder side.

3b. There are 6 layers, so the number in *_l\d_* is the layer number, running from 1 to 6 (indexing starts at 1).
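
As a sanity check on 3a/3b, a small snippet like the one below (reusing model.npz from above) groups the parameter names by side and layer number with a regex; the naming scheme itself is exactly the assumption being checked here:

import re
from collections import defaultdict

import numpy as np

marian_model = np.load('model.npz')

# Group parameter names as <side>_l<layer>_<rest>, e.g. 'encoder_l1_self_Wq'.
layers = defaultdict(list)
for name in marian_model.files:
    m = re.match(r'(encoder|decoder)_l(\d+)_(.+)', name)
    if m:
        layers[(m.group(1), int(m.group(2)))].append(m.group(3))

for (side, layer), params in sorted(layers.items()):
    print(side, layer, sorted(params))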

3c. And in the decoder there's a decoder_ff_logit_out_b layer that takes the output from one of the decoder_l6_* layers and produces the indices that map to the vocabulary. Is that correct?
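
Regarding 3c (and the argmax in 3i further down): since the config above has tied-embeddings: True, my assumption is that the output projection weight is shared with decoder_Wemb and only the bias decoder_ff_logit_out_b is stored separately. A rough sketch of that last step under this assumption (hidden is a hypothetical stand-in for the last decoder layer's output at one target position):

import numpy as np

marian_model = np.load('model.npz')
W_out = marian_model['decoder_Wemb']            # (32000, 1024); reused as the output projection if tied
b_out = marian_model['decoder_ff_logit_out_b']  # bias over the 32000-entry vocabulary

# Hypothetical stand-in for the last decoder layer's output at one target position.
hidden = np.zeros(W_out.shape[1], dtype=W_out.dtype)

logits = hidden @ W_out.T + b_out.reshape(-1)   # one score per vocabulary entry
predicted_id = int(np.argmax(logits))
print(predicted_id)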

3d. The *coder_l\d_self_* prefix refers to the self-attention layers in both the encoder and the decoder.

3e. And the first layers that take in the sentence embedding matrix (e.g. the output of point 2 above) are the encoder_l1_self_* layers.

3f. In attention, the QKV computation comes first: the weights *coder_l\d_self_Wq, *coder_l\d_self_Wk, *coder_l\d_self_Wv and the biases *coder_l\d_self_bq, *coder_l\d_self_bk, *coder_l\d_self_bv represent the query, key and value projections of the self-attention layer.

So the output of the sentence embedding matrix is fed through these 6 layers.

A cut-away question here: what do the *coder_l\d_self_Wo and *coder_l\d_self_bo parameters represent then? Are they the output layer after the QKV computation?

So after the QKV (self-attention) computation of the first layer, the result gets fed through encoder_l1_self_Wo and encoder_l1_self_bo?
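
For what it's worth, here is how I currently picture one self-attention sub-layer with those parameters: a plain NumPy sketch of standard multi-head attention, assuming 8 heads of size 128 (transformer-heads: 8, dim-emb: 1024) and assuming the weights are stored so that x @ W works; the real implementation in models/transformer.h may differ, and masking, dropout, residuals and layer norm are omitted:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_sketch(x, p, prefix='encoder_l1_self_', heads=8):
    """x: (seq_len, 1024) input states; p: the dict-like object returned by np.load."""
    Wq, Wk, Wv, Wo = (p[prefix + n] for n in ('Wq', 'Wk', 'Wv', 'Wo'))
    bq, bk, bv, bo = (p[prefix + n].reshape(-1) for n in ('bq', 'bk', 'bv', 'bo'))

    q, k, v = x @ Wq + bq, x @ Wk + bk, x @ Wv + bv                    # (seq_len, 1024) each

    seq_len, dim = q.shape
    head_dim = dim // heads                                            # 1024 / 8 = 128

    def split(t):                                                      # -> (heads, seq_len, head_dim)
        return t.reshape(seq_len, heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))    # (heads, seq_len, seq_len)
    context = (weights @ v).transpose(1, 0, 2).reshape(seq_len, dim)   # heads concatenated back

    # Wo/bo: the output projection applied after the heads are concatenated.
    return context @ Wo + bo

Under that reading, Wo and bo would indeed be the output projection applied to the concatenated heads right after the QKV computation.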

What about these layers: encoder_l1_self_Wo_ln_bias_pre and encoder_l1_self_Wo_ln_scale_pre?

How are they used? Are they applied after the QKV computation, before feeding into encoder_l1_self_Wo and encoder_l1_self_bo?
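
On the *_ln_scale_pre / *_ln_bias_pre parameters: the config above has transformer-preprocess: 'n' and transformer-postprocess: 'da', which I read as "normalize" for the preprocess step and "dropout + add residual" for the postprocess step, so my guess is that these are the scale and bias of a layer norm applied to the sub-layer input, roughly:

import numpy as np

def pre_layer_norm_sketch(x, p, prefix='encoder_l1_self_Wo_'):
    """Layer normalization over the model dimension, using the *_ln_scale_pre / *_ln_bias_pre parameters."""
    scale = p[prefix + 'ln_scale_pre'].reshape(-1)   # (1024,)
    bias = p[prefix + 'ln_bias_pre'].reshape(-1)     # (1024,)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + 1e-9) * scale + bias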

3g. After getting the outputs from *coder_l\d_self_Wo and *coder_l\d_self_bo, they're passed to the *coder_l\d_ffn_* feed-forward layers.

E.g. the outputs of encoder_l1_self_Wo and encoder_l1_self_bo would be passed to encoder_l1_ffn_W1 and encoder_l1_ffn_b1 and subsequently encoder_l1_ffn_W2 and encoder_l1_ffn_b2.

But how about these layers: encoder_l1_ffn_ffn_ln_bias_pre and encoder_l1_ffn_ffn_ln_scale_pre?

How are they used? Are they applied after the *coder_l\d_self_Wo and *coder_l\d_self_bo and before the *coder_l\d_ffn_* layers?
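
To make 3g concrete, here is a sketch of that feed-forward block, using swish as the activation since the config has transformer-ffn-activation: swish and transformer-dim-ffn: 4096; again this is only my reading, not lifted from transformer.h:

import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))     # x * sigmoid(x)

def ffn_sketch(x, p, prefix='encoder_l1_ffn_'):
    """x: (seq_len, 1024); W1: (1024, 4096), W2: (4096, 1024)."""
    h = swish(x @ p[prefix + 'W1'] + p[prefix + 'b1'].reshape(-1))
    return h @ p[prefix + 'W2'] + p[prefix + 'b2'].reshape(-1)

Under the same guess as in 3f, the *_ffn_ffn_ln_scale_pre / *_ffn_ffn_ln_bias_pre parameters would be the layer norm applied to the FFN input before W1.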

3h. At the end of the encoder_l* layers (specifically encoder_l6_ffn_W2 and encoder_l6_ffn_b2), the output is passed to the decoder_l2_* QKV layers. Is this correct?

Another cut-away: in that case, the decoder_l1_* layers are the QKV and FFN computation over the decoder tokens, is that right?

In that case, at the start of the sentence, what gets fed into decoder_Wemb and eventually into the decoder_l1_* layers? Is an <s> token padded in for the empty state?

3i. After propagating through the decoder layers from decoder_l2_* to decoder_l6_*, the final decoder_l6_ffn_W2 and decoder_l6_ffn_b2 output gets fed into decoder_ff_logit_out_b.

Then it does an argmax and outputs the predicted vocabulary index at the current state. Is my understanding of that correct too?

4. And what about the *coder_l\d_context_* layers? Are those only used at decoding time, for attention?

lkfo415579 commented 4 years ago

You should look at the source code, "models/transformer.h".

  1. The context layers only appear in the decoder; they form the second sub-layer of each decoder block (getting information from the encoder's last context output).
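
To restate that layout as a rough single-head sketch (masking, residuals, layer norm and the multi-head split are omitted; this is only my reading of the parameter names, not actual Marian code):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(p, prefix, queries, keys_values):
    """Single-head simplification: shows which parameters project which inputs."""
    q = queries @ p[prefix + 'Wq'] + p[prefix + 'bq'].reshape(-1)
    k = keys_values @ p[prefix + 'Wk'] + p[prefix + 'bk'].reshape(-1)
    v = keys_values @ p[prefix + 'Wv'] + p[prefix + 'bv'].reshape(-1)
    out = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return out @ p[prefix + 'Wo'] + p[prefix + 'bo'].reshape(-1)

def decoder_layer_sketch(p, layer, y, encoder_output):
    """One decoder block: self-attention over the target states first, then the
    *_context_* attention that reads the encoder's final output, then the FFN."""
    prefix = f'decoder_l{layer}_'
    y = attention(p, prefix + 'self_', queries=y, keys_values=y)
    y = attention(p, prefix + 'context_', queries=y, keys_values=encoder_output)
    # ...followed by the decoder_l<layer>_ffn_* feed-forward sub-layer.
    return y
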
emjotde commented 4 years ago

Closing.