ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License
410 stars 104 forks

some questions about multi-source based transformer model #828

Closed wyjllm closed 5 years ago

wyjllm commented 5 years ago

When I train a multi-source transformer model using Neural Monkey, I encounter some problems:

1. BPE: if I want to use BPE when training transformer models, do I have to use the BPE training scripts provided by Neural Monkey and add the [bpe_preprocess] and [bpe_postprocess] sections?
2. model_dimension: when writing the configuration file, I found a setting called model_dimension, the size of the hidden states of the decoder and encoder. When I set this parameter to the same size as the embedding size, the training results were terrible; when I made it smaller, it worked better. Does this value determine the speed of optimization? Does it have to be set to the same size as the embedding size, or can it be another value? And finally, what is an appropriate initial value for the learning rate?

jindrahelcl commented 5 years ago
  1. You have two options here. First, you can use the preprocess and postprocess objects to apply BPE at runtime. This is however not ideal, since you can apply BPE beforehand and store the preprocessed data on disk to speed up the training a bit. You can use e.g. fastBPE to extract a vocabulary and apply BPEs to your data. You will need to prepare a vocabulary file which is compatible with Neural Monkey's from_wordlist function from the vocabulary module. It takes a TSV file that looks like this:

    <pad>
    <s>
    </s>
    <unk>
    First
    word
    and
    so
    on
    [...]

    With this file format, you also need to set contains_frequencies and contains_header to False in the from_wordlist function. Note that the ordering of the four special tokens matters.
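    For reference, a vocabulary section in the INI configuration could then look roughly like this (a sketch; the section name and file path are placeholders, not taken from your experiment):

    ```ini
    ; Hypothetical section name and path -- adjust to your setup.
    [vocabulary]
    class=vocabulary.from_wordlist
    path="vocab.tsv"
    contains_header=False
    contains_frequencies=False
    ```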

  2. If you mean the model_dimension of the noam_decay function, it corresponds to the $d_{model}$ variable in the Attention is All You Need paper. I can't really help you with finding the right learning-rate schedule parameters; you need to try what works best for your data and the rest of the hyper-parameters.
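For concreteness, here is a minimal sketch of the schedule from the Attention is All You Need paper that a Noam-style decay implements (the function name `noam_lr` and the default values are illustrative, not Neural Monkey's API):

```python
def noam_lr(step, model_dimension=512, warmup_steps=4000):
    """Learning rate at a given (1-based) training step.

    Implements lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5),
    i.e. a linear warm-up followed by inverse-square-root decay.
    """
    return model_dimension ** -0.5 * min(step ** -0.5,
                                         step * warmup_steps ** -1.5)
```

Note that with this schedule, the rate rises linearly for the first warmup_steps steps, peaks at step warmup_steps, and then decays as the inverse square root of the step number; a larger model_dimension also scales the whole curve down, which is one reason changing it affects how training behaves.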

wyjllm commented 4 years ago

Thank you very much.