hplt-project / bitextor-mt-models

Marian transformer configuration files #1

Open ZJaume opened 1 year ago

ZJaume commented 1 year ago

These are mostly borrowed from the predefined aliases behind Marian's --task option.
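If I remember correctly, marian --task transformer-base --dump-config expand prints the fully expanded preset, which is roughly what the blocks below started from.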

transformer-base

mini-batch-fit: True
shuffle-in-ram: true

after: 600000u
keep-best: True
save-freq: 5000
overwrite: True
disp-freq: 1000
disp-first: 10
quiet-translation: true
early-stopping: 10
early-stopping-on: first
valid-freq: 5000
valid-mini-batch: 64
valid-metrics:
    - chrf
    - ce-mean-words
    - bleu-detok

beam-size: 6
normalize: 1
exponential-smoothing: 0.0001
max-length: 200

cost-type: ce-mean-words
type: transformer
enc-depth: 6
dec-depth: 6
dim-emb: 512
transformer-heads: 8
transformer-dim-ffn: 2048
transformer-ffn-depth: 2
transformer-ffn-activation: swish
transformer-decoder-autoreg: self-attention

transformer-dropout: 0.1
label-smoothing: 0.1
layer-normalization: True

learn-rate: 0.0003
lr-warmup: 16000
lr-decay-inv-sqrt: 16000
lr-report: True
optimizer-params:
    - 0.9
    - 0.98
    - 1e-09
clip-norm: 0  # disable clip-norm because it's buggy
sync-sgd: true

transformer-big

mini-batch-fit: True
shuffle-in-ram: true

after: 600000u
keep-best: True
save-freq: 5000
overwrite: True
disp-freq: 1000
disp-first: 10
quiet-translation: true
early-stopping: 10
early-stopping-on: first
valid-freq: 5000
valid-mini-batch: 32
valid-metrics:
    - chrf
    - ce-mean-words
    - bleu-detok

beam-size: 6
normalize: 1.0
exponential-smoothing: 1e-4
max-length: 200

cost-type: ce-mean-words
type: transformer
enc-depth: 6
dec-depth: 6
dim-emb: 1024
transformer-heads: 16
transformer-dim-ffn: 4096
transformer-ffn-depth: 2
transformer-ffn-activation: swish
transformer-decoder-autoreg: self-attention

transformer-dropout: 0.1
label-smoothing: 0.1
layer-normalization: True

learn-rate: 0.0002
lr-warmup: 8000
lr-decay-inv-sqrt: 8000
lr-report: True
optimizer-params:
    - 0.9
    - 0.998
    - 1e-09
clip-norm: 0
sync-sgd: true
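
Either block can be saved as a YAML file (e.g. transformer-base.yml, the name is arbitrary) and passed to marian with -c/--config.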

To get a large enough batch size depending on the GPUs used, my use cases have been:
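Independent of the exact setups, here is a hedged, purely illustrative sketch of the options this usually boils down to (device ids and values are placeholders, not taken from the configs above):

devices:               # GPU ids to train on (placeholder)
  - 0
  - 1
  - 2
  - 3
workspace: 20000       # working memory in MB per GPU; mini-batch-fit sizes batches to fill it (placeholder value)
mini-batch-fit: true   # let Marian pick the largest mini-batch that fits the workspace
maxi-batch: 1000       # number of mini-batches to pre-load and sort by length
optimizer-delay: 2     # accumulate gradients over N batches to emulate a larger batch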

The vocabulary part would be:

dim-vocabs:
  - 32000
  - 32000
vocabs:
  - vocab.spm # shared vocab
  - vocab.spm
tied-embeddings-all: true  # tie source embeddings with target and output embeddings

to share the SentencePiece vocabulary and the embeddings. This makes Marian very easy to use, as it only needs raw text as input and handles all the tokenization itself.
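
As a hedged sketch of how this plugs into the data section of a training config (the file names are just placeholders):

train-sets:        # raw, untokenized parallel training data (placeholder names)
  - corpus.src
  - corpus.trg
valid-sets:
  - dev.src
  - dev.trg
dim-vocabs:
  - 32000
  - 32000
vocabs:            # the same .spm file on both sides = shared vocabulary
  - vocab.spm
  - vocab.spm
tied-embeddings-all: true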

For language pairs that don't share a script:

dim-vocabs:
  - 32000
  - 32000
vocabs:
  - vocab.src.spm  # separate vocabularies
  - vocab.trg.spm
tied-embeddings: true # tie only target and output embeddings

We might want to enable byte fallback for all the SentencePiece vocabularies to mitigate broken outputs when an unusual character shows up in the input:

sentencepiece-options: '--byte_fallback'  # passed through to the SentencePiece trainer
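
If the .spm files do not exist yet, a Marian binary compiled with SentencePiece support trains them on the fly from the training data; as far as I know, this option string is forwarded to that trainer.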