To get a large enough batch size depending on the GPUs used, my use cases have been the following (a config sketch follows the list):
transformer-base on 4x 12GB GPUs: workspace 8000 and optimizer-delay 4.
transformer-base on 1x 12GB GPU: workspace 8000 and optimizer-delay 8.
transformer-big on 4x 40GB GPUs: workspace 30000 and optimizer-delay 2.
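As a rough sketch, the first setup above could be expressed in a Marian config file like this (the devices list and mini-batch-fit are assumed values for a typical 4-GPU setup, not taken from the list above; workspace is in MB):
devices: [0, 1, 2, 3]   # assumed: train on 4 GPUs
mini-batch-fit: true    # assumed: let Marian pick the largest batch that fits the workspace
workspace: 8000         # MB of GPU memory preallocated for batches
optimizer-delay: 4      # accumulate gradients over 4 steps before each optimizer update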
The vocabulary part would be:
dim-vocabs:
- 32000
- 32000
vocabs:
- vocab.spm # shared vocab
- vocab.spm
tied-embeddings-all: true # tie source embeddings with target and output embeddings
to share the SentencePiece vocab and embeddings. This makes Marian very easy to use, as it only needs raw text as input and handles all the tokenization itself.
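As an illustration of that raw-text workflow, a training invocation could look roughly like this (file names are placeholders; when Marian is built with SentencePiece support, the .spm vocab is trained on the fly if the file does not exist yet):
# corpus.src and corpus.trg are plain, untokenized text files
marian --task transformer-base \
       --train-sets corpus.src corpus.trg \
       --vocabs vocab.spm vocab.spm \
       --dim-vocabs 32000 32000 \
       --tied-embeddings-all \
       --model model/model.npz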
For languages that don't share a script:
dim-vocabs:
- 32000
- 32000
vocabs:
- vocab.src.spm # separate vocabs
- vocab.trg.spm
tied-embeddings: true # tie only target and output embeddings
We might want to enable byte fallback for all the SentencePiece vocabs to mitigate broken outputs when an unexpected character comes in:
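One way to do this, assuming Marian is compiled with SentencePiece support, is to pass the flag through to the SentencePiece trainer; note that it only affects vocabs that Marian trains on the fly:
sentencepiece-options: "--byte_fallback=true"   # unknown characters decompose into raw bytes instead of <unk>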