penny9287 opened 5 years ago
Hi, currently the Transformer decoder only supports the multi-head scaled dot-product attention from the "Attention is All You Need" paper. If you provide multiple encoders, you can choose which attention combination strategy you want to use: one of `serial`, `parallel`, `hierarchical`, and `flat`.
I wonder how to specify the combination strategy for multiple encoders in the configuration file. Do you have any examples?
Just specify the `attention_combination_strategy` parameter in the Transformer decoder configuration. It can be one of `serial`, `parallel`, `hierarchical`, and `flat`.
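Here is a minimal sketch of what such a decoder section might look like, assuming an INI-style configuration in the spirit of Neural Monkey (which this issue appears to concern). Only `attention_combination_strategy` and its four values come from the answer above; the section name, `class` path, encoder references, and the remaining parameters are illustrative placeholders, not confirmed API.

```ini
; Minimal sketch of a multi-source Transformer decoder section.
; Only attention_combination_strategy (one of "serial", "parallel",
; "hierarchical", "flat") is taken from the answer above; every other
; name here is an illustrative placeholder.
[decoder]
class=decoders.transformer.TransformerDecoder
name="decoder"
; Two encoders defined in their own sections elsewhere in the file,
; e.g. [encoder_a] and [encoder_b].
encoders=[<encoder_a>, <encoder_b>]
; How attention over the multiple encoders is combined.
attention_combination_strategy="hierarchical"
vocabulary=<target_vocabulary>
data_id="target"
```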
I wonder how to modify the configuration file to train a multi-source Transformer model with different attention types.