Open howardyclo opened 6 years ago
This paper proposes "Cold Fusion", an approach for leveraging a pre-trained language model while training a neural sequence-to-sequence (Seq2Seq) model. In Cold Fusion, the Seq2Seq model is trained from scratch together with a fixed pre-trained language model, using a fine-grained gating mechanism to fuse the Seq2Seq hidden state with the logit output of the language model. They show that, by leveraging the RNN language model, Cold Fusion reduces word error rates by up to 18% compared to Deep Fusion in speech recognition. They also show that Cold Fusion models transfer more easily to new domains: with only 10% of the labeled data, they nearly fully transfer to the new domain.
This paper's direction is similar to #5, but in #5 the language model is trained from scratch.
Both back-translation and unsupervised pre-training are simple methods that require no change in the architecture.
They use a fine-grained gating mechanism (Yang et al. 2017) to fuse the hidden state of the Seq2Seq model with the language model's output logits (unnormalized scores; this differs from Deep Fusion), learning when to rely on the Seq2Seq model and when on the language model.
Why not use a normalized probability?
Note that the fusion in this paper differs from that of Yang et al. (2017), who fuse two vectors as follows:
h = f(v1, v2) = g ⊙ v1 + (1 − g) ⊙ v2
(Fusing two representations/vectors v1 and v2.)
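A minimal NumPy sketch of this element-wise gated fusion may make the formula concrete. The notes above do not say how the gate g is computed, so the sketch assumes it comes from a sigmoid over a linear map of the concatenated inputs; `gated_fusion`, `W`, and `b` are illustrative names, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v1, v2, W, b):
    """Return g * v1 + (1 - g) * v2 with an element-wise gate g in (0, 1)."""
    # Assumption: the gate is a sigmoid over a linear map of [v1; v2].
    g = sigmoid(W @ np.concatenate([v1, v2]) + b)
    # Each output dimension is a convex combination of v1 and v2.
    return g * v1 + (1.0 - g) * v2

rng = np.random.default_rng(0)
d = 4
v1, v2 = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)
h = gated_fusion(v1, v2, W, b)
```

Because g lies strictly between 0 and 1, every component of h falls between the corresponding components of v1 and v2. The gate interpolates between the two inputs rather than concatenating them.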
The following table shows the dev set perplexity for char-RNN language models trained on different datasets and evaluated on the source and target domains. This experiment shows that language models easily overfit to their training distribution, so a model trained on one corpus performs poorly on a different distribution. Thus, they use the model trained on the full dataset (which contains the source and target datasets along with some additional text) for all of the LM integration experiments.
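For reference, perplexity is the exponential of the average negative log-likelihood per character. The sketch below is only an illustration with made-up inputs, not the paper's evaluation code; `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(log_probs):
    """Perplexity from the natural-log probabilities a model assigns to each character."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Illustrative input: a model that assigns probability 0.25 to every character.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

A model that spreads its probability uniformly over four choices has perplexity 4, so lower values mean the model is less "surprised" by the dev set.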
Hi Howard, nice explanation of the Cold Fusion approach. I would like to replicate the same experiment for my seq2seq model. If you have the code in a GitHub repository, could you please share it with me? It would be very helpful.
Thanks, Naresh
@ellurunaresh Hi, I am not the author of this paper, so I do not have the code. Please implement it yourself (it seems easy if you just follow the equations) or search for an existing implementation online.
@ellurunaresh Hi. Have you managed to implement Cold Fusion or find any resources? I am about to start experimenting with it using BERT and ULMFiT.
Can you describe what the (tensorflow) operations would be to implement (4c)? I'm not sure what that line's syntax means...? (specifically the [ ] and ◦)
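On the notation: assuming (4c) is the paper's fused-state equation s_t^CF = [s_t; g_t ◦ h_t^LM], the square brackets denote concatenation of two vectors and ◦ denotes the element-wise (Hadamard) product. In TensorFlow these would correspond to `tf.concat` and element-wise `*`. The NumPy sketch below covers (4a) to (4c) under these assumptions: a single ReLU layer stands in for the paper's DNN, and all function and weight names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cold_fusion_state(s_t, lm_logits, W_lm, b_lm, W_g, b_g):
    """Sketch of the fused decoder state for a single time step."""
    # (4a) Project the LM logits to a hidden vector.
    # Assumption: one ReLU layer stands in for the DNN.
    h_lm = np.maximum(0.0, W_lm @ lm_logits + b_lm)
    # (4b) Fine-grained gate: one value in (0, 1) per dimension of h_lm.
    g_t = sigmoid(W_g @ np.concatenate([s_t, h_lm]) + b_g)
    # (4c) [a; b] is concatenation; the circle is an element-wise product.
    s_cf = np.concatenate([s_t, g_t * h_lm])
    return s_cf, g_t, h_lm

rng = np.random.default_rng(0)
d_s, d_lm, vocab = 3, 5, 7  # illustrative sizes
s_t = rng.normal(size=d_s)
lm_logits = rng.normal(size=vocab)
W_lm, b_lm = rng.normal(size=(d_lm, vocab)), np.zeros(d_lm)
W_g, b_g = rng.normal(size=(d_lm, d_s + d_lm)), np.zeros(d_lm)
s_cf, g_t, h_lm = cold_fusion_state(s_t, lm_logits, W_lm, b_lm, W_g, b_g)
```

The fused state has length d_s + d_lm: the first d_s entries are the unchanged Seq2Seq state and the rest are the gated LM features. For batched tensors the same operations apply along the last axis.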