I assume by source sentence, you mean the target translation of the source sentence. Basically, the model learns a parametric version of the BLEU function and uses that parametric model at test time.
Hi @rooshenas, thanks for your reply. By source sentence, I mean the sentence to be translated. It is very insightful to point out that the model is actually a parametric version of BLEU.
(Correct me if I'm wrong.) Let's say this parametric BLEU function is `B`, the translation model is `T`, and the task is to translate `src` to `trg`, with multiple candidates generated by `T` using a large beam. So:

- marginal-EBM: `T(src) --> candidates --> B(candidate) --> best candidate`
- joint-EBM: `T(src) --> candidates --> B(src, candidate) --> best candidate`

Do I understand this correctly? If so, my question is: how does marginal-EBM figure out the "BLEU" score with only the candidate as input?
Also, I am curious about the training data for the reranker. If the translation model's training data is also used to train the reranker, wouldn't this cause a distribution shift between the training set and the test set? Since you trained the translation model on the training set and then generate candidates on that same training set for reranking, wouldn't this give nearly perfect candidates, considering a neural model's capacity for memorization?
Thanks again for your help :)
BTW, this is what I observed on another translation dataset with 600k training samples and 2k test samples. The distribution gap between the training set and the test set is quite obvious.
Yes, your overview is correct. However, you have to consider that we use T as a proposer, and the B model provides importance weights (to do importance sampling from B). You need to sample from T rather than selecting the top-k beam (this is important). During training you have to resample two translations from the pool that you get from T, using the importance weights provided by B; this ensures that your samples come from the model distribution defined by B. At test time we need MAP inference, so you just need to rerank, but resampling during training is important since the training algorithm requires samples from B, not from T.
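Schematically, the resampling step looks like this (a simplified sketch; the `softmax(-E / temperature)` weighting and the no-replacement draw are illustrative choices, not our exact procedure):

```python
import torch

def resample_pair(candidates, energies, temperature=1.0):
    """Resample two translations from T's candidate pool using importance
    weights derived from the EBM energies. Lower energy => higher weight."""
    weights = torch.softmax(-torch.tensor(energies) / temperature, dim=0)
    idx = torch.multinomial(weights, num_samples=2, replacement=False)
    return [candidates[int(i)] for i in idx]

# Example: the lowest-energy candidate is the most likely to be drawn.
print(resample_pair(["cand_a", "cand_b", "cand_c"], [0.3, 1.7, 0.9]))
```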
Regarding your question: the energy-based model that describes marginal-EBM can model a multi-modal distribution, where each mode corresponds to one source sentence. However, since it does not have explicit conditioning, there are some inconsistencies, hence the lower performance. If you look at the provided Spearman coefficients, the negative-correlation bump is obvious, which I believe happens because of confusion between the translations of different source sentences.
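For concreteness, the per-source Spearman check looks like this (a sketch with illustrative numbers; whether you correlate energies or negated energies with BLEU is just a sign convention):

```python
from scipy.stats import spearmanr

# Illustrative data: per-candidate energies and BLEU scores for one source.
energies = [1.2, 0.4, 2.5, 0.9]
bleus = [28.0, 41.5, 17.3, 35.2]

# Lower energy should track higher BLEU, so a well-behaved reranker gives
# a strongly negative rho; a positive "bump" signals confused rankings.
rho, _ = spearmanr(energies, bleus)
print(rho)  # -1.0 for this perfectly anti-ranked example
```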
You can always calculate the oracle BLEU score, which gives the upper bound on the performance achievable by this model. A portion of the discrepancy between test and train is attributable to T, not B, which shows up in the oracle BLEU score.
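Computing the oracle BLEU is straightforward, e.g. (a sketch with sacrebleu and toy data):

```python
import sacrebleu

# For each source, keep the candidate that scores best against the
# reference, then corpus-score those oracle picks.
pools = [["the cat sat", "a cat sat down"], ["hello world", "hi world"]]
refs = ["the cat sat down", "hello world"]

oracle = [max(pool, key=lambda h, r=ref: sacrebleu.sentence_bleu(h, [r]).score)
          for pool, ref in zip(pools, refs)]
print(sacrebleu.corpus_bleu(oracle, [refs]).score)
```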
Hi, thanks for your reply! I'm still a little confused about the following:
Regarding the distribution shift: when training the reranker, the reranker sees candidates that all have high BLEU scores, because the translation model was trained on that same training set. But at test time, the reranker faces candidates with relatively low BLEU scores. Wouldn't this cause problems? In the paper *SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization*, they actually split the training data and cross-generate the data for reranking. So I'm curious how you deal with this problem. Do you generate the training data for the reranker like this? (Omitting the validation set for brevity; correct me if wrong.)

If the above is the case, `D_train_reranking` and `D_test_reranking` would have totally different data distributions, as I mentioned above. Do I understand your algorithm correctly?
1) Our model uses a probabilistic framework, where you need samples from the model (see contrastive divergence) for parameter estimation. It is important that the samples come from the final model (the energy-based model). Here, we generalize contrastive divergence with rank-based training, which requires two samples from the model rather than one (see the sketch after this list). Table 3 shows the importance of using two samples rather than gold data against one sample.
2) We didn't notice this as a major problem in our setting, as the difference between the oracle BLEU score and the reranker performance is consistent between test and train. The oracle BLEU score on the test set is considerably higher than beam decoding of the translation model, which indicates there is no meaningful distribution shift.
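A minimal sketch of the two-sample, rank-based update in point 1 (a simplified illustration, not the exact loss from the paper; `alpha` plays the role of a margin weight):

```python
import torch
import torch.nn.functional as F

def rank_loss(e_hi, e_lo, bleu_hi, bleu_lo, alpha=10.0):
    # e_hi / e_lo: energies of the higher- / lower-BLEU resampled candidates.
    # The higher-BLEU candidate should receive the LOWER energy, by a margin
    # that grows with the BLEU gap.
    return F.relu(alpha * (bleu_hi - bleu_lo) + e_hi - e_lo)

# Example: the higher-BLEU candidate is not yet ranked low enough,
# so the loss is positive and pushes e_hi down relative to e_lo.
print(rank_loss(torch.tensor(2.0), torch.tensor(1.5), 0.40, 0.30))
# tensor(1.5000)
```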
We don't use a large beam; we use multinomial sampling to generate the candidates. The rest is correct.
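For instance, with a Hugging Face seq2seq model, candidate generation looks roughly like this (a sketch; the checkpoint and generation parameters are illustrative, not our exact setup):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Draw N candidates by multinomial sampling instead of beam search.
name = "Helsinki-NLP/opus-mt-de-en"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tok("Das ist ein Test.", return_tensors="pt")
out = model.generate(**inputs, do_sample=True, num_return_sequences=8,
                     max_new_tokens=64)
candidates = tok.batch_decode(out, skip_special_tokens=True)
```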
```python
import torch
import torch.nn.functional as F

# With a large temperature, scores in [0, 1) are squashed toward zero,
# so the softmax becomes (nearly) uniform.
temperature = 1_000
scores = -torch.rand(1, 10) / temperature
print(F.softmax(scores, dim=1))
# tensor([[0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
#          0.1000, 0.1000, 0.1000, 0.1000, 0.1000]])
```
Thanks so much for your kind help!
1) In rank-based training, you compare two points at a time; the training algorithm may visit the rest of the N candidates in different epochs. Also, the energy values may become very large; in your example the scores are between 0 and 1. 2) This is a problem particular to beam decoding, and also to the mismatch between the training loss (cross-entropy) and the task score (BLEU).
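To illustrate the first point: once the energies themselves are large in magnitude, a temperature of 1,000 no longer flattens the distribution (the same snippet as above, with larger scores):

```python
import torch
import torch.nn.functional as F

# Large-magnitude energies still separate the candidates at T = 1_000,
# instead of collapsing to a uniform distribution.
temperature = 1_000
scores = -torch.tensor([[2000.0, 5000.0, 9000.0]]) / temperature
print(F.softmax(scores, dim=1))
# ~ tensor([[0.9517, 0.0474, 0.0009]])
```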
Hi, thanks for your reply. This is the revised version, using sampling-based generation on both the training set and the test set.
I think this may not be specific to the particular decoding method, but rather a basic ML question: can a model trained on the training set give consistent performance, with respect to some metric, on both the training set and the test set?
Hi, could you please share the average energy values in your experiments? In Table 8 of your paper, it seems that the energy values are quite small, and a large temperature would then cause a uniform distribution. BTW, is the energy value the average of the mBERT representation after a two-layer projection, i.e., in joint-EBM something like `e = average(W1(ActivationFunc(W2(mBERT([src, trg])))))`? Do I understand this correctly? And do you use `T=1000, alpha=10` across all your experiments?
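In code, my understanding of the joint-EBM energy would be roughly the following (a sketch; the hidden size and activation are my guesses):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class JointEnergy(nn.Module):
    """Sketch of my reading: encode [src, trg] with mBERT, apply a
    two-layer projection per token, and average over tokens."""
    def __init__(self, name="bert-base-multilingual-cased", hidden=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        d = self.encoder.config.hidden_size
        self.w2 = nn.Linear(d, hidden)
        self.w1 = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        e = self.w1(torch.relu(self.w2(h))).squeeze(-1)  # per-token energy
        mask = attention_mask.float()
        return (e * mask).sum(-1) / mask.sum(-1)         # average over tokens
```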
Also, is there a typo in equation (1), which is actually a margin ranking loss? The energy of a high-BLEU candidate should be higher than that of a low-BLEU candidate, so shouldn't the loss function be reversed?
I don't think that we have kept the training logs. However, the energy values get large to reduce the violation.
We didn't tune temperature and margin weight across different experiments.
The energy value of a prediction with a higher BLEU score must be lower.
Hi, thanks for your great work! After reading the paper, I have a question about the marginal-EBM. Does marginal-EBM only receive the N candidate translations and rank them? How does the model figure out which one is closer to the `source` sentence in terms of `BLEU` score when the `source` sentence is not exposed to the ranking model? Do I understand this correctly? Thanks for your help!