rooshenas / ebr_mt


Question about the marginal-EBM #3

Closed Hannibal046 closed 2 years ago

Hannibal046 commented 2 years ago

Hi, thanks for your great work! After reading the paper, I have a question about the marginal-EBM. Does the marginal-EBM only receive N candidate translations and rank them? How does the model figure out which one is closer to the source sentence in terms of BLEU score when the source sentence is not exposed to the ranking model? Do I understand this correctly? Thanks for your help.

rooshenas commented 2 years ago

I assume by source sentence you mean the target translation of the source sentence. Basically, the model learns a parametric version of the BLEU function and uses the parametric model at test time.

Hannibal046 commented 2 years ago

Hi @rooshenas, thanks for your reply. By source sentence I mean the sentence to be translated. It is insightful to point out that the model is actually a parametric version of BLEU.

(Correct me if I'm wrong.) Let's say this parametric BLEU function is B, the translation model is T, the translation task is to translate src to trg, and multiple candidates are generated by T with a large beam. So in the marginal-EBM:

T(src) --> candidates --> B(candidates) --> best candidates

in the joint-EBM:

T(src) --> candidates --> B(src,candidate) --> best candidates
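
Just to make my mental model concrete (B here is a stand-in scorer, higher = better; the function names are made up):

    def rerank_marginal(candidates, B):
        # marginal-EBM: B only ever sees each candidate, never src
        return max(candidates, key=B)

    def rerank_joint(src, candidates, B):
        # joint-EBM: B scores each (src, candidate) pair
        return max(candidates, key=lambda c: B(src, c))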

Do I understand this correctly? So my question is: how does the marginal-EBM figure out the "BLEU" score with only the candidate as input? Also, I am curious about the training data for the reranker. If the training data of the translation model is also used to train the reranker, wouldn't this cause a distribution shift between the training set and the test set? Since you trained the translation model on the training set and then generate candidates on the training set for reranking, wouldn't this give a nearly perfect result, considering the neural model's amazing capacity for memorization?

Thanks again for your help :)

Hannibal046 commented 2 years ago

BTW, this is what I observed on another translation dataset composed of 600k training samples and 2k test samples. The difference in distribution between the training set and the test set is quite obvious.

[Screenshot: draft.ipynb, 2022-06-04]
rooshenas commented 2 years ago

Yes, your overview is correct. However, you have to consider that we use T as a proposer, and the B model provides importance weights (to do importance sampling from B). You need to sample from T rather than selecting the top-k beam (this is important). During training you have to resample two translations from the pool you get from T, using the importance weights provided by B; this ensures that your samples come from the model distribution defined by B. At test time we need MAP inference, so you just rerank, but resampling during training is important since the training algorithm requires samples from B, not from T.
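
Roughly, the resampling step looks like this (just an illustrative sketch of importance-weighted resampling, not our exact code; lower energy means more probable under B):

    import torch
    import torch.nn.functional as F

    def resample_pair(energies, temperature=1.0):
        # energies: 1-D tensor, one energy from B per candidate proposed by T.
        # Lower energy = more probable under B, so weights = softmax(-E / T).
        weights = F.softmax(-energies / temperature, dim=0)
        # Draw the two translations used for the rank-based update.
        return torch.multinomial(weights, num_samples=2, replacement=False)

    # Example: a pool of 12 candidates sampled from T.
    print(resample_pair(torch.randn(12)))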

Regarding your question, the energy-based model that describes the marginal-EBM can model a multi-modal distribution, meaning that each mode corresponds to one source sentence. However, since it does not have explicit conditioning, there would be some inconsistencies, hence the lower performance. If you look at the provided Spearman coefficient, the negative correlation bump is obvious, which I believe happens because of confusion between the translations of different sentences.

You can always calculate the oracle BLEU score, which gives the upper bound on the performance achievable by this model. A portion of the discrepancy between test and train is attributable to T, not B, which shows itself in the oracle BLEU score.
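
For reference, one way to compute the oracle BLEU score over the candidate pools (a sketch assuming sacrebleu; not our exact evaluation script):

    import sacrebleu

    def oracle_bleu(candidate_lists, references):
        # For each source, keep the candidate with the best sentence-level BLEU
        # against its reference; corpus BLEU over these picks upper-bounds any reranker.
        best = [max(cands, key=lambda c: sacrebleu.sentence_bleu(c, [ref]).score)
                for cands, ref in zip(candidate_lists, references)]
        return sacrebleu.corpus_bleu(best, [references]).score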

Hannibal046 commented 2 years ago

Hi, thanks for the reply! I'm still a little confused about the following questions:

  1. I don't quite understand why sampling two candidates using the importance weight provided by the reranker actually matters. As shown in the paper SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization, all candidates matter. And the paper Discriminative Reranking for Neural Machine Translation uses listwise ranking. In my opinion, your method shares some similarity with SimCLS in terms of pairwise ranking, and more candidates means more data to train the reranker. Could you please share more insight about this?
  2. What I mean by distribution shift is that when training the reranker, the reranker will see candidates that all have high BLEU scores, because the translation model was trained on the training set. But at test time, the reranker will face candidates with relatively low BLEU scores. Wouldn't this cause some problems? In the paper SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization, they actually split the training data and cross-generate data for reranking. So I'm curious about how you deal with this problem.

Do you generate training data for the reranker like this? (Omitting the validation set for brevity; correct me if wrong.)

If the above is the case, D_train_reranking and D_test_reranking should come from totally different data distributions, as I mentioned with the picture above. Do I understand your algorithm correctly?

rooshenas commented 2 years ago

1) Our model uses a probabilistic framework, where you need samples from the model (see contrastive divergence) for parameter estimation. It is important that the samples come from the final model (the energy-based model). Here, we generalize contrastive divergence with rank-based training, which requires two samples from the model rather than one. Table 3 shows the importance of using two samples rather than gold data against one sample.

2) We didn't notice this as a major problem in our setting, as the difference between the oracle BLEU score and the reranker performance is consistent between test and train. The oracle BLEU score on the test set is considerably higher than beam decoding with the translation model, which indicates there is no meaningful distribution shift.

We don't use a large beam; we use multinomial sampling to generate the candidates. The rest is correct.
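
For illustration, generating the candidate pool by multinomial sampling looks roughly like this (a sketch assuming a Hugging Face-style seq2seq model; the model name is just an example):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = "Helsinki-NLP/opus-mt-de-en"  # example translation model T
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    inputs = tokenizer("Ein Beispielsatz.", return_tensors="pt")
    # do_sample=True with num_beams=1 draws candidates from T's distribution
    # instead of taking the top-k beam.
    outputs = model.generate(**inputs, do_sample=True, num_beams=1,
                             num_return_sequences=16, max_new_tokens=64)
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)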

Hannibal046 commented 2 years ago
  1. But why exactly two samples? Suppose you use multinomial sampling to generate N candidates but only pick 2 samples to serve as data for parameter estimation; the other N-2 candidates also have some sort of ranking relations with respect to end metrics like BLEU, so wouldn't this be a kind of computational waste? Also, an extremely large temperature (1_000 in the paper) seems to cause a uniform distribution, like this:
    
    import torch
    import torch.nn.functional as F

    # With temperature = 1000, the scores land in [-1e-3, 0], so the softmax is nearly uniform.
    temperature = 1_000
    scores = -torch.rand(1, 10) / temperature
    F.softmax(scores, dim=1)

tensor([[0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000]])

  2. But why would something like this happen? Isn't it bad machine learning practice to generate translations on the dataset the model was trained on? Isn't this a case of 100% label leakage?

Thanks so much for your kind help !

rooshenas commented 2 years ago

1) In rank-based training, you compare two points; the training algorithm may visit the rest of the N candidates in different epochs. The energy values may become very large, whereas in your example the scores are between 0 and 1.

2) This is a particular problem with beam decoding, and also with the mismatch between the training loss (CE) and the task score (BLEU).

Hannibal046 commented 2 years ago

Hi, thanks for the reply. This is the revised version using sampling-based generation on the training set and the test set.

[Screenshot: draft.ipynb, 2022-06-05]

I think this may not be about the particular decoding method but rather a basic ML question: can a model trained on the training set give consistent performance, with respect to some metric, on both the training set and the test set?

Hannibal046 commented 2 years ago
  • In rank-based training, you compare two points; the training algorithm may visit the rest of the N candidates in different epochs. The energy values may become very large, whereas in your example the scores are between 0 and 1.
  • This is a particular problem with beam decoding, and also with the mismatch between the training loss (CE) and the task score (BLEU).

Hi, could you please share the average energy value in your experiments? In Table 8 of your paper, it seems that the energy value is quite small, and a large temperature will cause a uniform distribution. BTW, the energy value is the average of the mBERT representation after a two-layer projection, do I understand this correctly? Do you use T=1000, alpha=10 across all your experiments?

energy_value in joint_ebm: e = average(W1(ActivationFunc(W2(mBert([src,trg])))))
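
i.e., something roughly like this (a hypothetical sketch of my reading, with made-up layer names, assuming Hugging Face transformers):

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class JointEnergy(nn.Module):
        # Two-layer projection on top of mBERT token states, averaged into
        # one scalar energy per (src, trg) pair -- my reading of the paper.
        def __init__(self, name="bert-base-multilingual-cased", hidden=512):
            super().__init__()
            self.mbert = AutoModel.from_pretrained(name)
            self.w2 = nn.Linear(self.mbert.config.hidden_size, hidden)
            self.w1 = nn.Linear(hidden, 1)

        def forward(self, input_ids, attention_mask):
            h = self.mbert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            e = self.w1(torch.relu(self.w2(h)))   # (batch, seq_len, 1)
            return e.squeeze(-1).mean(dim=1)      # average over tokens -> scalar energy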

Also, is there a typo in equation (1), which is actually a margin ranking loss? The energy of the high-BLEU candidate should be higher than that of the candidate with low BLEU, so should the loss function be reversed?

rooshenas commented 2 years ago

I don't think we have kept the training logs. However, the energy values get large to reduce the violation.
We didn't tune the temperature and margin weight across different experiments. The energy value of a prediction with a higher BLEU score must be lower.
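
In other words, the pairwise training signal is roughly the following margin loss (a simplified sketch, not the exact implementation; alpha weights the margin by the BLEU gap):

    import torch

    def pairwise_margin_loss(e_better, e_worse, bleu_better, bleu_worse, alpha=10.0):
        # The candidate with the higher BLEU must get the LOWER energy;
        # the required margin grows with the BLEU difference.
        margin = alpha * (bleu_better - bleu_worse)
        return torch.clamp(margin + e_better - e_worse, min=0.0)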