ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License

Optimization Verification test #745

Open kocmitom opened 6 years ago

kocmitom commented 6 years ago

Is it possible to compute an optimization verification test? When a sentence from the devset is translated and the translation differs from the reference sentence, I would like to find out whether the problem is due to beam search, which did not manage to find the correct translation, or due to the model, which scores the correct translation worse than the decoded translation.

jindrahelcl commented 6 years ago

Well, since the sentences from the devset are not used for optimization, I would say that it is probably caused by the model's inability to generate the exact reference sentence. Beam search is a heuristic that does not guarantee finding the most likely output sentence given the model parameters.

Did I understand correctly that by "correct" you mean the exact reference sentence? Do you have some pointers to a paper or an explanation of the term "optimization verification test"?

You could use the model as a scorer (which it is in fact trained to do) and score both the devset reference and the generated sentence. If the perplexity of the devset reference is lower than that of the generated sentence, you can "blame" beam search. If the perplexity of the reference sentence is higher, you can blame the model parameters.
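
To make that concrete, here is a minimal sketch in plain Python (not Neural Monkey code): sentence-level perplexity as the exponentiated mean of per-token cross-entropies, plus the decision rule from the paragraph above. The helper name and the numbers are made up for illustration.

```python
import math

def sentence_perplexity(token_xents):
    """exp of the mean per-token cross-entropy (negative log-probability)."""
    return math.exp(sum(token_xents) / len(token_xents))

reference_xents = [2.1, 0.4, 1.3, 0.9]    # hypothetical per-token losses of the reference
beam_output_xents = [1.8, 0.2, 0.7, 0.5]  # hypothetical losses of the beam-search output

ppl_ref = sentence_perplexity(reference_xents)
ppl_beam = sentence_perplexity(beam_output_xents)

if ppl_ref < ppl_beam:
    # The model prefers the reference, but the search did not find it.
    print("blame beam search")
else:
    # The model itself ranks its own output above the reference.
    print("blame the model")
```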

kocmitom commented 6 years ago

The label "optimization verification test" was given to it by Andrew Ng in Machine Learning Yearning - chapter 45. I didn't check other sources since I believed that he uses the proper term (but now I cannot find anything to it on Google, so maybe he invented the name). But it is basically what you say in the last paragraph.

It is not a problem with NM. It is just another type of evaluation, which tells you whether you should focus on the search algorithm (bigger beam, different length normalization, etc.) because of its inability to generate the reference, or whether the problem is in the model itself, which scores the reference worse than the output.

So is there an easy way to obtain a score for the reference output, or would I have to make huge changes in NM?

jindrahelcl commented 6 years ago

I don't know why, but I've rewritten this answer at least three times, each time with a different answer. In the end, I settled on this:

I think this could be done with something like the perplexity runner. You provide the sequences you want to score as the targets to the decoder and then measure the runtime (or train) loss on these targets. For this, you would need to (copy and) adapt the perplexity runner (which currently works only with classifiers) to also work with autoregressive decoders (i.e. objects that generate sequences).

To conclude, no, I don't think huge changes are necessary. I think most of the functionality is already there.

jindrahelcl commented 6 years ago

EDIT: Of course the perplexity runner already works with autoregressive decoders! I was just confused by my previous thoughts.

jindrahelcl commented 6 years ago

So, the solution:

1. Translate your data.
2. In the config, drop the runners that generate the sentences and replace them with the perplexity runner, which adds a new output data series that gets written to a file.
3. Run the config twice: a) with the reference sentences as s_target, b) with your translations as s_target.
4. Compare the numbers in the two output files (each should have one number per line and the same number of lines as your input files); see the sketch below.
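
For step 4, a minimal comparison sketch, assuming the two runs wrote their scores to `ref_scores.txt` and `hyp_scores.txt` (placeholder names for whatever output paths you configured), one number per line:

```python
# Compare the per-sentence numbers from the two perplexity-runner outputs.
with open("ref_scores.txt") as ref_file, open("hyp_scores.txt") as hyp_file:
    ref_scores = [float(line) for line in ref_file]
    hyp_scores = [float(line) for line in hyp_file]

assert len(ref_scores) == len(hyp_scores), "both runs must cover the same sentences"

search_errors = 0  # model prefers the reference, but beam search missed it
model_errors = 0   # model prefers its own output over the reference

for ref, hyp in zip(ref_scores, hyp_scores):
    if ref < hyp:  # lower perplexity = better under the model
        search_errors += 1
    else:
        model_errors += 1

print("blame the search:", search_errors)
print("blame the model: ", model_errors)
```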

jindrahelcl commented 6 years ago

... if you want to, change train_xents to runtime_xents in the perplexity runner and you will get another set of observations to inspect.

varisd commented 6 years ago

I think you want to use runtime_xents if you want to know the probability (in this case, perplexity) of the decoded sentence. train_xents uses training mode, where each word generated at a time step is conditioned on the gold output from the previous step, so the resulting probability/perplexity is not that of the decoded sequence.
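
To illustrate that distinction, here is a toy numeric sketch (plain Python; the one-step `MODEL` table and `GOLD` sequence are invented for the example, this is not Neural Monkey code):

```python
import math

# Toy stand-in for one decoder step: a table mapping the previous token to a
# next-token distribution. A real decoder conditions on the whole prefix; this
# is only meant to make the two conditioning schemes concrete.
MODEL = {
    "<s>":  {"a": 0.4, "b": 0.5, "</s>": 0.1},
    "a":    {"a": 0.1, "b": 0.8, "</s>": 0.1},
    "b":    {"a": 0.3, "b": 0.1, "</s>": 0.6},
    "</s>": {"a": 0.1, "b": 0.1, "</s>": 0.8},
}
GOLD = ["a", "b", "</s>"]

# Train mode (teacher forcing): each step is conditioned on the gold previous
# token, so the summed cross-entropy scores the gold sequence itself.
train_xent = 0.0
prev = "<s>"
for word in GOLD:
    train_xent -= math.log(MODEL[prev][word])
    prev = word

# Runtime mode: each step is conditioned on whatever the decoder produced at
# the previous step (here the argmax), so the loss reflects the decoding
# trajectory rather than the gold prefix.
runtime_xent = 0.0
prev = "<s>"
for word in GOLD:
    dist = MODEL[prev]
    runtime_xent -= math.log(dist[word])
    prev = max(dist, key=dist.get)  # feed back the model's own choice

print(f"train xent:   {train_xent:.3f}")    # ~1.65 with this toy table
print(f"runtime xent: {runtime_xent:.3f}")  # ~3.44: the prefix diverged from gold
```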

jindrahelcl commented 6 years ago

There is no "gold" input in this setup - you can replace the reference data with anything it will get fed into the decoder in training mode. There can be some interesting differences and maybe you can measure the exposure bias using these. But of course, runtime logits alone are perhaps more relevant than training logits.