EtienneAb3d opened this issue 6 years ago
The total number of words is less than twice the number of words in the 16M original sentences.
Here is a quite complex example (also showing how I'm doing some fusions between chunks in my analysis). With this example, I think you will see the interest of the method for the training, and what kinds of noise and errors occur.
FR: "Cependant , je vous demande, conformément à l'orientation désormais constamment exprimée par le Parlement européen et toute la Communauté européenne, d'intervenir auprès du président et du gouverneur du Texas, Monsieur Bush, en faisant jouer le prestige de votre mandat et de l'Institution que vous représentez, car c' est Monsieur Bush qui a le pouvoir de suspendre la condamnation à mort et de gracier le condamné. EN: "However, I would ask you, in accordance with the line which is now constantly followed by the European Parliament and by the whole of the European Community, to make representations, using the weight of your prestigious office and the institution you represent, to the President and to the Governor of Texas, Mr Bush, who has the power to order a stay of execution and to reprieve the condemned person."
Chunking (FR): cependant , je vous demande , conformément à l' orientation désormais constamment exprimée par le Parlement européen et toute la Communauté européenne , d' intervenir auprès du président et du gouverneur du Texas , Monsieur Bush , en faisant jouer le prestige de votre mandat et de l' Institution que vous représentez , car c' est Monsieur Bush qui a le pouvoir de suspendre la condamnation à mort et de gracier le condamné .
Chunking (EN): However , I would ask you , in accordance with the line which is now constantly followed by the European Parliament and by the whole of the European Community , to make representations , using the weight of your prestigious office and the institution you represent , to the President and to the Governor of Texas , Mr Bush , who has the power to order a stay of execution and to reprieve the condemned person .
After analysis, these chunk pairs are kept (in the 60M); some are merged chunks:
KEPT: du gouverneur du Texas / the Governor of Texas
KEPT: le Parlement européen / the European Parliament
KEPT: et de l' Institution / and the institution
KEPT: du président et / the President and
KEPT: le prestige de votre mandat / the weight of your prestigious office
KEPT: à mort et / and to reprieve
KEPT: je vous demande / I would ask you
KEPT: le pouvoir de suspendre / the power to order
KEPT: toute la Communauté européenne / the whole of the European Community
KEPT: vous représentez , car / you represent , to
KEPT: conformément à l' orientation / in accordance with the line
KEPT: de gracier le condamné / the condemned person
KEPT: d' intervenir auprès / a stay of execution
This is a very interesting experiment; we have investigated something "similar" (more or less) with online learning. My concerns are basically two:
If your experiment answers my two questions, that will be great! Thanks for sharing your results.
Cheers, Davide
An RNN does not learn phrases like an SMT system does. It learns word sequences, on both the encoder and decoder layers: what is the most probable word N after having seen words N-1, N-2, ...? The more you show it good word sequences, the better it will learn those word sequences. When the network learns that a word sequence A can be translated as either a word sequence B or a word sequence C, it just means that the probability of producing B or C when seeing A is higher than for other kinds of word sequences. Once these word sequences are learned, they can be used within larger sequences. It's just a question of how all the probabilities influence each other in the network. The long sentences (16M) are still in the training data (16M+60M); they aren't removed. The choice between several possible/probable word sequences is made according to these larger learned contexts.
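To make that concrete, here is a minimal toy sketch (not MMT's code; the tiny vocabulary, model size and module are purely illustrative) of "score the most probable next word given the previous ones":

# Minimal sketch (not MMT's code): a toy RNN language model that scores
# "what is the most probable word N after having seen words N-1, N-2, ...".
import torch
import torch.nn as nn

vocab = ["<s>", "je", "vous", "demande", "le", "parlement", "europeen", "</s>"]
stoi = {w: i for i, w in enumerate(vocab)}

class ToyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, prefix_ids):
        h, _ = self.rnn(self.emb(prefix_ids))   # encode the prefix
        return self.out(h[:, -1, :])            # logits for the next word

model = ToyLM(len(vocab))
prefix = torch.tensor([[stoi["<s>"], stoi["je"], stoi["vous"]]])
probs = torch.softmax(model(prefix), dim=-1)    # P(next word | "<s> je vous")
# Training on many word sequences (full sentences AND chunks) pushes these
# probabilities toward the sequences actually seen in the data.
print({w: round(float(probs[0, i]), 3) for i, w in enumerate(vocab)})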
@EtienneAb3d very interesting idea. If you don't mind, which tool did you use to do the chunking? Is it NLTK?
I used our own tool. We specialize in this kind of technical work. See for example our historical software Similis (now free).
For your information: after a few real tests, our translators are impressed by the quality of the new model, trained with this new chunk-enriched data. I'm now working on 2 evolutions: 1) better chunking and chunk alignment, 2) an algo that will use this chunking to evaluate original sentence pairs and remove the bad ones. Please, can you give me some entry points to be able to get multiple selected sentences at re-training time in MMT (see problem 2 in my original post)? I strongly encourage the MMT team to test something like this using their available SMT analyses.. ;-)
That sounds super cool @EtienneAb3d !
First things first:
Please, can you give me some entry points to be able to get multiple selected sentences at re-training time in MMT (see problem 2 in my original post)?
Edit the file <engine>/models/decoder/model.conf
and at the very beginning add these lines:
[settings]
memory_suggestions_limit = XXX
Where XXX is, of course, the maximum number of suggestions you want to get from the memory.
By the way, if you would like to contribute to the open-source project, we would be very happy to test and integrate your improvements. Is that something you would like to / can share in detail?
One last thing: I'm not quite sure whether you are using this technique for model training or for online adaptation - which one of the two?
Thanks, Davide
Thanks! I'm giving it a try right now! :)
For your question, it is all said in this sentence from my original post: "I had the idea to extract chunk pairs from the training sentences and add them to the training data."
Everything is done before training. I extract chunk pairs from the training sentences and create new data files with them, added to the original sentence data.
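As a rough illustration of this preparation step (the file names and the extract_chunk_pairs helper below are hypothetical placeholders, since the actual extraction code is proprietary):

# Hypothetical sketch of the data preparation: append extracted chunk pairs
# to the original parallel corpus before running ./mmt create.
# extract_chunk_pairs() stands in for the (proprietary) chunking/alignment step.

def extract_chunk_pairs(fr_sentence, en_sentence):
    """Placeholder: return a list of (fr_chunk, en_chunk) pairs."""
    return []

with open("train.fr", encoding="utf-8") as f_fr, \
     open("train.en", encoding="utf-8") as f_en, \
     open("train_enriched.fr", "w", encoding="utf-8") as out_fr, \
     open("train_enriched.en", "w", encoding="utf-8") as out_en:
    for fr, en in zip(f_fr, f_en):
        fr, en = fr.strip(), en.strip()
        # keep the original sentence pair ...
        out_fr.write(fr + "\n")
        out_en.write(en + "\n")
        # ... and add each extracted chunk pair as an extra "sentence" pair
        for fr_chunk, en_chunk in extract_chunk_pairs(fr, en):
            out_fr.write(fr_chunk + "\n")
            out_en.write(en_chunk + "\n")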
I can't share my code as open-source. It's made with a lot of heavy proprietary parts. Sorry.
I think it's something you can do directly with your SMT analyses.
Of course: as the chunk pairs are added to the training data, they are also used by MMT in its online adaptation, like the original sentence pairs.
@EtienneAb3d This sounds great. Can you give us an idea of how many chunks and full sentences are in your training data to achieve such great results? I see that you said above:
16M pairs of sentences plus 60M pairs of chunks. Did you use this number of strings in your training data?
Did you just use the default setup from MMT in terms of number of epochs, layers, etc.? Have you disabled the early stopping for example?
Yes: 16M sentence pairs + 50M chunk pairs (my new algo is a bit more selective than the first one), all in the training data.
To avoid stopping too early, and to keep the training to about one week of computation, I first used these parameters:
./mmt create
--learning-rate-decay-start-at 1000000
--learning-rate-decay-steps 50000
--learning-rate-decay 0.8
--validation-steps 50000
--checkpoint-steps 50000
-e FREN_New --neural fr en
/home/lm-dev8/TRAIN_DATA/train_FREN_FILTERED
--gpus 0
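Assuming these flags follow the usual exponential decay schedule (start step, decay interval, decay factor - an assumption on my part, not checked against the MMT source), the effective learning rate would evolve roughly like this:

# Assumed (not verified against MMT) exponential learning-rate decay:
# no decay before step 1,000,000, then multiply by 0.8 every 50,000 steps.
def effective_lr(step, base_lr=1.0,
                 decay_start=1_000_000, decay_steps=50_000, decay=0.8):
    if step < decay_start:
        return base_lr
    return base_lr * decay ** ((step - decay_start) // decay_steps + 1)

for s in (500_000, 1_000_000, 1_200_000, 2_000_000):
    print(s, effective_lr(s))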
And I also made these modifications in the file src/main/python/nmmt/NMTEngineTrainer.py
line 374:
if self.optimizer.lr < 0.01 and not perplexity_improves:
    break
if self.optimizer.lr < 0.001:
    break
To be more quantitative: in our translation interface, the translators have the choice between NMT, SMT, FullMatches, or FuzzyMatches.
BEFORE THE CHUNK ENRICHMENT: they took about 75% NMT and 25% SMT for post-editing.
AFTER THE CHUNK ENRICHMENT: they now take about 95% NMT and 5% SMT for post-editing.
Thank you so much @EtienneAb3d for sharing this valuable information.
One last thing: You said above that there were
many redundancies, and a lot of noise/errors
What did you do about them? Did you do any kind of cleanup prior to training?
Thanks again!
I did 2 things:
1) I improved my chunking/pairing algo, and rejected chunk pairs with too low a quality estimation.
2) I used the chunk coverage to build an estimation of segment-pair quality, and also rejected segment pairs (and their chunk pairs) with too low a quality estimation (an illustrative sketch of such a coverage score follows below).
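As an illustration only (this is not the actual algorithm; the threshold and token-level scoring are purely illustrative), a chunk-coverage quality estimate for a segment pair could look like this:

# Illustrative sketch (not the actual algorithm): estimate segment-pair quality
# as the fraction of source/target tokens covered by the kept chunk pairs.

def coverage_score(src_tokens, tgt_tokens, kept_chunk_pairs):
    covered_src, covered_tgt = set(), set()
    for src_chunk, tgt_chunk in kept_chunk_pairs:
        covered_src.update(src_chunk.split())
        covered_tgt.update(tgt_chunk.split())
    src_cov = sum(t in covered_src for t in src_tokens) / max(len(src_tokens), 1)
    tgt_cov = sum(t in covered_tgt for t in tgt_tokens) / max(len(tgt_tokens), 1)
    return min(src_cov, tgt_cov)

def keep_segment_pair(src, tgt, kept_chunk_pairs, threshold=0.5):
    # Reject the segment pair (and its chunks) when too little of it
    # is explained by reliable chunk alignments.
    return coverage_score(src.split(), tgt.split(), kept_chunk_pairs) >= threshold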
I have now a very interesting automatic chunk/terminology extractor, producing pairs with a very low error rate, and a nice automatic translation memory cleaner. ;-)
PS: I can give a demonstration on a provided data set for those who are interested. For the moment, the only optimized language pair is FR<->EN.
Thanks @EtienneAb3d for your reply. It sounds like you have a great solution in place. Thanks for letting us know that you can give a demo. I will keep this in mind.
@EtienneAb3d I am trying to disable early termination by implementing your code above, but I'm not sure where exactly it should go.
Here is the early termination code from MMT. Can you please let me know where your modified code should go?
    if len(self.state) >= self.opts.n_checkpoints:
        perplexity_improves = previous_avg_ppl - avg_ppl > 0.0001
        self._log('Terminate policy: avg_ppl = %g, previous_avg_ppl = %g, stopping = %r'
                  % (avg_ppl, previous_avg_ppl, not perplexity_improves))
        if not perplexity_improves:
            break
except KeyboardInterrupt:
    pass

return self.state
Should I simply replace:
if not perplexity_improves:
    break

with:

if self.optimizer.lr < 0.01 and not perplexity_improves:
    break
if self.optimizer.lr < 0.001:
    break
Thanks in advance for your help!
Yes.
You should finally get this:
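Something like this (reconstructed from the two snippets above; indentation approximate):

if len(self.state) >= self.opts.n_checkpoints:
    perplexity_improves = previous_avg_ppl - avg_ppl > 0.0001
    self._log('Terminate policy: avg_ppl = %g, previous_avg_ppl = %g, stopping = %r'
              % (avg_ppl, previous_avg_ppl, not perplexity_improves))
    # stop only once the learning rate has already decayed below these thresholds
    if self.optimizer.lr < 0.01 and not perplexity_improves:
        break
    if self.optimizer.lr < 0.001:
        break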
Thanks a million, @EtienneAb3d for getting back to me. I appreciate it.
Hi @EtienneAb3d and @davidecaroselli ,
It seems that during preprocessing, MMT excludes strings with low character count. So, my question to you @davidecaroselli : is there a way to force MMT to take such short strings? What is the character limit to include a string in the training data?
My question to @EtienneAb3d : did you find a way to achieve this in your solution? Did MMT use such strings in your training data?
Thanks to both of you!
I haven't noticed such a limitation. How did you see it?
Hi @EtienneAb3d, sorry for the confusion. We were training with a placeholder file that doesn't contain meaningful data. For words or product names that we don't want translated (i.e. want to protect), we replaced them with placeholders like 'xxyyzz' and tried to train with these placeholders, but MMT didn't pick them up for some reason. I am not sure why, so my first guess was a length limitation.
What do you mean by "MMT didn't pick them up"? How do you see this?
Be careful: MMT uses byte pair encoding. You need to be sure the placeholders will be added to the vocab, and not split into something else.
@EtienneAb3d I meant that after adding these placeholder pairs to the training data, I tried to translate some documents containing the exact same placeholders, but MMT didn't match them and sometimes changed them, e.g. from 'xxyyzz' to 'zzxxyy'. The placeholder file I used was very small though, just about 30-40 lines.
It's because of the byte pair encoding.
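A toy illustration (not MMT's actual BPE code) of why an unseen placeholder gets fragmented into pieces the model can then recombine badly:

# Toy illustration of byte pair encoding: only tokens that can be rebuilt from
# merges learned on the training corpus survive as whole units.

def apply_bpe(word, merges):
    """Greedy-apply learned merge operations to a single word."""
    symbols = list(word)
    changed = True
    while changed and len(symbols) > 1:
        changed = False
        for a, b in merges:
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i:i + 2] = [a + b]
                    changed = True
                else:
                    i += 1
    return symbols

# Merges learned from ordinary text do not contain the placeholder's pieces,
# so "xxyyzz" stays fragmented, and the decoder can reassemble the fragments
# in the wrong order (e.g. "zzxxyy").
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]
print(apply_bpe("the", merges))     # ['the'] -> known token, kept whole
print(apply_bpe("xxyyzz", merges))  # ['x', 'x', 'y', 'y', 'z', 'z'] -> fragmented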
@EtienneAb3d From your experience, what is the best way to deal with BPE issues? How can I prevent these made-up words? Thanks in advance!
Without a special parameter in the MMT code, I do not have a real solution. Try to use placeholders with a very, very simple form, like "xx" or "yy". You may also try non-alphabetical characters.
I'm closing this issue but if you have any other update or just want to continue with the discussion please feel free to re-open it!
Cheers, Davide
Why close it!? It was an evolution suggestion and an open discussion. The only reason I can see to close it is to definitively show that you aren't interested. Since it's not the first time, I'm starting to doubt you are interested in any suggestion that wasn't already on your own road-map. Perhaps I should do my own work on my own side without losing time sharing it with you.
Hi @EtienneAb3d
first of all, sorry if you felt offended by this action. We close issues when we suppose the discussion is over and no more results/ideas will be published. In this case the discussion hadn't been updated for 20 days, so I supposed it was over. Closing a discussion doesn't mean we are not interested, and it won't be "deleted" in any way.
On the other hand, we really appreciate contributions, both ideas and pull requests. Because we don't have such a large team, yes: sometimes we don't have enough resources to deviate from our road map and internal decisions. So, again, please don't be offended by the fact that we don't ourselves implement every idea coming from the community.
With that said, I understand that you are still working on this and will probably have updates for this discussion too. So please keep contributing (to this and/or other ideas); we will do our best to make our community enjoy using ModernMT, and please feel free to discuss and contribute.
Cheers, Davide
If you want the community to contribute, you need to show that suggestions and open discussions are alive. I can understand that you have limited resources. But if you shut everything down after a few weeks because nothing has happened on it, 1) it's a bit frustrating for the one trying to share something, and 2) a new incoming user will just see an empty place and won't really be encouraged to share or discuss something in turn...
@EtienneAb3d you are right.
Great idea, @EtienneAb3d. I suppose you have achieved significant improvements by adding additional n-gram pairs. I have two questions about your method:
Hi @lkluo,
We didn't do such a BLEU evaluation, because:
So, our evaluation was quite qualitative and subjective. What we noticed was:
This seems like something I would like to look into as my corpus is small, and no larger corpus is available.
Could something like "phrasal" be used to extract highly probable matching phrase sets?
Any other recommendations?
Note: I also plan to create a program that generates a synthetic corpus of simple sentence pairs based on grammatical rules. Any recommendations here?
You may have a look at my open-source chunking tool: https://github.com/EtienneAb3d/OpenNeuroSpell
Online demonstrator: http://nschunker.cubaix.com/
Term extraction with pairing and optional semantic grouping: http://lextract.cubaix.com/
Thanks @EtienneAb3d for sharing. How can you extract bilingual chunks? Is this possible in this release?
@mzeidhassan, my matchers aren't open-source yet. For small needs, you may use my online tool above. Ask me for larger processing.
When looking at the neural translation log, it appears to me that, very often, for the live re-training, MMT is using large (or very large) sentences where only a very small part is relevant, giving them a very low score. This often seems to have little effect, since the results rarely take them into account.
I had the idea to extract chunk pairs from the training sentences and add them to the training data. For 16M pairs of sentences (mainly MultiUN, Europarl, ...) I got 60M pairs of chunks (of course with many redundancies, and a lot of noise/errors).
1) the model should learn directly how to translate these small pieces of text, each given alone, rather than having to work them out by itself from very long sentences without explicit information on the sub-alignments.
2) knowing how sub-parts should be translated, it could be easier for it to learn how to translate long sentences.
3) perhaps this could bring better learning of the attention module. This alone can be a good point for the training and the final quality of the model.
4) the noise/errors on chunks can even be useful: a way to tell the network that "A" should be translated as "B" whatever its context (noisy/damaged surroundings). I think it's probably close to what the dropout parameter does. It could play a regularisation role.
5) for the live retraining, MMT will be able to select pertinent small pieces of text, rather than long, rich sentences. This could be much more efficient at translation time.
Problem 1: see #347
Problem 2: as the chunks are quite noisy, it would be very useful if the live retraining used much more than only one sentence. How can this be set?
As far as I understand, MMT is extracting sub-alignments, right? Would it be possible to automatically add these sub-alignments to the neural training data, on the fly during the training?
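For reference, this is roughly what the classic SMT "consistent phrase pair" extraction over a word alignment looks like (a sketch of the standard heuristic; whether MMT exposes its sub-alignments in exactly this form is an assumption):

# Sketch of the classic SMT phrase-pair extraction over a word alignment
# (Koehn-style "consistent phrase pairs"); not MMT's internal code.
# The full heuristic also extends over unaligned boundary words; omitted here.

def extract_phrase_pairs(src_tokens, tgt_tokens, alignment, max_len=7):
    """alignment: set of (src_index, tgt_index) word-alignment points."""
    pairs = []
    for s_start in range(len(src_tokens)):
        for s_end in range(s_start, min(s_start + max_len, len(src_tokens))):
            # target positions aligned to the source span
            tgt_points = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not tgt_points:
                continue
            t_start, t_end = min(tgt_points), max(tgt_points)
            if t_end - t_start >= max_len:
                continue
            # consistency: no alignment point may leave the box
            if any(t_start <= j <= t_end and not (s_start <= i <= s_end)
                   for (i, j) in alignment):
                continue
            pairs.append((" ".join(src_tokens[s_start:s_end + 1]),
                          " ".join(tgt_tokens[t_start:t_end + 1])))
    return pairs

src = "je vous demande".split()
tgt = "I would ask you".split()
align = {(0, 0), (1, 3), (2, 2)}   # je-I, vous-you, demande-ask
print(extract_phrase_pairs(src, tgt, align))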