lvapeab / nmt-keras

Neural Machine Translation with Keras
http://nmt-keras.readthedocs.io
MIT License

Regarding Rare Words/OOV Tokens? #130

Closed VP007-py closed 4 years ago

VP007-py commented 4 years ago

Need a few clarifications regarding how to handle rare words and heuristics in the configuration

lvapeab commented 4 years ago

Not exactly that approach, but a similar one: see Sec. 3.3 of this paper. An unknown (target) word is replaced using alignment information. To do that, we assume that the attention mechanism acts as an alignment model. So, when we generate an unknown word (let's call it unk), we select the source word (let's call it src_candidate) with the highest attention weight. Then we apply one of the heuristics to replace it:
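Roughly, the heuristics work like this (a minimal sketch of my reading of the paper, not the exact toolkit code; the function and variable names are illustrative):

```python
# Hedged sketch of the unk-replacement heuristics; names here are illustrative,
# not the nmt-keras API.
def replace_unk(src_words, attention_row, mapping, heuristic=0):
    """Replace a generated <unk> using the most-attended source word."""
    # src_candidate: source word with the highest attention weight.
    src_candidate = src_words[max(range(len(src_words)),
                                  key=lambda i: attention_row[i])]
    if heuristic == 0:
        # Heuristic 0: copy the source word verbatim.
        return src_candidate
    if heuristic == 1:
        # Heuristic 1: look the source word up in a source->target mapping
        # (e.g. built with fast_align); fall back to copying it.
        return mapping.get(src_candidate, src_candidate)
    # Heuristic 2: use the mapping only for lowercased source words
    # (uppercased words are likely proper nouns and are copied as-is).
    if src_candidate[0].islower():
        return mapping.get(src_candidate, src_candidate)
    return src_candidate
```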

  • How does heuristic 2 handle languages other than English, i.e., with regard to lowercasing?

All heuristics are language agnostic. If the source language has no casing information, heuristic 2 falls back to heuristic 1.

  • What happens if POS_UNK is set to False?

Then your model can generate unknown words (see the discussion in the papers above).


An alternative to all these tricks is to use subwords instead of words. This is standard practice in NMT and I recommend doing that.

VP007-py commented 4 years ago

For heuristic 1, how is the alignment calculated in this toolkit?

lvapeab commented 4 years ago

You can use the utils/build_mapping_file.sh script to obtain it. You'll need to install fast_align and change the path to the executable in that script. It will create a .pkl file containing the alignments. You then need to set the MAPPING variable in the config file to point to this file:

https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L82
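For example, in config.py (a sketch; the .pkl name is just an example of what the script produces for your data):

```python
# config.py sketch: point MAPPING to the alignments .pkl built by
# utils/build_mapping_file.sh (the file name here is only an example).
MAPPING = DATA_ROOT_PATH + '/mapping.pkl'
POS_UNK = True   # replace unknown words using the attention weights
HEURISTIC = 1    # e.g. 1 = translate the aligned source word with the mapping
```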

VP007-py commented 4 years ago

Okay! Will try that right now.

Finally, is it possible to run subword-based NMT with this?

lvapeab commented 4 years ago

Yes, they are compatible. But if you use subwords, the unk problem is unlikely to happen (at least with the Latin writing system or similar ones). This is because, if the segmenter (say, BPE) finds an unknown word, it will segment it into known subwords. The extreme case of this is characters (e.g. Word -> W@@ o@@ r@@ d). So you end up with effectively no unknown words.

If you want to prevent this behavior, you can set up words that shouldn't be broken up. In subword-nmt you can do this with the --glossaries option.
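For instance, with the subword-nmt Python API (a sketch equivalent to the --glossaries CLI flag; the file name and glossary entries are placeholders):

```python
# Sketch: apply BPE but keep glossary words intact (subword-nmt Python API).
from subword_nmt.apply_bpe import BPE

with open('training_codes.joint', encoding='utf-8') as codes:
    bpe = BPE(codes, glossaries=['NASA'])  # 'NASA' will never be split

print(bpe.process_line('The NASA probe reached Ganymede\n'))
# An unseen word like 'Ganymede' gets segmented into known subwords,
# while 'NASA' is left as a single token.
```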

Finally, note that a (standard) NMT system doesn't consider these linguistic features. It only models sequences. The elements of the sequence are encoded as indices, independently of their linguistic meaning (words, chars or subwords).

P.S.: When using subwords you may still have unknown words: an unseen character would still be considered an unknown word.

VP007-py commented 4 years ago

Hey, after learning BPE and reapplying it with the vocabulary filter from subword-nmt, I'm not sure how to proceed.

It's a bit ambiguous what BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint' should point to.

Assume files train.BPE.L1 and train.BPE.L2 are obtained with subword-nmt from train.L1 and train.L2.

lvapeab commented 4 years ago

Currently, only joint BPE is supported (see its section in subword-nmt). This generates a single BPE codes file, and its path should be set as BPE_CODES_PATH. If you want to use this file to segment your sentences, you should also set TOKENIZATION_METHOD = 'tokenize_bpe'.
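In other words, something like this in config.py (a sketch, using the file name from your message):

```python
# config.py sketch: joint BPE codes plus on-the-fly BPE tokenization.
BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint'
TOKENIZATION_METHOD = 'tokenize_bpe'  # segment input sentences with the codes above
```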

In addition, if these are your first steps using subword techniques, I recommend making them explicit, so they are not obscured by other processes. I would:

  1. Learn a BPE from the training data.
  2. Apply it to my training data (source and target).
  3. Apply it to the source validation/test data, leaving the target as it is. This is because we don't want to evaluate MT quality on segmented data.
  4. In config.py, set the detokenization options to revert the BPE tokenization (DETOKENIZATION_METHOD = 'detokenize_bpe' and APPLY_DETOKENIZATION = True): https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L89-L91
  5. Train and evaluate as usual. This way, you'll train with BPE but generate and evaluate sentences without the segmentation.

You can check how to do the first 3 steps in this script and you can find config examples under the examples directory.
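If it helps, here is a rough sketch of steps 1-3 with the subword-nmt Python API (the repo script does the same via the command line; the file names and number of merges are placeholders):

```python
# Hedged sketch of steps 1-3; file names and merge count are placeholders.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Learn a joint BPE model from the concatenated training data.
with codecs.open('train.joint', encoding='utf-8') as infile, \
     codecs.open('training_codes.joint', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

with codecs.open('training_codes.joint', encoding='utf-8') as codes:
    bpe = BPE(codes)

def segment(src_path, dst_path):
    # Apply the learned codes line by line.
    with codecs.open(src_path, encoding='utf-8') as fin, \
         codecs.open(dst_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(bpe.process_line(line))

# 2. Segment source and target training data.
segment('train.L1', 'train.BPE.L1')
segment('train.L2', 'train.BPE.L2')
# 3. Segment only the source side of the validation/test data;
#    the target side stays unsegmented for evaluation.
segment('dev.L1', 'dev.BPE.L1')
segment('test.L1', 'test.BPE.L1')
```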

VP007-py commented 4 years ago

Thanks! After following the above steps I can obtain Train.L1 and Train.L2, plus dev.L1 and test.L1, all BPE-processed.

So while training, TOKENIZATION_METHOD=tokenize_bpe should be set, and while decoding, both DETOKENIZATION_METHOD=detokenize_bpe and APPLY_DETOKENIZATION=True must be enabled in addition to the above?

Maybe an update about this script in the README?

lvapeab commented 4 years ago

If the files set in the config (https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L16-L18) have already been processed by BPE, you don't want to set TOKENIZATION_METHOD=tokenize_bpe, because it would apply the segmentation twice. In that case you should set TOKENIZATION_METHOD=tokenize_none.
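So, for your setup, the relevant part of the config would look roughly like this (a sketch):

```python
# config.py sketch when the data files are already BPE-segmented on disk.
TOKENIZATION_METHOD = 'tokenize_none'     # don't segment again when loading
APPLY_DETOKENIZATION = True               # undo BPE on the generated hypotheses
DETOKENIZATION_METHOD = 'detokenize_bpe'  # joins the '@@ ' pieces back together
```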

Maybe an update about this script in the README?

Yes, feel free to open a PR describing how you did this. I can review it.