Need a few clarifications regarding how to handle rare words and heuristics in the configuration
Closed. VP007-py closed this 4 years ago.
- Does it follow this approach? A brief summary of it can be found here.
Not exactly that approach, but a similar one: see Sec. 3.3 of this paper. An unknown (target) word is replaced using alignment information. To do that, we assume that the attention mechanism acts as an alignment model. So, when we generate an unknown word (let's call it `unk`), we select the source word with the highest attention (let's call it `src_candidate`). Then we apply one of the following heuristics to replace it:

- `0`: Replace the unknown word with the aligned source word (`unk` -> `src_candidate`).
- `1`: Replace the unknown word with the translation of the aligned source word (`unk` -> `translation(src_candidate)`). The translation here is given by a statistical dictionary (e.g. from fast_align).
- `2`: Apply heuristic `1` if `src_candidate` starts with a lowercase letter; otherwise, apply heuristic `0`. The rationale behind this is that proper nouns (starting with a capital letter) should appear as they are in the translation.
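To make the three heuristics concrete, here is a minimal sketch in Python of the replacement step described above. All names (`replace_unk`, `src_words`, `attention`, `mapping`) are illustrative, not the toolkit's actual internals; `mapping` stands for the source-to-target dictionary (e.g. built with fast_align), falling back to a copy when a word is missing from it:

```python
def replace_unk(src_words, attention, mapping, heuristic=0):
    # Assume the attention weights act as an alignment: pick the
    # most-attended source word for this decoding step.
    src_candidate = src_words[max(range(len(src_words)), key=lambda i: attention[i])]
    if heuristic == 0:
        return src_candidate                              # copy the source word
    if heuristic == 1:
        return mapping.get(src_candidate, src_candidate)  # dictionary translation
    if heuristic == 2:
        # Copy capitalized words (likely proper nouns) verbatim; translate
        # the rest. For caseless scripts isupper() is False, so this falls
        # back to heuristic 1, as noted below.
        if src_candidate[:1].isupper():
            return src_candidate
        return mapping.get(src_candidate, src_candidate)
    raise ValueError('Unknown heuristic: %s' % heuristic)
```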
- How does heuristic 2 handle languages other than English, i.e., with respect to lowercasing?
All heuristics are language agnostic. If the source language has no casing information, heuristic 2 falls back to heuristic 1.
- What happens if `POS_UNK` is set to `False`?
Then your model can generate unknown words (see the discussion in the papers above).
An alternative to all these tricks is to use subwords instead of words. This is a standard practice in NMT and I recommend doing that.
- For heuristic 1, how is the alignment calculated for this toolkit?
You can use the utils/build_mapping_file.sh script to obtain it. You'll need to install fast_align and change the path to the executable in that script. It will create a `.pkl` file containing the alignments. You then need to set the `MAPPING` variable in the config file to point to this file:
https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L82
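For reference, the relevant part of config.py would then look roughly like this. `POS_UNK` and `MAPPING` appear in this thread; the `HEURISTIC` name and the `.pkl` file name are illustrative assumptions:

```python
# config.py (sketch): enable unk replacement and point MAPPING to the
# alignment dictionary created by utils/build_mapping_file.sh.
POS_UNK = True                             # replace unknown words at decoding time
HEURISTIC = 1                              # assumed name; 0, 1 or 2, as described above
MAPPING = DATA_ROOT_PATH + '/mapping.pkl'  # illustrative file name
```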
Okay! Will try that right now.
Finally, is it possible to run subword-based NMT with this?
Yes, they are compatible. But if you use subwords, the unk problem is unlikely to happen (at least with the Latin writing system or similar ones), because if the segmenter (say, BPE) finds an unknown word, it will segment it into known subwords. The extreme case of this is characters (e.g. `Word` -> `W@@ o@@ r@@ d`). So you end up with effectively no unknown words.
If you want to prevent this behavior, you can specify words that shouldn't be broken up. In subword-nmt you can do this with the `--glossaries` option.
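The `--glossaries` flag belongs to subword-nmt's apply-bpe command; the same thing can be done from its Python API, as in this sketch (the codes file name and glossary words are made up for illustration):

```python
import codecs
from subword_nmt.apply_bpe import BPE

# Load the learned codes and protect some tokens from segmentation.
with codecs.open('training_codes.joint', encoding='utf-8') as codes:
    bpe = BPE(codes, glossaries=['NASA', 'UNESCO'])  # these words stay whole

segmented = bpe.process_line('NASA confirmed the discovery')
# 'NASA' is emitted unsegmented; how the rest splits depends on the codes.
```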
Finally, note that a (standard) NMT system doesn't consider these linguistic features. It only models sequences. The elements of the sequence are encoded as indices, independently of their linguistic meaning (words, chars or subwords).
P.S.: When using subwords you may still have unknown words: an unseen character would still be considered an unknown word.
Hey, after learning BPE and reapplying it with a vocabulary filter from subword-nmt, I'm a bit unsure about `BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint'`. Assume the files `train.BPE.L1` and `train.BPE.L2` are obtained with subword-nmt from `train.L1` and `train.L2`.
Currently, only joint BPE is supported (see its section in subword-nmt). This generates a single BPE codes file, and its path should be set as `BPE_CODES_PATH`. If you want to use this file to segment your sentences, you should also set `TOKENIZATION_METHOD = 'tokenize_bpe'`.
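For completeness, this is roughly how such a joint codes file can be produced with subword-nmt's Python API (paths and the number of merge operations are illustrative; the learn-joint-bpe-and-vocab command is the CLI equivalent when you also want the vocabulary filter):

```python
import codecs
from itertools import chain
from subword_nmt.learn_bpe import learn_bpe

# Learn a single set of BPE codes over both sides of the parallel corpus,
# writing the file that BPE_CODES_PATH will point to.
with codecs.open('train.L1', encoding='utf-8') as l1, \
     codecs.open('train.L2', encoding='utf-8') as l2, \
     codecs.open('training_codes.joint', 'w', encoding='utf-8') as out:
    learn_bpe(chain(l1, l2), out, num_symbols=32000)
```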
In addition, if these are your first steps using subword techniques, I recommend making them explicit, so they are not obscured by other processes. I would, in config.py, set the detokenization options to revert the BPE tokenization (`DETOKENIZATION_METHOD = 'detokenize_bpe'` and `APPLY_DETOKENIZATION = True`):
https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L89-L91
You can check how to do the first 3 steps in this script, and you can find config examples under the examples directory.
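Putting the thread's settings together, the BPE-related part of config.py would look something like this (all four variable names come from this discussion; the path value is illustrative):

```python
# Segment input text with the learned joint codes...
TOKENIZATION_METHOD = 'tokenize_bpe'
BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint'

# ...and merge the '@@ ' pieces back together in the model's output.
DETOKENIZATION_METHOD = 'detokenize_bpe'
APPLY_DETOKENIZATION = True
```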
Thanks! I can obtain train.L1 and train.L2, plus dev.L1 and test.L1, all BPE-processed, after following the above steps.
So while training, `TOKENIZATION_METHOD = 'tokenize_bpe'` should be set, and while decoding, both `DETOKENIZATION_METHOD = 'detokenize_bpe'` and `APPLY_DETOKENIZATION = True` must be enabled in addition to the above?
Maybe an update about this script in the README?
If the files set in the config (https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L16-L18) have already been processed by BPE, you don't want to set `TOKENIZATION_METHOD = 'tokenize_bpe'`, because it would apply the segmentation twice. In that case, you should set `TOKENIZATION_METHOD = 'tokenize_none'`.
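So, for data that was already segmented offline, the sketch above changes only in the tokenization setting; assuming the same config variables as before:

```python
# The files in the config already contain '@@ '-segmented text, so don't
# re-apply BPE, but still detokenize the hypotheses before evaluation.
TOKENIZATION_METHOD = 'tokenize_none'
DETOKENIZATION_METHOD = 'detokenize_bpe'
APPLY_DETOKENIZATION = True
```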
> Maybe an update about this script in the README?
Yes, feel free to open a PR describing how you did this. I can review it.