marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

--guided-alignment and tokenization #466

Closed: noe closed this issue 5 years ago

noe commented 5 years ago

The documentation about --guided-alignment lacks some details that I would like to confirm:

snukky commented 5 years ago

Yes, word alignments need to be extracted from text preprocessed in the same way as the training data.

Guided alignment can work with SentencePiece, but the vocab(s) need to be created first and the data needs to be preprocessed manually. I usually follow these steps:

  1. Start a training run with the desired configuration, but without --guided-alignment and with --after-batches 1, so that vocab.spm gets created; the training will stop after the first update.
  2. Preprocess the data manually, e.g. ./build/spm_encode --model=vocab.spm < raw.src > prep.src, and generate alignments with fast_align or an RNN model (see the sketch after this list).
  3. Start the final training in a clean directory, providing the previously generated vocab(s) and word alignments.
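
For step 2, a minimal sketch using the sentencepiece Python bindings instead of the spm_encode CLI (the raw.* and prep.* file names are placeholders):

```python
# Sketch: encode the raw corpus with the vocab.spm created in step 1, so that
# the aligner sees the same segmentation Marian will use during training.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="vocab.spm")

for side in ("src", "trg"):
    with open(f"raw.{side}", encoding="utf-8") as fin, \
         open(f"prep.{side}", "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```

The prep.src/prep.trg pair can then be joined into fast_align's source ||| target input format, and the resulting alignment file is what gets passed to --guided-alignment.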

We use the same approach for preparing files for --data-weighting when using the built-in SentencePiece.

Word alignments produced by marian-decoder refer to the segmented source and target sentences.

noe commented 5 years ago

Thanks a lot for the details.

Is there any reason to use Marian (step 1) to get the SentencePiece vocabularies instead of calling spm_train directly?

snukky commented 5 years ago

No particular reason, spm_train can be used directly.
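
For reference, a minimal sketch using the sentencepiece Python bindings (spm_train accepts the equivalent command-line options; file names and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train one joint vocabulary over both sides of the corpus, similar to what
# Marian does when the same vocab name is configured for both languages.
spm.SentencePieceTrainer.train(
    input="raw.src,raw.trg",  # comma-separated list of input files
    model_prefix="vocab",     # writes vocab.model and vocab.vocab
    vocab_size=32000,         # placeholder size
)
```

Marian recognizes SentencePiece vocabularies by the .spm extension, so vocab.model would be renamed to vocab.spm before training.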

noe commented 5 years ago

marian-decoder should receive unsegmented text as input (so that it performs the built-in SentencePiece segmentation), right?

If we feed unsegmented text and the alignments refer to the segmented source and target, how can we convert subword alignment indices to word indices (i.e., indices over space-separated tokens)?

snukky commented 5 years ago

This is not yet available in Marian. A possible solution is to write a script that segments the input and output sentences with spm_encode and uses the segmented texts to convert subword-level alignments to word-level alignments.
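
A minimal sketch of such a script (not part of Marian; it assumes SentencePiece's convention that a piece starting with the ▁ marker begins a new whitespace-delimited word, and hard alignments given as i-j pairs):

```python
import sys

MARK = "\u2581"  # "▁", the SentencePiece word-boundary marker

def subword_to_word_ids(pieces):
    """Map each subword position to the index of the word it belongs to."""
    ids, word = [], -1
    for piece in pieces:
        if piece.startswith(MARK):
            word += 1
        ids.append(max(word, 0))
    return ids

# Arguments: segmented source, segmented target, subword-level alignment file.
with open(sys.argv[1]) as src, open(sys.argv[2]) as trg, open(sys.argv[3]) as aln:
    for s, t, a in zip(src, trg, aln):
        s_ids = subword_to_word_ids(s.split())
        t_ids = subword_to_word_ids(t.split())
        pairs = {(s_ids[int(i)], t_ids[int(j)])
                 for i, j in (p.split("-") for p in a.split())}
        print(" ".join(f"{i}-{j}" for i, j in sorted(pairs)))
```

Usage would be something like python subword_to_word_align.py prep.src prep.trg align.sub > align.word (all file names hypothetical).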

emjotde commented 5 years ago

The Marian vocabulary generation used to be a bit smarter than the previous SentencePiece code: it does reservoir sampling over both corpus files when the vocabulary name is the same for both languages. That ensures that the distribution of samples is uniform over the entire corpus files.

I think newer SentencePiece versions do that now too. Our fork of SentencePiece has not been updated yet.
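
For illustration, a minimal sketch of reservoir sampling (Algorithm R), which keeps a uniform sample of k lines from a stream without knowing its length up front:

```python
import itertools
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)
        else:
            j = random.randint(0, n)  # new item enters with probability k / (n + 1)
            if j < k:
                sample[j] = item
    return sample

# Sampling over the concatenation of both corpus files yields a sample that is
# uniform over the entire data, as described above (file names are placeholders).
with open("raw.src") as f1, open("raw.trg") as f2:
    lines = reservoir_sample(itertools.chain(f1, f2), k=100_000)
```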

emjotde commented 5 years ago

I guess we can close this one?

520jefferson commented 5 years ago

https://github.com/marian-nmt/marian-dev/issues/466#issuecomment-513204668 @snukky Hi, do you have an example of using --data-weighting? I don't have an intuitive sense of how to assign weights. BTW, can you recommend some relevant papers about data weighting? Thank you!

emjotde commented 5 years ago

@520jefferson What are you trying to achieve?

520jefferson commented 5 years ago

@emjotde I try to add some slots to the src and tgt parallel sentences (mainly replacing numbers, times, URLs, etc.) during preprocessing before training. At decoding time, the numbers in a sentence are replaced by different slots before decoding; after decoding, each slot is replaced by its own slot translation result. For example (I use ##NUM##flag to represent a slot, where flag can be 0, 1, 2, 3, ...):

src: ##NUM##0 ##NUM##1
translation result: ##NUM##0 ##NUM##0
expected: ##NUM##1 *** ##NUM##0

In some cases a slot in the src is not translated into the corresponding slot in the tgt during decoding, so I think maybe I can increase the weight of the slot feature.
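
For what it's worth, a hypothetical sketch of generating per-token weights for that idea with word-level data weighting (the weight value is an assumption that needs tuning, and the tokens must line up with Marian's view of the target data, which breaks if the slot markers get split by subword segmentation):

```python
# Hypothetical sketch: one weight per target token, slot tokens upweighted.
SLOT_PREFIX = "##NUM##"  # slot marker from the example above
SLOT_WEIGHT = 3.0        # assumed boost; needs tuning
BASE_WEIGHT = 1.0

with open("train.trg", encoding="utf-8") as fin, \
     open("weights.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        weights = (str(SLOT_WEIGHT if tok.startswith(SLOT_PREFIX) else BASE_WEIGHT)
                   for tok in line.split())
        fout.write(" ".join(weights) + "\n")
```

The file would then be passed with --data-weighting weights.txt --data-weighting-type word, assuming those options behave as documented.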