Closed: noe closed this issue 5 years ago
Yes, word alignments need to be extracted from a text pre-preprocessed in the same way as training data.
The guided alignment can work with SentencePiece, but the vocab(s) need to be created first and the data needs to be preprocessed manually. I usually follow these steps:

1. Run Marian without `--guided-alignment` and with `--after-batches 1` to get `vocab.spm` created; the training will stop after the first update.
2. Segment the data with the created vocabulary, e.g. `./build/spm_encode --model=vocab.spm < raw.src > prep.src`, and generate alignments from the segmented text with fast-align or an RNN model.

The same approach we use for preparing files for `--data-weighting` if we use the built-in SentencePiece.
Word alignments produced by `marian-decoder` are for segmented source and target sentences.
Thanks a lot for the details. Is there any reason to use marian (step 1) to get the sentencepiece vocabularies instead of directly calling `spm_train`?
No particular reason, `spm_train` can be used directly.
`marian-decoder` should receive unsegmented text as input (so that it performs the built-in sentencepiece segmentation), right?
If we should feed unsegmented text and the alignments are for segmented source and target, how can we convert subword alignment indexes to word (i.e. space-separated tokens) indexes?
This is not yet available in Marian. A possible solution is to write a script which segments input and output sentences with `spm_encode` and uses the segmented texts to convert subword-level alignments to word-level alignments.
The Marian vocabulary generation used to be a bit smarter than the previous SentencePiece code, it does reservoir sampling over both corpora files when the vocabulary name is the same for both languages. That makes sure that the distribution of samples is uniform over the entire corpora files.
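The sampling mentioned above can be illustrated with the classic reservoir sampling procedure (Algorithm R). This is a generic sketch of the technique, not Marian's actual implementation; it keeps a uniform random sample of `k` lines from a stream whose length is unknown in advance.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Uniformly sample k items from an iterable of unknown length.

    Each item in the stream ends up in the final sample with equal
    probability k/n, using O(k) memory (classic Algorithm R).
    """
    rng = random.Random(seed)
    sample = []
    for n, line in enumerate(lines):
        if n < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a random reservoir slot with probability k/(n+1).
            j = rng.randrange(n + 1)
            if j < k:
                sample[j] = line
    return sample
```

Because the replacement probability shrinks as the stream grows, the sample stays uniform over the whole file, which is what makes the resulting vocabulary representative of both corpora.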
I think newer SentencePiece versions do that now too. Our fork of SentencePiece has not been updated yet.
I guess we can close this one?
https://github.com/marian-nmt/marian-dev/issues/466#issuecomment-513204668 @snukky Hi, do you have an example of using data-weighting? I don't have an intuitive sense of how to assign weights. BTW, can you recommend some relevant papers about data-weighting? Thank you!
@520jefferson What are you trying to achieve?
@emjotde I try to add slots to the src and tgt parallel sentences (mainly replacing numbers, times, URLs, etc.) in preprocessing before training. In the decoding step, the numbers in a sentence are replaced by different slots before decoding; after decoding, the slots are replaced by their respective slot translation results. For example (I use ##NUM##flag to represent a slot, where flag can be 0,1,2,3...):

src: ##NUM##0 ##NUM##1 translation result: ##NUM##0 ##NUM##0 but I expect this: ##NUM##1 *** ##NUM##0

In some cases a slot in src can't be translated into the related slot in tgt in the decoding step, so I think maybe I can increase the weight of the slot feature.
The documentation about `--guided-alignment` lacks some details that I would like to confirm:

1. The text used to compute the alignments (e.g. with fast-align) should be tokenized according to the vocabulary files, right? That is, if BPE is used, the alignment should be computed over the BPE-segmented text, correct?
2. Can we use `--guided-alignment` with the built-in sentencepiece tokenization? How?
3. When requesting word alignments from `marian-decoder`, are the alignment indexes based on the internally segmented source tokens or on the space-separated tokens provided as input?