Closed zouharvi closed 3 years ago
Yes, `--no-spm-decode` will keep translations segmented into subwords.
He needs the source segmentation as well, yes?
This is literally what @jerinphilip is working on for Bergamot at https://github.com/browsermt/marian-dev/pull/11
`--no-spm-decode` is part of the decoder, not the scorer. My goal is to get alignment scores for a given input and output.
> He needs the source segmentation as well, yes?
Yes, this would solve the issue.
Since it's possible to get the subword segmentation by running sentencepiece separately, a solution exists, so I am closing this issue.
I'll keep an eye on what @jerinphilip is doing.
The output of the `--alignment` option seems to be an alignment on subword units rather than on the tokens themselves.
Is it possible to also retrieve the text segmentation so that the word alignment can be aggregated from the "subword alignment"?
I am not an MT practitioner, so I may be misunderstanding some concepts.
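For reference, here is a minimal sketch of the aggregation step, not anything Marian provides itself. It assumes the subword pieces for both sides are already available (for example from running sentencepiece separately), that word starts are marked with the SentencePiece `▁` prefix, and that the alignment is a string of `src-tgt` subword index pairs such as `0-0 1-1`. The function names `piece_to_word` and `aggregate` are hypothetical, and pairs pointing past the last piece (such as an end-of-sentence position) are simply dropped.

```python
WORD_START = "\u2581"  # SentencePiece marks the start of a word with "▁"


def piece_to_word(pieces):
    """Map each subword index to the index of the word it belongs to."""
    mapping = []
    word = -1
    for piece in pieces:
        # A new word begins at a "▁"-prefixed piece (or at the very first piece).
        if piece.startswith(WORD_START) or word < 0:
            word += 1
        mapping.append(word)
    return mapping


def aggregate(alignment, src_pieces, tgt_pieces):
    """Collapse subword-level 'src-tgt' links into unique word-level links."""
    src_map = piece_to_word(src_pieces)
    tgt_map = piece_to_word(tgt_pieces)
    links = set()
    for pair in alignment.split():
        s, t = (int(i) for i in pair.split("-"))
        # Skip indices beyond the pieces we know about (e.g. an EOS position).
        if s < len(src_map) and t < len(tgt_map):
            links.add((src_map[s], tgt_map[t]))
    return sorted(links)


if __name__ == "__main__":
    src = ["▁un", "believ", "able", "▁story"]
    tgt = ["▁un", "glaub", "liche", "▁Geschichte"]
    print(aggregate("0-0 1-1 2-2 3-3 4-4", src, tgt))  # [(0, 0), (1, 1)]
```

A word pair counts as aligned here if any of its subwords are aligned. Other policies, such as a majority vote over subword links, would fit in the same place.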