scorer (sub)word alignment

marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository

https://marian-nmt.github.io

Closed zouharvi closed 3 years ago

zouharvi commented 3 years ago

The output of the --alignment option seems to be an alignment on subword units rather than on the tokens themselves:

Hello there ||| Hallo da
0-0 1-1 2-2
Hello ||| HalloHalloHalloHallo
0-0 1-1 1-2 1-3 1-4 1-5 1-6 1-7

Is it possible to also retrieve the text segmentation, so that the word alignment can be aggregated from the "subword alignment"?

I am not an MT practitioner, so I may be misunderstanding some concepts.

snukky commented 3 years ago

Yes, --no-spm-decode will keep translations segmented into subwords.

kpu commented 3 years ago

He needs the source segmentation as well, yes?

This is literally what @jerinphilip is working on for Bergamot at https://github.com/browsermt/marian-dev/pull/11

zouharvi commented 3 years ago

--no-spm-decode is an option of the decoder, not the scorer. My goal is to get alignment scores for a given input and output.

> He needs the source segmentation as well, yes?

Yes, this would solve the issue.

zouharvi commented 3 years ago

Since the subword segmentation can be obtained by running SentencePiece separately, a solution exists, so I am closing this issue.

I'll keep an eye on what @jerinphilip is doing.
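
For reference, the aggregation discussed in this thread could look roughly like the sketch below. It is only an illustration under several assumptions that the thread does not confirm: that each alignment pair is `source-target` over subword positions, that the subword pieces follow the SentencePiece convention of marking word starts with "▁", and that any index past the last piece refers to an end-of-sentence token and can be dropped. The function names are hypothetical. The piece lists would come from running SentencePiece separately on both sides, for example with the Python wrapper's `SentencePieceProcessor.encode(text, out_type=str)`.

```python
from typing import List, Tuple

WORD_START = "\u2581"  # "▁", the SentencePiece word-start marker (assumed convention)


def piece_to_word_index(pieces: List[str]) -> List[int]:
    """Map each subword position to the index of the word it belongs to."""
    mapping = []
    word = -1
    for piece in pieces:
        # A new word begins at a word-start marker (or at the very first piece).
        if piece.startswith(WORD_START) or word < 0:
            word += 1
        mapping.append(word)
    return mapping


def aggregate_alignment(
    alignment: str, src_pieces: List[str], tgt_pieces: List[str]
) -> List[Tuple[int, int]]:
    """Collapse a subword alignment such as "0-0 1-1 2-2" to word-level links."""
    src_map = piece_to_word_index(src_pieces)
    tgt_map = piece_to_word_index(tgt_pieces)
    links = set()
    for pair in alignment.split():
        s, t = (int(x) for x in pair.split("-"))
        # Assumption: positions beyond the pieces are end-of-sentence tokens; skip them.
        if s < len(src_map) and t < len(tgt_map):
            links.add((src_map[s], tgt_map[t]))
    return sorted(links)


if __name__ == "__main__":
    # Hypothetical segmentation, not taken from an actual model.
    src = ["▁Hel", "lo", "▁there"]
    tgt = ["▁Hal", "lo", "▁da"]
    print(aggregate_alignment("0-0 1-1 2-2 3-3", src, tgt))
```

Two words are linked here as soon as any of their subwords are linked. A stricter rule (for example, a majority vote over subword links) may suit some use cases better.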