marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Marian's Sentencepiece Not Passing Case Encoding Command Through Marian #420

Open Kiryukhasemenov opened 5 months ago

Kiryukhasemenov commented 5 months ago

Summary:

I am trying to reproduce the new feature of your sentencepiece version presented in the paper. It works when I run your sentencepiece on its own, but not within Marian's built-in sentencepiece pipeline: the parameters appear to be accepted by Marian but lost on the way to sentencepiece.

Bug description

I was running Marian training with the built-in sentencepiece vocabulary.

In the training configuration, I put the following parameters into the sentencepiece options:

sentencepiece-options: "--treat_whitespace_as_suffix --encode_unicode_case --remove_extra_whitespaces=false --encode_case --decode_case --character_coverage=0.988"

All the parameters were detected by Marian (see stdout.txt):

[2024-03-06 00:23:24] [config] sentencepiece-options: --treat_whitespace_as_suffix --encode_unicode_case

However, when sentencepiece is invoked, these parameters seem to be lost:

  encode_case: 0
  decode_case: 0
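
To see which of the configured options are absent from the line Marian echoed in its log, the two strings can be compared flag by flag. The following is only a minimal sketch: `flag_names` is a hypothetical helper, not part of Marian or sentencepiece, and the `logged` string is copied from the log line as quoted above, which may simply have been truncated in this report.

```python
import shlex


def flag_names(options):
    """Return the set of flag names (values stripped) in an option string.

    Hypothetical helper for illustration; not a Marian or sentencepiece API.
    """
    names = set()
    for token in shlex.split(options):
        if token.startswith("--"):
            # "--character_coverage=0.988" -> "--character_coverage"
            names.add(token.split("=", 1)[0])
    return names


# Option string from the training configuration quoted above.
configured = (
    "--treat_whitespace_as_suffix --encode_unicode_case "
    "--remove_extra_whitespaces=false --encode_case --decode_case "
    "--character_coverage=0.988"
)
# Option string as it appears in the quoted [config] log line.
logged = "--treat_whitespace_as_suffix --encode_unicode_case"

missing = sorted(flag_names(configured) - flag_names(logged))
print(missing)
```

Any flag listed in `missing` is one that was configured but does not appear in the quoted log line; whether it was actually dropped or only cut off in the paste would need to be checked against the full stdout.txt.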

Additional notes:

  1. I tried passing other parameters through the sentencepiece options (such as --character_coverage), as well as explicit True values for the --treat_whitespace_as_suffix and --encode_unicode_case parameters. I also tried various orderings of these parameters. Every attempt produced the same result.
  2. I tried installing Marian's sentencepiece separately and running it with this command:
    run spm_train --encode_unicode_case --treat_whitespace_as_suffix --input csuk_toy1M.txt --model_prefix case_encoded

    and it worked, which was also reflected in the log:

    normalizer_spec {
    ...
    encode_case: 1
    decode_case: 0
    }
    denormalizer_spec {
    ...
    encode_case: 0
    decode_case: 1
    }
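
The same check can be automated when comparing the standalone run with the run through Marian. The sketch below assumes the log prints the spec blocks in the `name { key: value }` layout shown above; `parse_case_flags` is a hypothetical helper written for this illustration, and the sample text is just the excerpt quoted in this issue.

```python
import re


def parse_case_flags(log_text):
    """Map each spec block name to its encode_case/decode_case values.

    Hypothetical helper; assumes blocks look like 'name {' ... '}'.
    """
    result = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        opened = re.match(r"(\w+)\s*\{$", line)
        if opened:
            current = opened.group(1)
            result[current] = {}
        elif line == "}":
            current = None
        elif current is not None:
            # Only the two case-related fields are collected; others are skipped.
            field = re.match(r"(encode_case|decode_case):\s*(\d+)$", line)
            if field:
                result[current][field.group(1)] = int(field.group(2))
    return result


# Excerpt of the standalone spm_train log quoted in this issue.
sample = """\
normalizer_spec {
...
encode_case: 1
decode_case: 0
}
denormalizer_spec {
...
encode_case: 0
decode_case: 1
}
"""

flags = parse_case_flags(sample)
print(flags)
```

Running the same function over the log produced through Marian and comparing the two dictionaries would show exactly which fields differ between the two invocations.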

Context

I would appreciate any help!

snukky commented 4 months ago

Thanks for reporting. Cc @rjai