xenova closed this issue 1 year ago
I would like to work on fixing this, @xenova!
@ArthurZucker I need some guidance here! I suppose this is not as simple as a regex replacement, right? Should I contact the Helsinki-NLP team members about this, or do you think there is a programmatic way to solve it?
Hey! Sure: this could be a fast issue (meaning try use_fast = False and check the outputs as well). Then look at convert_slow_tokenizer.py in transformers to see the conversion. That is where you will find whether add_prefix_space was used or not. Also check the normalizers, post_processors and decoders! Cheers!
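For instance, a rough sketch of that kind of check, comparing slow and fast outputs and inspecting the fast backend's pipeline components (the model and the exact attributes used below are illustrative assumptions, not taken from this thread):

from transformers import AutoTokenizer

# Compare slow vs. fast decoding for a sentencepiece-based checkpoint (illustrative model).
slow = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
fast = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

ids = slow("hello world")["input_ids"]
print(slow.decode(ids))
print(fast.decode(ids))

# The fast tokenizer wraps a tokenizers.Tokenizer object whose components can be inspected.
print(fast.backend_tokenizer.normalizer)
print(fast.backend_tokenizer.decoder)
print(fast.backend_tokenizer.post_processor)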
The LlamaTokenizer removes it, for example. You can also search the codebase for SPIECE_UNDERLINE and see that in each case it is removed when decoding. This is not present for the MarianTokenizer (which is what these models use).

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-es', use_fast=False)
tokenizer.decode(tokenizer("hello world")['input_ids'])
# outputs the same: '▁hello▁world</s>'
Thanks a lot, @xenova and @ArthurZucker, for your comments! From what I understand, I need to change MarianTokenizer so that it removes the metaspace character and then re-convert the Helsinki-NLP/opus-* models. Please correct me if I am wrong!
Regarding "and then re-convert the Helsinki-NLP/opus-* models": you shouldn't need to re-convert any models. The vocab.json, merges.txt, and tokenizer_config.json will all stay the same as well. All you should need to do is update MarianTokenizer to replace the ▁ with spaces when decoding.
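As a minimal, standalone sketch of the string-level effect of that replacement (the SPIECE_UNDERLINE value and the example string come from this thread; the helper name is just illustrative):

SPIECE_UNDERLINE = "▁"

def strip_metaspace(text: str) -> str:
    # Replace the sentencepiece metaspace marker with a normal space, then trim.
    return text.replace(SPIECE_UNDERLINE, " ").strip()

print(strip_metaspace("▁hello▁world"))  # -> 'hello world'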
Got it, thanks @xenova! I used the same logic as LlamaTokenizer, but now instead of ▁hello▁world as output I get hello▁world, which is still wrong. Should I use string replacement or a regex to remove the metaspace character instead?
You could probably just do something similar to this, but here. e.g.,
return out_string.strip()
→ return out_string.replace(SPIECE_UNDERLINE, " ").strip()
@ArthurZucker Is this good practice for sentencepiece tokenizers? From what I can tell, sp_model.decode_pieces is not used very often, so this decode block might be quite outdated itself.
Thanks for the comment, @xenova! I did the following in my PR; is it acceptable too?
def convert_tokens_to_string(self, tokens: List[str]) -> str:
    """Uses source spm if _decode_use_source_tokenizer is True, and target spm otherwise"""
    if tokens[0].startswith(SPIECE_UNDERLINE):
        tokens[0] = tokens[0][1:]
    # Other code in between
    out_string += sp_model.decode_pieces(current_sub_tokens)
    out_string = out_string.replace(SPIECE_UNDERLINE, " ")
    return out_string.strip()
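For reference, a hedged usage sketch of what decoding should look like once a fix along these lines lands (model name taken from earlier in the thread; the output assumes the patch behaves as the issue expects):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-es', use_fast=False)
print(tokenizer.decode(tokenizer("hello world")['input_ids']))
# expected after the fix: 'hello world</s>'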
System Info
transformers version: 4.34.0.dev0

Who can help?
@ArthurZucker
Reproduction
Running the reproduction snippets produces ▁hello▁world</s> and ▁hello▁world, respectively (see the sketch below).
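A minimal reproduction sketch, assuming the Helsinki-NLP/opus-mt-en-es checkpoint discussed above; decode and convert_tokens_to_string are plausible stand-ins for the calls in the original report:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# Decoding keeps the metaspace character and the EOS token:
print(tokenizer.decode(tokenizer("hello world")["input_ids"]))
# reported: '▁hello▁world</s>'

# Converting the tokens straight back to a string also keeps the metaspace character:
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize("hello world")))
# reported: '▁hello▁world'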
Expected behavior
The metaspace character (▁) should be removed, and the returned strings should be hello world</s> and hello world, respectively. This should be similar to other sentencepiece-based tokenizers, which produce hello world.
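For comparison, a hedged sketch using another sentencepiece-based tokenizer that does strip the metaspace (t5-small is an illustrative choice, not necessarily the tokenizer the issue compared against):

from transformers import AutoTokenizer

t5_tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
print(t5_tokenizer.decode(t5_tokenizer("hello world")["input_ids"], skip_special_tokens=True))
# -> 'hello world'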