huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Tokenizers: setting bos_token_id = 0 and adding language_pair_codes #3564

Closed sshleifer closed 4 years ago

sshleifer commented 4 years ago

I am unable to set bos_token_id=0 for a new SentencePiece tokenizer (MBART). Here is what I'm doing:

# download the sentencepiece model (shell):
wget https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model

# then, in Python:
from transformers import T5Tokenizer
vocab_file = 'sentence.bpe.model'
t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0)
t2.bos_token_id  # => 1, not the requested 0

The following also returns 1

t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0,
                 additional_special_tokens=['<s>'])
t2.bos_token_id

Help much appreciated!

thomwolf commented 4 years ago

You can't set the ids; they are set automatically from the sentencepiece model. But (1) why are you using the T5Tokenizer for a BART checkpoint, and (2) why do you want to tweak the id?
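
A minimal sketch of what that means in practice (reusing the sentence.bpe.model from your snippet; the exact id printed depends on that model):

from transformers import T5Tokenizer

tok = T5Tokenizer('sentence.bpe.model', bos_token='<s>')
# whatever id the sentencepiece vocabulary assigns to '<s>' is what you get;
# a bos_token_id keyword argument is effectively ignored
print(tok.convert_tokens_to_ids('<s>'))
print(tok.bos_token_id)  # same value, looked up from the vocab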

sshleifer commented 4 years ago

(1) I used the T5Tokenizer in order to make a runnable example that did not require checking out my mbart branch.

(2) Fairseq's MBART logic is split into two stages: the sentencepiece model first splits the raw text into subword pieces, and fairseq's dictionary then maps those pieces to ids.

I'm trying to do that in one step, using sp_model.encode_as_ids, but my ids are off by 1, because the special token ids (sp_model.bos_id(), etc.) are different from the ones in fairseq's dictionary object:

[screenshot: the sp_model special token ids compared with fairseq's dictionary ids]

So I need to either manipulate the sp_model, retrain it with correct control codes, or try a different approach.
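
Concretely, the mismatch looks something like this (a sketch, assuming the .bpe.model keeps sentencepiece's default control symbols):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('sentence.bpe.model')

# sentencepiece defaults: <unk>=0, <s>=1, </s>=2 (and no <pad>), whereas
# fairseq's Dictionary reserves <s>=0, <pad>=1, </s>=2, <unk>=3, so every
# regular piece id ends up shifted by one
print(sp.piece_to_id('<s>'), sp.piece_to_id('</s>'), sp.piece_to_id('<unk>'))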

thomwolf commented 4 years ago

Yes, you can check how we handle this token index offset stuff (it's specific to fairseq + sentencepiece) in the Camembert and XLMRoberta tokenizers.
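
Condensed, the pattern there is roughly the following (a sketch of the idea, not the verbatim library source): hard-code the four ids fairseq reserves and shift every other sentencepiece id by a fixed offset.

# rough sketch of the Camembert/XLM-R offset handling, not the exact source
fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
fairseq_offset = 1  # regular sentencepiece ids are shifted by one

def convert_token_to_id(token, sp_model):
    if token in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[token]
    return sp_model.piece_to_id(token) + fairseq_offset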

sshleifer commented 4 years ago

Extremely helpful! MBart also adds a language code like en_XX or ro_RO to the end of the source and target sentences, so the sentences look like [tokens] + [<eos>, <language_id>].

Do we have any tokenizers that do that?
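
For illustration, the suffix itself is easy to tack on; the helper and the lang_code_to_id mapping below are made up for the sketch, not existing API:

# hypothetical helper: append [</s>, <language_id>] after the regular tokens;
# lang_code_to_id is an assumed mapping from codes like 'en_XX' to vocab ids
def add_language_suffix(token_ids, lang, eos_id, lang_code_to_id):
    return token_ids + [eos_id, lang_code_to_id[lang]]

The awkward part is that the code to append depends on the language of each side of the pair.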

sshleifer commented 4 years ago

I can't find an easy way to generate examples like

input_ids = [src_tokens]+[<eos>, <src_language_id>]
decoder_input_ids = [tgt_tokens]+[<eos>, <tgt_language_id>]

where the special tokens depend on the language.

My best idea is to add a method:

def prepare_language_pair_batch(self, source_sentences, source_lang,
                                target_sentences=None, target_lang=None):
    # encode the source sentences; if target_sentences is None ignore them,
    # otherwise encode them the same way (the names below are placeholders)
    return dict(input_ids=encoded_source_ids, attention_mask=attention_mask,
                decoder_input_ids=processed_target)

(Could also override prepare_inputs_for_model and add arguments.)
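
Fleshed out a bit, that could look like the following (purely a sketch: the standalone function, the padding-free lists it returns, and treating the language code as a regular token are all assumptions, not existing transformers API):

# hypothetical sketch only; none of this is existing transformers API
def prepare_language_pair_batch(tokenizer, source_sentences, source_lang,
                                target_sentences=None, target_lang=None):
    def encode(sentence, lang):
        # plain subword ids plus [</s>, <language_id>]; no padding here
        ids = tokenizer.encode(sentence, add_special_tokens=False)
        return ids + [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(lang)]

    input_ids = [encode(s, source_lang) for s in source_sentences]
    batch = {'input_ids': input_ids,
             'attention_mask': [[1] * len(ids) for ids in input_ids]}
    if target_sentences is not None:
        batch['decoder_input_ids'] = [encode(t, target_lang) for t in target_sentences]
    return batch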

Two other ideas that don't quite work:

We could also instantiate two tokenizers with different special tokens, but that feels wasteful.

@LysandreJik @patrickvonplaten

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

kellymarchisio commented 2 years ago

Yes, you can check how we handle this token index offset stuff (it's specific to fairseq + sentencepiece) in the Camembert and XLMRoberta tokenizers.

For posterity, I think Thomas means this:

https://huggingface.co/transformers/v4.6.0/_modules/transformers/models/camembert/tokenization_camembert.html
https://huggingface.co/transformers/v3.5.1/_modules/transformers/tokenization_xlm_roberta.html