huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Tokenizers: setting bos_token_id = 0 and adding language_pair_codes #3564

Closed sshleifer closed 4 years ago

sshleifer commented 4 years ago

I am unable to set bos_token_id=0 for a new SentencePiece tokenizer (MBART). Here is what I'm doing:

# download the sentencepiece model (shell):
wget https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model

# then, in Python:
from transformers import T5Tokenizer
vocab_file = 'sentence.bpe.model'
t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0)
t2.bos_token_id  # => 1, not the requested 0

The following also returns 1

t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0,
                 additional_special_tokens=['<s>'])
t2.bos_token_id

Help much appreciated!

thomwolf commented 4 years ago

You can't set the ids; they are set automatically from the sentencepiece model. But (1) why are you using the T5Tokenizer for a BART checkpoint, and (2) why do you want to tweak the id?
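
A minimal sketch of what that means in practice (reusing the sentence.bpe.model from your snippet; the exact id printed depends on that model):

from transformers import T5Tokenizer

tok = T5Tokenizer('sentence.bpe.model', bos_token='<s>')
# whatever id the sentencepiece vocabulary assigns to '<s>' is what you get;
# a bos_token_id keyword argument is effectively ignored
print(tok.convert_tokens_to_ids('<s>'))
print(tok.bos_token_id)  # same value, looked up from the vocab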

sshleifer commented 4 years ago

(1) I used the T5Tokenizer in order to make a runnable example that did not require checking out my mbart branch.

(2) Fairseq's MBART logic is split into two stages: the sentencepiece model first splits the raw text into subword pieces, and fairseq's dictionary then maps those pieces to ids.

I'm trying to do that in one step, using sp_model.encode_as_ids, but my ids are off by 1, because the special token ids (sp_model.bos_id(), etc.) are different from the ones in fairseq's dictionary object:

[screenshot: the sp_model special token ids compared with fairseq's dictionary ids]

So I need to either manipulate the sp_model, retrain it with correct control codes, or try a different approach.
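
Concretely, the mismatch looks something like this (a sketch, assuming the .bpe.model keeps sentencepiece's default control symbols):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('sentence.bpe.model')

# sentencepiece defaults: <unk>=0, <s>=1, </s>=2 (and no <pad>), whereas
# fairseq's Dictionary reserves <s>=0, <pad>=1, </s>=2, <unk>=3, so every
# regular piece id ends up shifted by one
print(sp.piece_to_id('<s>'), sp.piece_to_id('</s>'), sp.piece_to_id('<unk>'))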

thomwolf commented 4 years ago

Yes, you can check how we handle this token index offset stuff (it's specific to fairseq + sentencepiece) in the Camembert and XLMRoberta tokenizers.
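
Condensed, the pattern there is roughly the following (a sketch of the idea, not the verbatim library source): hard-code the four ids fairseq reserves and shift every other sentencepiece id by a fixed offset.

# rough sketch of the Camembert/XLM-R offset handling, not the exact source
fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
fairseq_offset = 1  # regular sentencepiece ids are shifted by one

def convert_token_to_id(token, sp_model):
    if token in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[token]
    return sp_model.piece_to_id(token) + fairseq_offset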

sshleifer commented 4 years ago

Extremely helpful! MBart also adds a language code like en_XX or ro_RO to the end of the source and target sentences, so the sentences look like [tokens] + [<eos>, <language_id>].

Do we have any tokenizers that do that?
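
For illustration, the suffix itself is easy to tack on; the helper and the lang_code_to_id mapping below are made up for the sketch, not existing API:

# hypothetical helper: append [</s>, <language_id>] after the regular tokens;
# lang_code_to_id is an assumed mapping from codes like 'en_XX' to vocab ids
def add_language_suffix(token_ids, lang, eos_id, lang_code_to_id):
    return token_ids + [eos_id, lang_code_to_id[lang]]

The awkward part is that the code to append depends on the language of each side of the pair.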

sshleifer commented 4 years ago

I can't find an easy way to generate examples like

input_ids = [src_tokens]+[<eos>, <src_language_id>]
decoder_input_ids = [tgt_tokens]+[<eos>, <tgt_language_id>]

where the special tokens depend on the language.

My best idea is to add a method:

def prepare_language_pair_batch(self, source_sentences, source_lang,
                                target_sentences=None, target_lang=None):
    # encode the source sentences; if target_sentences is None ignore them,
    # otherwise encode them the same way (the names below are placeholders)
    return dict(input_ids=encoded_source_ids, attention_mask=attention_mask,
                decoder_input_ids=processed_target)

(Could also override prepare_inputs_for_model and add arguments.)
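
Fleshed out a bit, that could look like the following (purely a sketch: the standalone function, the padding-free lists it returns, and treating the language code as a regular token are all assumptions, not existing transformers API):

# hypothetical sketch only; none of this is existing transformers API
def prepare_language_pair_batch(tokenizer, source_sentences, source_lang,
                                target_sentences=None, target_lang=None):
    def encode(sentence, lang):
        # plain subword ids plus [</s>, <language_id>]; no padding here
        ids = tokenizer.encode(sentence, add_special_tokens=False)
        return ids + [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(lang)]

    input_ids = [encode(s, source_lang) for s in source_sentences]
    batch = {'input_ids': input_ids,
             'attention_mask': [[1] * len(ids) for ids in input_ids]}
    if target_sentences is not None:
        batch['decoder_input_ids'] = [encode(t, target_lang) for t in target_sentences]
    return batch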

Two other ideas that don't quite work:

We could also instantiate two tokenizers with different special tokens, but that feels wasteful.

@LysandreJik @patrickvonplaten

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

kellymarchisio commented 2 years ago

Yes, you can check how we handle this token index offset stuff (it's specific to fairseq + sentencepiece) in the Camembert and XLMRoberta tokenizers.

For posterity, I think Thomas means this:

https://huggingface.co/transformers/v4.6.0/_modules/transformers/models/camembert/tokenization_camembert.html
https://huggingface.co/transformers/v3.5.1/_modules/transformers/tokenization_xlm_roberta.html