You can't set the ids, they are set automatically from the SentencePiece model. But (1) why are you using the T5Tokenizer for a Bart checkpoint, and (2) why do you want to tweak the id?
(1) I used the T5Tokenizer in order to make a runnable example that did not require checking out my mbart branch.
(2) Fairseq's MBART logic is split into two stages:
1. spm_encode --model sentence.bpe.model to preprocess (this is like encode_as_pieces in python).
2. A vocab.json-style lookup to convert each token to an ID.
I'm trying to do that in one step, using sp_model.encode_as_ids, but my ids are off by 1, because the special tokens (sp_model.bos_token, etc.) are different than fairseq's dictionary object.
So I need to either manipulate the sp_model, retrain it with correct control codes, or try a different approach.
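To make the off-by-one concrete, here is a minimal sketch of collapsing the two stages into one call (the model path and example sentence are placeholders, not the actual mbart artifacts):

import sentencepiece as spm

# "sentence.bpe.model" is a placeholder path for the mbart sentencepiece model
sp_model = spm.SentencePieceProcessor()
sp_model.Load("sentence.bpe.model")

text = "UN Chief Says There Is No Military Solution"  # placeholder sentence

# stage 1 (spm_encode): split the text into subword pieces
pieces = sp_model.encode_as_pieces(text)
# stages 1+2 in one call: pieces straight to sentencepiece's own ids
ids = sp_model.encode_as_ids(text)

# fairseq's Dictionary reserves ids 0-3 for <s>, <pad>, </s>, <unk> and then appends
# the sentencepiece vocab, while sentencepiece itself uses <unk>=0, <s>=1, </s>=2,
# so the ids printed here come out shifted relative to fairseq's.
print(list(zip(pieces, ids)))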
Yes, you can check how we do this token-index-offset stuff (it’s specific to fairseq + sentencepiece) in the Camembert and XLMRoberta tokenizers.
Extremely helpful! Mbart also adds a language code like en_XX and ro_RO to the end of the source and target sentences. So the sentences are like [tokens]+[<eos>, <language_id>]
Do we have any tokenizers that do that?
I can't find an easy way to generate examples like
input_ids = [src_tokens]+[<eos>, <src_language_id>]
decoder_input_ids = [tgt_tokens]+[<eos>, <tgt_language_id>]
where the special tokens depend on the language.
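For concreteness, the shape of batch I'm after (all ids below are made up, just to show the layout):

# made-up ids, only to show the layout [tokens] + [</s>, <language id>]
eos_id = 2
lang_code_to_id = {"en_XX": 250004, "ro_RO": 250020}  # placeholder values

src_token_ids = [8274, 127873, 25916]  # pretend sentencepiece output for the source
tgt_token_ids = [36470, 48, 821]       # pretend sentencepiece output for the target

input_ids = src_token_ids + [eos_id, lang_code_to_id["en_XX"]]
decoder_input_ids = tgt_token_ids + [eos_id, lang_code_to_id["ro_RO"]]
print(input_ids)
print(decoder_input_ids)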
My best idea is to add a method
def prepare_language_pair_batch(self, source_sentences, source_lang, target_sentences=None, target_lang=None):
    # encode the source sentences
    # if target_sentences is None, ignore them; otherwise process them as well
    return {"input_ids": encoded_source_ids, "attention_mask": attention_mask, "decoder_input_ids": processed_target}
(Could also overwrite prepare_inputs_for_model and add arguments.)
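Roughly, a standalone sketch of what such a method could do, assuming a loaded sentencepiece model and a language-code-to-id mapping are passed in (both names are mine, not existing API):

def prepare_language_pair_batch(sp_model, lang_code_to_id, eos_id,
                                source_sentences, source_lang,
                                target_sentences=None, target_lang=None):
    # encode each source sentence and append [</s>, <source language id>]
    input_ids = [sp_model.encode_as_ids(s) + [eos_id, lang_code_to_id[source_lang]]
                 for s in source_sentences]
    batch = {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }
    # only build the decoder side when target sentences are given
    if target_sentences is not None:
        batch["decoder_input_ids"] = [
            sp_model.encode_as_ids(t) + [eos_id, lang_code_to_id[target_lang]]
            for t in target_sentences
        ]
    return batch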
Two other ideas that don't quite work:
(1) prepare_text_for_tokenization: the problem is that this would go before EOS.
(2) build_inputs_with_special_tokens: the problem is that you still can't use prepare_for_model, because it doesn't pass kwargs to build_inputs_with_special_tokens.
We could also instantiate two tokenizers with different special tokens, but that feels wasteful.
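For reference, the override in idea (2) would have to look roughly like this for a language-dependent suffix to work (hypothetical: lang_code_to_id is an assumed attribute, and prepare_for_model would still need to forward the lang kwarg):

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None, lang="en_XX"):
    # would live on an mbart-style tokenizer subclass; appends [</s>, <language id>]
    suffix = [self.eos_token_id, self.lang_code_to_id[lang]]
    if token_ids_1 is None:
        return token_ids_0 + suffix
    return token_ids_0 + suffix + token_ids_1 + suffix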
@LysandreJik @patrickvonplaten
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Yes, you can check how we do this token-index-offset stuff (it’s specific to fairseq + sentencepiece) in the Camembert and XLMRoberta tokenizers.
For posterity, I think Thomas means this:
https://huggingface.co/transformers/v4.6.0/_modules/transformers/models/camembert/tokenization_camembert.html
https://huggingface.co/transformers/v3.5.1/_modules/transformers/tokenization_xlm_roberta.html
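Roughly, the pattern those two tokenizers use, condensed into a few lines (a paraphrase, not the exact code):

# fairseq's Dictionary puts <s>, <pad>, </s>, <unk> at ids 0-3 and then appends the
# sentencepiece vocab, whereas sentencepiece itself reserves only <unk>=0, <s>=1, </s>=2.
fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
fairseq_offset = 1  # 4 reserved fairseq ids minus the 3 that sentencepiece reserves

def convert_token_to_id(sp_model, token):
    if token in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[token]
    spm_id = sp_model.piece_to_id(token)
    # sentencepiece returns 0 (its own <unk>) for unknown pieces; map that to fairseq's <unk>
    return spm_id + fairseq_offset if spm_id else fairseq_tokens_to_ids["<unk>"]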
I am unable to set bos_token_id=0 for a new SentencePiece tokenizer (MBART). Here is what I'm doing:
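For illustration, a minimal sketch of this kind of attempt (the model path is a placeholder):

import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("my_spm.model")  # placeholder path to the newly trained sentencepiece model
print(sp_model.bos_id())       # hoping for 0, but this prints 1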
The following also returns 1:
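(Again a sketch of the kind of check that gives 1:)

print(sp_model.piece_to_id("<s>"))  # also 1: sentencepiece reserves id 0 for <unk> by default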
Help much appreciated!