stanford-crfm / BioMedLM


I set tokenizer.pad_token = tokenizer.eos_token and found tokenizer.pad_token_id==None, which leads to an error. #28

Open dlutmlt opened 2 months ago

dlutmlt commented 2 months ago

python3.7/site-packages/transformers/tokenization_utils_base.py", line 2387, in _get_padding_truncation_strategies
    if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
TypeError: '<' not supported between instances of 'NoneType' and 'int'

When I debug the code, I find that the variable self.pad_token_id is None, which leads to the error. But the variable self.pad_token is "<|endoftext|>", which is correct in GPT-2 style. It seems that there is no "<|endoftext|>" entry in the vocab.json file. So I want to know how BioMedLM controls the stopping of generation.
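A minimal sketch of the mismatch being described, with no transformers dependency: in Hugging Face tokenizers, `pad_token_id` is derived by looking the token string up in the vocab, so if `"<|endoftext|>"` is missing from vocab.json, assigning `tokenizer.pad_token = tokenizer.eos_token` sets the string but leaves the id as None. The `ToyTokenizer` class and its vocab below are hypothetical stand-ins, not the actual transformers implementation.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a HF-style tokenizer's pad/eos handling."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.pad_token = None
        self.eos_token = "<|endoftext|>"

    @property
    def pad_token_id(self):
        # Mirrors how the id is derived from the token string via the vocab:
        # a token missing from the vocab yields None, not an integer.
        return self.vocab.get(self.pad_token)


# Hypothetical vocab that, like the reported vocab.json, lacks "<|endoftext|>".
tok = ToyTokenizer({"hello": 0, "world": 1})
tok.pad_token = tok.eos_token  # the assignment from the issue title

print(tok.pad_token)     # "<|endoftext|>" -- the string looks correct
print(tok.pad_token_id)  # None -- so `self.pad_token_id < 0` raises TypeError
```

Under this reading, the `TypeError` in the traceback is a downstream symptom: the comparison `self.pad_token_id < 0` assumes the id lookup succeeded, and `None < 0` is not defined in Python 3.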