stanford-crfm / BioMedLM


Tokenizer does not have a padding token #4

Closed tomcobley closed 1 year ago

tomcobley commented 1 year ago

Hi, thanks for sharing your model!

I am trying to use it to generate embeddings for batches of text sequences of different lengths (Gene Ontology annotations). However, when I try to do this with Hugging Face Transformers, I get the following error at the tokenization stage.

Code:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/pubmed_gpt_tokenizer")
inputs = tokenizer(sequences, padding=True, return_tensors="pt")

Error:

Asking to pad but the tokenizer does not have a padding token. Please select a token to use as 
`pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via 
`tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

How should I resolve this?

Thanks!

tomcobley commented 1 year ago

It turns out this is a more general question, which has an explanation here: the choice of `pad_token` doesn't actually matter, since the attention mask causes the padded positions to be ignored.

Adding the line `tokenizer.pad_token = tokenizer.eos_token` removed the error and solved the problem :).
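For anyone wondering why the choice of pad token is irrelevant here, a minimal pure-Python sketch (no `transformers` required, names are illustrative): if you mean-pool hidden states over the attention mask, positions where the mask is 0 contribute nothing, no matter which token was used for padding.

```python
def masked_mean_pool(hidden, mask):
    """Mean-pool a list of hidden-state vectors, skipping positions
    whose attention-mask entry is 0 (i.e. padding)."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:  # only real (unmasked) tokens are accumulated
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# The same 2-token sequence padded with two different "pad" vectors:
real = [[1.0, 2.0], [3.0, 4.0]]
padded_a = real + [[9.0, 9.0]]    # pad position filled one way
padded_b = real + [[-5.0, 0.0]]   # pad position filled another way
mask = [1, 1, 0]                  # attention mask: last position is padding

# Both paddings yield the identical pooled embedding.
assert masked_mean_pool(padded_a, mask) == masked_mean_pool(padded_b, mask)
print(masked_mean_pool(padded_a, mask))  # → [2.0, 3.0]
```

The same reasoning applies inside the model itself: attention scores at masked positions are suppressed, so reusing `eos_token` as the pad token is safe for this use case.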