paperswithcode / galai

Model API for GALACTICA
Apache License 2.0

questions about tokenizer #79

Closed nickyoungforu closed 1 year ago

nickyoungforu commented 1 year ago

Hi, I ran the sample code:

```python
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```

The resulting `input_ids` are `tensor([[ 592, 23121, 5219, 243, 4]])`, but the token with id 23121 in `tokenizer.json` is `ĠTransformer`, not `Transformer`.
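For reference, the mapping can also be checked directly through the tokenizer instead of reading `tokenizer.json`; a minimal sketch, assuming the same checkpoint as above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")

ids = [592, 23121, 5219, 243, 4]  # ids produced by the snippet above
print(tokenizer.convert_ids_to_tokens(ids))
# id 23121 should come back as 'ĠTransformer'; the 'Ġ' prefix is the
# byte-level BPE marker for a preceding space, not part of the word itself.
```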

Also, why is there no need to add the start token `<s>` at the beginning and the end token `</s>` at the end?

mkardas commented 1 year ago

Hi, have a look at https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2 and check out Introduction to GALACTICA Models, especially the "New document mode" section.
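To make the two points from those links concrete, here is a minimal sketch (assuming the same `facebook/galactica-6.7b` checkpoint as in the issue): the `Ġ` prefix is how byte-level BPE stores the space in front of a word, so decoding round-trips the original text, and whether `<s>`/`</s>` are added automatically can be inspected from the tokenizer itself rather than assumed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")

# 'Ġ' encodes the space before a word, so decoding restores the text as typed.
ids = tokenizer("The Transformer architecture [START_REF]").input_ids
print(tokenizer.decode(ids))  # should round-trip to the original string, spaces included

# Whether <s>/</s> are inserted automatically is part of the tokenizer
# configuration and can be checked directly:
print(tokenizer.special_tokens_map)
print(tokenizer("test").input_ids)                            # default behaviour
print(tokenizer("test", add_special_tokens=False).input_ids)  # explicitly no special tokens
```

The special-token behaviour is defined by the checkpoint's tokenizer configuration, so checking it this way is more reliable than carrying over assumptions from other models' defaults.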