paperswithcode / galai

Model API for GALACTICA
Apache License 2.0

questions about tokenizer #79

Closed nickyoungforu closed 1 year ago

nickyoungforu commented 1 year ago

Hi, I ran the sample code:

```python
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```

The resulting `input_ids` are `tensor([[ 592, 23121, 5219, 243, 4]])`, but the token with id 23121 in `tokenizer.json` is `ĠTransformer`, not `Transformer`.
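For reference, the mapping can also be checked directly through the tokenizer instead of reading `tokenizer.json`; a minimal sketch, assuming the same checkpoint as above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")

ids = [592, 23121, 5219, 243, 4]  # ids produced by the snippet above
print(tokenizer.convert_ids_to_tokens(ids))
# id 23121 should come back as 'ĠTransformer'; the 'Ġ' prefix is the
# byte-level BPE marker for a preceding space, not part of the word itself.
```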

Also, why is there no need to add the start token `<s>` at the beginning and the end token `</s>` at the end?

mkardas commented 1 year ago

Hi, have a look at https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2 and check out Introduction to GALACTICA Models, especially the "New document mode" section.
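To make the two points from those links concrete, here is a minimal sketch (assuming the same `facebook/galactica-6.7b` checkpoint as in the issue): the `Ġ` prefix is how byte-level BPE stores the space in front of a word, so decoding round-trips the original text, and whether `<s>`/`</s>` are added automatically can be inspected from the tokenizer itself rather than assumed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")

# 'Ġ' encodes the space before a word, so decoding restores the text as typed.
ids = tokenizer("The Transformer architecture [START_REF]").input_ids
print(tokenizer.decode(ids))  # should round-trip to the original string, spaces included

# Whether <s>/</s> are inserted automatically is part of the tokenizer
# configuration and can be checked directly:
print(tokenizer.special_tokens_map)
print(tokenizer("test").input_ids)                            # default behaviour
print(tokenizer("test", add_special_tokens=False).input_ids)  # explicitly no special tokens
```

The special-token behaviour is defined by the checkpoint's tokenizer configuration, so checking it this way is more reliable than carrying over assumptions from other models' defaults.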