Closed EgorBu closed 10 months ago
Thanks! Unfortunately, I can't seem to reproduce this:
If you have other scenarios where this happens, let us know!
Thanks a lot! Can you suggest how to debug this problem, @slundberg? It looks like a super basic and useful feature, so I'm surprised that I'm running into issues (and it used to work in previous versions, before the major updates).
I checked the Hugging Face page for the phi-1.5 model and found a related issue. Some comments from there:
Size of tokenizer vocab is 50257, while size of vocab in config is 51200.
...
Hi there,
I understand that it works fine as long as tokenizer.vocab_size <= model.layers[0].wte.weight.shape[0], but it seems that the number 50257 is actually incorrect.
When you count unique indices in the vocabulary, including added_tokens, the correct number appears to be 50295 instead.
I am not knowledgeable about how this attribute is configured when initializing the tokenizer, but this issue may need to be fixed because sometimes we want to access the value through this attribute (tokenizer.vocab_size).
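To illustrate the mismatch described above without downloading the model, here is a minimal sketch. The `FakeTokenizer` class is a hypothetical stand-in, not the real transformers code, but it mirrors the relevant behavior: `vocab_size` reports only the base vocabulary, while `len(tokenizer)` also counts entries from `added_tokens` (the 38-token count is inferred from the numbers quoted in the comment).

```python
# Sketch of why tokenizer.vocab_size can undercount: in transformers,
# vocab_size covers only the base vocabulary, while len(tokenizer)
# also includes tokens registered via added_tokens.
class FakeTokenizer:
    """Hypothetical stand-in whose attribute names mirror the real API."""
    def __init__(self, base_vocab_size, added_tokens):
        self.vocab_size = base_vocab_size  # base vocabulary only
        self.added_tokens = added_tokens   # extra tokens from tokenizer.json

    def __len__(self):
        # len(tokenizer) counts base vocab plus added tokens
        return self.vocab_size + len(self.added_tokens)

# Numbers from the phi-1.5 discussion: 50257 base tokens plus
# 38 added tokens gives 50295 distinct token ids.
tkz = FakeTokenizer(50257, [f"<extra_{i}>" for i in range(38)])
print(tkz.vocab_size)  # 50257
print(len(tkz))        # 50295
```

With a real transformers tokenizer the same comparison would be `tokenizer.vocab_size` versus `len(tokenizer)` (or `len(tokenizer.get_vocab())`).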
...
This is the expected behavior of transformers. Please check this issue: https://github.com/huggingface/transformers/issues/12632
...
I'm afraid the link you suggested doesn't seem very relevant to this issue.
Of course, we can get the actual vocabulary size with len(tokenizer.get_vocab()) or something.
However, the added_tokens are incorporated by default without users specifying them, as defined in [tokenizer.json](https://huggingface.co/microsoft/phi-1_5/blob/main/tokenizer.json).
Given that the argument is supposed to be passed by users, I would not consider this as an "expected behavior" of the library.
The current implementation can cause errors for future users relying on the (presumably widely used) vocab_size attribute, so it would be better corrected, maybe by moving the additional tokens into the default ones.
Thanks for your response.
and it looks like guidance is using tkz.vocab_size here instead of len(tkz) - that is what causes the IndexError
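The failure mode above can be sketched without loading any model. This is a hypothetical illustration, not guidance's actual code: any per-token table sized with `tokenizer.vocab_size` leaves the ids of added tokens out of range, so touching one of them raises IndexError; sizing with `len(tokenizer)` fixes it.

```python
# Sketch of the bug: a per-token table sized with tokenizer.vocab_size
# cannot hold the ids of added tokens.
def build_token_table(size):
    # placeholder for any per-token data structure (decoded strings, masks, ...)
    return ["?"] * size

vocab_size = 50257   # what tokenizer.vocab_size reports for phi-1.5
true_size = 50295    # what len(tokenizer) reports, including added_tokens

table = build_token_table(vocab_size)
added_token_id = 50294  # id of one of the added tokens
try:
    table[added_token_id]
except IndexError:
    print("IndexError: added-token id is outside the vocab_size-sized table")

# The fix: size the table with len(tokenizer) instead.
table = build_token_table(true_size)
print(table[added_token_id])  # prints "?" - now a valid index
```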
(TBH I'm surprised that it doesn't reproduce on your side - somehow we must be getting a different distribution of probabilities over tokens, if I understand correctly what's happening)
Created a PR with a fix: https://github.com/guidance-ai/guidance/pull/460
The fix was merged.
The bug
Loaded transformer with guidance fails with an error when using select
To Reproduce
Give a full working code snippet that can be pasted into a notebook cell or python file. Make sure to include the LLM load step so we know which model you are using.
error log:
System info (please complete the following information):
guidance.__version__: '0.1.1'