Open davidb-cerebras opened 3 months ago
@ArthurZucker Is it possible to fix this in tokenizers?
Yep, you are right, I'll dive a bit to see why we have this!
Awesome thank you!
@ArthurZucker Is there a workaround in the meantime?
Sorry, not yet! I am fixing a bunch of stuff, maybe #1568?
@maximilianmordig Cerebras has implemented a wrapper that corrects the buggy method, feel free to use the wrapper class here: https://github.com/Cerebras/modelzoo/blob/main/src/cerebras/modelzoo/data_preparation/data_preprocessing/custom_tokenizer_example/CustomLlama3Tokenizer.py
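For anyone who can't pull in that dependency, the general idea behind such a workaround can be sketched in a few lines. This is a hypothetical sketch, not the Cerebras wrapper's actual code, and it assumes every token string appears in order as a literal substring of the input (which is not always true for byte-level tokenizers):

```python
def recompute_offsets(text, token_strings):
    """Rebuild (start, end) character offsets by walking the original text,
    instead of trusting a buggy offset_mapping.

    Assumes each token string occurs in `text` in order, without overlap.
    """
    offsets, pos = [], 0
    for tok in token_strings:
        start = text.index(tok, pos)  # find the token at or after `pos`
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets
```

You would decode each token id back to its string first, then pass the strings here; the resulting offsets are correct by construction as long as the assumption above holds.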
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hey, any update on this?
Hey! Sorry, not yet, it's not my stack, and I will investigate for the next release as there is a need from all of you! 🤗
Is there anyone whose stack this is who can try to resolve it?
I think it's `ignore_merges`
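For context, `ignore_merges` is a BPE option (used by Llama-3's tokenizer) that, when enabled, emits a pre-token as a single token if the whole string is already in the vocabulary, instead of running the character-level merge loop. Here is a toy pure-Python illustration of those semantics (not the tokenizers library's actual code) showing how the two paths can produce different token/offset splits for the same word:

```python
def bpe_encode(word, vocab, merges, ignore_merges=False):
    """Toy BPE segmentation returning (tokens, char_offsets).

    With ignore_merges=True, a word found in the vocab is emitted whole,
    skipping the merge loop -- mimicking the option's intent, not the
    tokenizers crate's implementation.
    """
    if ignore_merges and word in vocab:
        return [word], [(0, len(word))]
    symbols = list(word)
    offsets = [(i, i + 1) for i in range(len(word))]
    for a, b in merges:  # apply each merge rule left to right
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
                offsets[i:i + 2] = [(offsets[i][0], offsets[i + 1][1])]
            else:
                i += 1
    return symbols, offsets
```

For a word like "hello" that is in the vocab but has no complete merge path, the two settings disagree: the merge loop yields several tokens with per-piece offsets, while `ignore_merges=True` yields one token spanning the whole word. Offset bookkeeping that only accounts for one of these paths could explain mismatched `return_offsets_mapping` results.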
Opening a new issue for the previously opened issue here -- https://github.com/huggingface/tokenizers/issues/1517
Here we can see that the desired behavior for `return_offsets_mapping` from Mistral gives character indices corresponding to tokens. But for Llama-3 they are not correct.
We can also see Llama-2 and GPT-2 working the same as Mistral, so Llama-3 is definitely the one exhibiting unexpected behavior.