huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Tokenizer.from_bytes() not available in python bindings #1567

Closed RamvigneshPasupathy closed 2 months ago

RamvigneshPasupathy commented 3 months ago

Looking for `Tokenizer.from_bytes()` support in Python, similar to the one in Rust - https://github.com/huggingface/tokenizers/issues/1013

Currently, it is not available in the Python bindings code - https://github.com/huggingface/tokenizers/blob/v0.19.1/bindings/python/src/tokenizer.rs

Why is this needed?

ArthurZucker commented 3 months ago

Would you like to open a PR to add this feature? 🤗

RamvigneshPasupathy commented 3 months ago

Hi @ArthurZucker

I went through the code once more with a view to contributing the Tokenizer.from_bytes() method I asked for, but then I found that the feature I am looking for is already available under a different method name: Tokenizer.from_buffer().

I tried a PoC that loads the tokenizer from the raw bytes of a tokenizer.json file, and it works. Screenshots attached. Please close this issue if the PoC looks good and this code is enough of a reference for anyone using huggingface tokenizers.

[Screenshots of the PoC: Page 1, Page 2]
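
For reference, a minimal sketch of what such a PoC could look like, since the screenshot contents are not reproduced in this thread. This is not the code from the screenshots. The toy `WordLevel` vocabulary and the sample sentence are assumptions made only to keep the example self-contained. It serializes a tokenizer to JSON bytes in memory and loads it back with `Tokenizer.from_buffer()`:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy tokenizer with a hypothetical three-word vocabulary (illustration only).
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
original = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
original.pre_tokenizer = Whitespace()

# to_str() returns the tokenizer serialized as JSON, the same format used
# for tokenizer.json. Encoding it gives the raw bytes a file read would give.
raw_bytes = original.to_str().encode("utf-8")

# Load the tokenizer directly from the in-memory bytes, no file path needed.
restored = Tokenizer.from_buffer(raw_bytes)

# Both tokenizers should produce identical ids for the same input.
original_ids = original.encode("hello world foo").ids
restored_ids = restored.encode("hello world foo").ids
print(original_ids, restored_ids)
```

With a real tokenizer, the bytes would typically come from reading a tokenizer.json in binary mode (for example from a file, an object store, or an archive) and passing the result to `Tokenizer.from_buffer()` in the same way.
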
ArthurZucker commented 3 months ago

Yeah, maybe update the docs to make from_buffer more findable?

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.