huggingface / tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0 · 8.66k stars · 743 forks
Issues (sorted newest first)
#1560 · Fast encode · ArthurZucker · opened 1 week ago · 1 comment
#1559 · Progress bar doesn't show in log file. · amssljc · opened 1 week ago · 2 comments
#1558 · Bump braces from 3.0.2 to 3.0.3 in /tokenizers/examples/unstable_wasm/www · dependabot[bot] · opened 1 week ago · 1 comment
#1557 · Bump ws from 8.8.1 to 8.17.1 in /tokenizers/examples/unstable_wasm/www · dependabot[bot] · opened 1 week ago · 1 comment
#1556 · `Encoding` object stub doesn't include `__len__` · thearchitector · opened 1 week ago · 2 comments
#1555 · Fix decode · ArthurZucker · opened 1 week ago · 1 comment
#1554 · make sure we don't warn on empty tokens · ArthurZucker · closed 1 week ago · 2 comments
#1553 · Llama-3 offset-mapping needs fixing · davidb-cerebras · opened 2 weeks ago · 4 comments
#1552 · [Bug?] Modifying normalizer for pretrained tokenizers don't consistently work · alvations · opened 2 weeks ago · 1 comment
#1551 · feat(ci): add trufflehog secrets detection · McPatate · closed 2 weeks ago · 1 comment
#1550 · Enable `dropout = 0.0` as an equivalent to `none` in BPE · mcognetta · closed 5 days ago · 6 comments
#1549 · How to use `TokenizerBuilder`? · polarathene · opened 3 weeks ago · 3 comments
#1548 · Fixing for clippy 1.78 · Narsil · closed 3 weeks ago · 1 comment
#1547 · Switch from `cached_download` to `hf_hub_download` in tests · Wauplin · closed 2 weeks ago · 2 comments
#1546 · "Solution" to memory hogging in train_new_from_iterator with a hack · morphpiece · opened 3 weeks ago · 6 comments
#1545 · How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer? · LuoKaiGSW · opened 3 weeks ago · 5 comments
#1544 · [BUG] Fast tokenizer does not deal with AddedTokens properly (no problem in Transformers python tokenizer impl.) · MilkClouds · opened 3 weeks ago · 2 comments
#1543 · llama3 tokenizer doesn't round trip · josharian · opened 3 weeks ago · 3 comments
#1542 · Add display capabilities to tokenizers objects · ArthurZucker · opened 3 weeks ago · 1 comment
#1541 · Deserializing BPE tokenizer failure · mcognetta · closed 1 month ago · 4 comments
#1540 · Adding pretty print of tokenizer · haixuanTao · closed 3 weeks ago · 2 comments
#1539 · Memory leak for large strings · noamgai21 · opened 1 month ago · 5 comments
#1537 · Training HuggingFace tokenizer - ignore_merges · ykoyfman · opened 1 month ago · 1 comment
#1536 · [BUG] Might be a bug in Unigram Trainer · Codesticker · opened 1 month ago · 0 comments
#1535 · feat: add support for pyarrow arrays as input · notjedi · opened 1 month ago · 8 comments
#1534 · How to allow the merging of consecutive newline tokens \n when training a byte-level bpe tokenizer? · liuslnlp · opened 1 month ago · 3 comments
#1533 · Make `onig` crate non-optional · nathaniel-daniel · opened 1 month ago · 1 comment
#1532 · Make `USED_PARALLELISM` atomic · nathaniel-daniel · closed 3 weeks ago · 3 comments
#1531 · How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification · insookim43 · closed 1 week ago · 1 comment
#1530 · Converting `tokenizers` tokenizers into `tiktoken` tokenizers · umarbutler · closed 9 hours ago · 5 comments
#1529 · Bug with `CodeQwen1.5`: `data did not match any variant of untagged enum PyPreTokenizerTypeWrapper` · QwertyJack · closed 1 month ago · 1 comment
#1528 · Strange warnings with tokenizer for some models · EricLBuehler · closed 1 month ago · 5 comments
#1526 · Link to download the training text in `docs/source/quicktour.rst` is broken · 14jdelap · opened 1 month ago · 5 comments
#1525 · How to write custom Wordpiece class? · xinyinan9527 · opened 1 month ago · 2 comments
#1524 · Convert huggingface tokenizer into sentencepiece format · RRaphaell · opened 1 month ago · 2 comments
#1523 · ❓ Get stats (e.g. counts) about the merged pairs · pietrolesci · closed 2 weeks ago · 3 comments
#1522 · Error: Cannot find module 'tokenizers/bindings/tokenizer' · meichangsu1 · closed 2 weeks ago · 1 comment
#1521 · remove enforcement of non special when adding tokens · ArthurZucker · closed 1 month ago · 2 comments
#1520 · Why are 'unknown' tokens randomly added to my tokenized input? · tshmak · closed 2 months ago · 2 comments
#1519 · Why the tokenizer is slower than tiktoken? · BigBinnie · opened 2 months ago · 5 comments
#1518 · Loading `tokenizer.model` with Rust API · EricLBuehler · opened 2 months ago · 10 comments
#1517 · Llama3 tokenizer with Incorrect offset_mapping · justin-shao · closed 3 weeks ago · 3 comments
#1516 · Tokens Removed from Trained Custom BPE Tokenizer · rteehas · closed 2 months ago · 0 comments
#1515 · UnigramTrainer: byte_fallback is false. · Moddus · opened 2 months ago · 2 comments
#1514 · BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased · Abhinay1997 · opened 2 months ago · 1 comment
#1513 · [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder · Narsil · closed 1 month ago · 6 comments
#1512 · Breaking changes in v0.19.1 for tiktoken/llama3 · sanderland · opened 2 months ago · 6 comments
#1511 · Fix "dictionnary" typo · nprisbrey · closed 2 weeks ago · 3 comments
#1510 · change conditional compilation for regex libraries · semaraugusto · opened 2 months ago · 1 comment
#1509 · Cross-compilation fails for custom target · semaraugusto · opened 2 months ago · 2 comments