huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.68k stars
745 forks
issues
Potential vulnerability: Control token injection through Jinja templates in apply_chat_template
#1458
pluiez
closed
2 months ago
2
Support operating computer system
#1457
Southpika
closed
3 months ago
2
Bump ip from 2.0.0 to 2.0.1 in /bindings/node
#1456
dependabot[bot]
closed
3 months ago
1
Wish for a Summary field in the METADATA for Python tokenizers-0.15.1 wheels
#1455
stonebig
closed
3 months ago
1
When compiling with the tch-rs library, a link error occurs from a conflict between static libcpmt.lib and dynamic msvcprt.lib
#1454
devdoer3
closed
4 months ago
2
Release candidate
#1453
ArthurZucker
closed
4 months ago
1
thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22: AddedVocabulary bad split
#1452
pseudotensor
opened
4 months ago
11
Accessing Tokenizer's model_max_length config
#1451
dopc
closed
4 months ago
1
bugs
#1450
projecthorizon993
closed
4 months ago
5
Normalizer "replace" is quadratic in sequence length (impacts Llama 2 tokenizer)
#1449
dlwh
closed
3 months ago
6
[Potential Bug] Mistral Tokenizer Inconsistencies
#1448
komninoschatzipapas
closed
2 months ago
6
tokenizers.cpython-311-darwin.so wrong architecture
#1447
Casper-Mars
closed
3 months ago
2
Building a tokenizer for tokenizing Java code
#1446
nimanthadilz
closed
3 months ago
2
Outputting many different tokenizer vocab sizes for comparisons
#1445
pierrj
closed
4 months ago
2
Added support for building an `AddedVocabulary` based on a pre-existing `AddedVocabulary`.
#1444
eaplatanios
closed
3 months ago
9
Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`.
#1443
eaplatanios
closed
3 months ago
7
chore: Remove CLI - this was originally intended for local development
#1442
bryantbiggs
closed
4 months ago
5
chore: Update dependencies to latest supported versions
#1441
bryantbiggs
closed
5 months ago
1
Is there guidance for adapting tokenizers to a C++ project?
#1440
7908459077d71a548753960a12b71146
closed
4 months ago
5
/
#1439
gabrielolympie
closed
5 months ago
0
Update release for python3.12 windows
#1438
ArthurZucker
closed
5 months ago
1
Encode special tokens
#1437
ArthurZucker
closed
5 months ago
1
[`remove black`] And use ruff
#1436
ArthurZucker
closed
3 months ago
1
Prepare RC 0
#1435
Narsil
closed
5 months ago
1
tokenizer.train_new_from_iterator() takes time
#1434
asphytheghoul
closed
4 months ago
2
Convert word counts to u64
#1433
stephenroller
closed
4 months ago
8
ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported. Getting this error when I try to run the below code:
#1431
SharathK-Tiger
closed
5 months ago
1
Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www
#1430
dependabot[bot]
closed
5 months ago
1
Python3.12 build for Windows is not available
#1429
ghostplant
closed
5 months ago
3
Fix make bench.
#1428
Narsil
closed
5 months ago
1
Profile-Guided Optimization (PGO) benchmark results
#1426
zamazan4ik
closed
4 months ago
4
"make bench" command does not download all required resources
#1425
zamazan4ik
closed
5 months ago
0
Decoding Issue for Latin Characters in `added_tokens`
#1424
44670
closed
4 months ago
2
Possible bug in case of prepending chars in a pretokenizer
#1423
ivankrylatskoe
closed
1 month ago
9
loading `added_tokens.json`
#1422
kczimm
closed
6 months ago
3
Memory Leak in encode_batch Function
#1421
Atakey
closed
4 months ago
5
Add quick doc to byte_level.rs
#1420
steventrouble
closed
6 months ago
1
add option to skip special tokens
#1419
ArthurZucker
closed
5 months ago
5
Unsupported platform for tokenizers
#1418
KolbySisk
closed
5 months ago
2
Questions re: Tokenizer pipeline composability
#1417
ahgraber
closed
6 months ago
2
ModuleNotFoundError: No module named 'tokenizers.tokenizers'
#1416
supreetkt
closed
5 months ago
6
Support PyArrow arrays as tokenizer input
#1415
mariosasko
opened
6 months ago
11
Faster HF dataset iteration in docs
#1414
mariosasko
closed
6 months ago
1
Efficient Replace normalizer
#1413
rlrs
closed
4 months ago
9
Performance of tokenizer for CLIP text model
#1412
michael-p
closed
5 months ago
2
How to create Tokenizer.json?
#1410
kenaii
closed
5 months ago
2
Tokenizer **not saving/loading** correctly after adding tokens, then training
#1409
dinhanhx
closed
4 months ago
8
Special tokens will be split when there is no space before them
#1408
leizhao1234
closed
6 months ago
1
How to add byte_fallback tokens?
#1407
dinhanhx
opened
7 months ago
4
Release Candidate
#1406
ArthurZucker
closed
5 months ago
2