huggingface / tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0 · 8.93k stars · 779 forks
Issues (sorted newest first)
#1440 · is there a guidance to adapt tokenizers to c++ project? · 7908459077d71a548753960a12b71146 · closed 7 months ago · 5 comments
#1439 · gabrielolympie · closed 8 months ago · 0 comments
#1438 · Update release for python3.12 windows · ArthurZucker · closed 8 months ago · 1 comment
#1437 · Encode special tokens · ArthurZucker · closed 8 months ago · 1 comment
#1436 · [`remove black`] And use ruff · ArthurZucker · closed 6 months ago · 1 comment
#1435 · Prepare RC 0 · Narsil · closed 8 months ago · 1 comment
#1434 · tokenizer.train_new_from_iterator() takes time · asphytheghoul · closed 7 months ago · 2 comments
#1433 · Convert word counts to u64 · stephenroller · closed 8 months ago · 8 comments
#1431 · ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported. Getting this error when I try to run the below code: · SharathK-Tiger · closed 8 months ago · 1 comment
#1430 · Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www · dependabot[bot] · closed 8 months ago · 1 comment
#1429 · Python3.12 build for Windows is not available · ghostplant · closed 8 months ago · 3 comments
#1428 · Fix make bench. · Narsil · closed 9 months ago · 1 comment
#1426 · Profile-Guided Optimization (PGO) benchmark results · zamazan4ik · closed 7 months ago · 4 comments
#1425 · "make bench" command does not download all required resources · zamazan4ik · closed 9 months ago · 0 comments
#1424 · Decoding Issue for Latin Characters in `added_tokens` · 44670 · closed 7 months ago · 2 comments
#1423 · Possible bug in case of prepending chars in a pretokenizer · ivankrylatskoe · closed 5 months ago · 9 comments
#1422 · loading `added_tokens.json` · kczimm · closed 9 months ago · 3 comments
#1421 · Memory Leak in encode_batch Function · Atakey · closed 7 months ago · 5 comments
#1420 · Add quick doc to byte_level.rs · steventrouble · closed 9 months ago · 1 comment
#1419 · add option to skip special tokens · ArthurZucker · closed 8 months ago · 5 comments
#1418 · Unsupported platform for tokenizers · KolbySisk · closed 8 months ago · 2 comments
#1417 · Questions re: Tokenizer pipeline composability · ahgraber · closed 9 months ago · 2 comments
#1416 · ModuleNotFoundError: No module named 'tokenizers.tokenizers' · supreetkt · closed 8 months ago · 7 comments
#1415 · Support PyArrow arrays as tokenizer input · mariosasko · closed 2 months ago · 12 comments
#1414 · Faster HF dataset iteration in docs · mariosasko · closed 9 months ago · 1 comment
#1413 · Efficient Replace normalizer · rlrs · closed 8 months ago · 9 comments
#1412 · Performance of tokenizer for CLIP text model · michael-p · closed 8 months ago · 2 comments
#1410 · How to create Tokenizer.json? · kenaii · closed 8 months ago · 2 comments
#1409 · Tokenizer **not saving/loading** correctly after adding tokens, then training · dinhanhx · closed 7 months ago · 8 comments
#1408 · Special tokens will be split when there is no space before them · leizhao1234 · closed 9 months ago · 1 comment
#1407 · How to add byte_fallback tokens? · dinhanhx · opened 10 months ago · 5 comments
#1406 · Release Candidate · ArthurZucker · closed 8 months ago · 2 comments
#1405 · Tokenization is super slow when using XGLMTokenizer or XGLMTokenizerFast · jonas-klesen · closed 8 months ago · 7 comments
#1404 · Stale bot. · Narsil · closed 10 months ago · 1 comment
#1403 · Use NodeJs: Cannot find module 'tokenizers-darwin-arm64' · guotingchao · closed 5 months ago · 8 comments
#1402 · Installation error with pip install tokenizers==0.12.1 – Compatibility issue with Python 3.6.15 and Rust 1.72.0 · AhmetTasdemir · closed 8 months ago · 11 comments
#1401 · Demonstrating Sentence Truncation in Tokenization · AliHaiderAhmad001 · closed 9 months ago · 3 comments
#1400 · Another Implementation (faster and more efficient) of BPE Training Algorithm · Yikai-Liao · closed 7 months ago · 39 comments
#1399 · A whitespace character not displaying at a specific position · scissorstail · closed 10 months ago · 2 comments
#1398 · Rust tokenizer fails! · arunpatro · closed 9 months ago · 2 comments
#1397 · Integration with google/oss-fuzz for continuous fuzzing · silvergasp · closed 9 months ago · 1 comment
#1396 · fuzz: Add a BPE training fuzzer · silvergasp · closed 9 months ago · 1 comment
#1395 · train_new_from_iterator fails in non-space separated languages · frotaur · closed 8 months ago · 5 comments
#1394 · Fix: fixing the inconsistency in byte-level tokenization when using pre_tokenizer.sequence. · junrae6454 · closed 9 months ago · 1 comment
#1393 · unable to install on python 3.12 via pip · binary-husky · closed 8 months ago · 10 comments
#1392 · added_tokens with bytemap characters in ByteLevel could not be decoded correctly · DOGEwbx · closed 2 months ago · 9 comments
#1391 · How to split special token in encode? · leizhao1234 · closed 9 months ago · 5 comments
#1390 · update to version = "0.15.1-dev0" · ArthurZucker · closed 10 months ago · 1 comment
#1389 · apply_chat_template() with tokenize=False returns incorrect string · Gnurro · closed 9 months ago · 2 comments
#1388 · Release Candidate · ArthurZucker · closed 10 months ago · 1 comment