huggingface / tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0 · 8.66k stars · 743 forks
Issues (sorted newest first)
#1560 · Fast encode · ArthurZucker · opened 1 week ago · 1 comment
#1559 · Progress bar doesn't show in log file. · amssljc · opened 1 week ago · 2 comments
#1558 · Bump braces from 3.0.2 to 3.0.3 in /tokenizers/examples/unstable_wasm/www · dependabot[bot] · opened 1 week ago · 1 comment
#1557 · Bump ws from 8.8.1 to 8.17.1 in /tokenizers/examples/unstable_wasm/www · dependabot[bot] · opened 1 week ago · 1 comment
#1556 · `Encoding` object stub doesn't include `__len__` · thearchitector · opened 1 week ago · 2 comments
#1555 · Fix decode · ArthurZucker · opened 1 week ago · 1 comment
#1554 · make sure we don't warn on empty tokens · ArthurZucker · closed 1 week ago · 2 comments
#1553 · Llama-3 offset-mapping needs fixing · davidb-cerebras · opened 2 weeks ago · 4 comments
#1552 · [Bug?] Modifying normalizer for pretrained tokenizers don't consistently work · alvations · opened 2 weeks ago · 1 comment
#1551 · feat(ci): add trufflehog secrets detection · McPatate · closed 2 weeks ago · 1 comment
#1550 · Enable `dropout = 0.0` as an equivalent to `none` in BPE · mcognetta · closed 5 days ago · 6 comments
#1549 · How to use `TokenizerBuilder`? · polarathene · opened 3 weeks ago · 3 comments
#1548 · Fixing for clippy 1.78 · Narsil · closed 3 weeks ago · 1 comment
#1547 · Switch from `cached_download` to `hf_hub_download` in tests · Wauplin · closed 2 weeks ago · 2 comments
#1546 · "Solution" to memory hogging in train_new_from_iterator with a hack · morphpiece · opened 3 weeks ago · 6 comments
#1545 · How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer? · LuoKaiGSW · opened 3 weeks ago · 5 comments
#1544 · [BUG] Fast tokenizer does not deal with AddedTokens properly (no problem in Transformers python tokenizer impl.) · MilkClouds · opened 3 weeks ago · 2 comments
#1543 · llama3 tokenizer doesn't round trip · josharian · opened 3 weeks ago · 3 comments
#1542 · Add display capabilities to tokenizers objects · ArthurZucker · opened 3 weeks ago · 1 comment
#1541 · Deserializing BPE tokenizer failure · mcognetta · closed 1 month ago · 4 comments
#1540 · Adding pretty print of tokenizer · haixuanTao · closed 3 weeks ago · 2 comments
#1539 · Memory leak for large strings · noamgai21 · opened 1 month ago · 5 comments
#1537 · Training HuggingFace tokenizer - ignore_merges · ykoyfman · opened 1 month ago · 1 comment
#1536 · [BUG] Might be a bug in Unigram Trainer · Codesticker · opened 1 month ago · 0 comments
#1535 · feat: add support for pyarrow arrays as input · notjedi · opened 1 month ago · 8 comments
#1534 · How to allow the merging of consecutive newline tokens \n when training a byte-level bpe tokenizer? · liuslnlp · opened 1 month ago · 3 comments
#1533 · Make `onig` crate non-optional · nathaniel-daniel · opened 1 month ago · 1 comment
#1532 · Make `USED_PARALLELISM` atomic · nathaniel-daniel · closed 3 weeks ago · 3 comments
#1531 · How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification · insookim43 · closed 1 week ago · 1 comment
#1530 · Converting `tokenizers` tokenizers into `tiktoken` tokenizers · umarbutler · closed 9 hours ago · 5 comments
#1529 · Bug with `CodeQwen1.5`: `data did not match any variant of untagged enum PyPreTokenizerTypeWrapper` · QwertyJack · closed 1 month ago · 1 comment
#1528 · Strange warnings with tokenizer for some models · EricLBuehler · closed 1 month ago · 5 comments
#1526 · Link to download the training text in `docs/source/quicktour.rst` is broken · 14jdelap · opened 1 month ago · 5 comments
#1525 · How to write custom Wordpiece class? · xinyinan9527 · opened 1 month ago · 2 comments
#1524 · Convert huggingface tokenizer into sentencepiece format · RRaphaell · opened 1 month ago · 2 comments
#1523 · ❓ Get stats (e.g. counts) about the merged pairs · pietrolesci · closed 2 weeks ago · 3 comments
#1522 · Error: Cannot find module 'tokenizers/bindings/tokenizer' · meichangsu1 · closed 2 weeks ago · 1 comment
#1521 · remove enforcement of non special when adding tokens · ArthurZucker · closed 1 month ago · 2 comments
#1520 · Why are 'unknown' tokens randomly added to my tokenized input? · tshmak · closed 2 months ago · 2 comments
#1519 · Why the tokenizer is slower than tiktoken? · BigBinnie · opened 2 months ago · 5 comments
#1518 · Loading `tokenizer.model` with Rust API · EricLBuehler · opened 2 months ago · 10 comments
#1517 · Llama3 tokenizer with Incorrect offset_mapping · justin-shao · closed 3 weeks ago · 3 comments
#1516 · Tokens Removed from Trained Custom BPE Tokenizer · rteehas · closed 2 months ago · 0 comments
#1515 · UnigramTrainer: byte_fallback is false. · Moddus · opened 2 months ago · 2 comments
#1514 · BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased · Abhinay1997 · opened 2 months ago · 1 comment
#1513 · [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder · Narsil · closed 1 month ago · 6 comments
#1512 · Breaking changes in v0.19.1 for tiktoken/llama3 · sanderland · opened 2 months ago · 6 comments
#1511 · Fix "dictionnary" typo · nprisbrey · closed 2 weeks ago · 3 comments
#1510 · change conditional compilation for regex libraries · semaraugusto · opened 2 months ago · 1 comment
#1509 · Cross-compilation fails for custom target · semaraugusto · opened 2 months ago · 2 comments