huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.67k stars
743 forks
issues
Add `.editorconfig` and `rustfmt.toml` for Consistent Code Formatting
#1508
tal7aouy
closed
1 month ago
1
Treatment of hyphenated words
#1507
rattle99
closed
1 month ago
3
Python Binding: Tokenizer.from_file() cannot parse JSON file of tokens
#1506
dwash96
closed
2 months ago
2
Failing to build bindings with 0.19.1
#1505
bryteise
closed
2 days ago
5
add serialization for `ignore_merges`
#1504
ArthurZucker
closed
2 months ago
1
corrected typo in the documentations for pre-tokenizers
#1503
GorkaUrbizu
closed
1 month ago
0
offline installation
#1502
HankLiu10
closed
1 month ago
3
Extended vocab tokenizer merging text into a single string without spaces while decoding
#1501
savanth14
closed
1 week ago
4
Issue in installing rudalle on google colab, !pip install rudalle
#1500
deepanshh786
closed
1 month ago
2
Fixing doc.
#1499
Narsil
closed
2 months ago
1
Bumping all versions 3 times (ty transformers :) )
#1498
Narsil
closed
2 months ago
1
Remove 3.13 (potential undefined behavior.)
#1497
Narsil
closed
2 months ago
1
StripAccents doesn't work
#1496
NivinaNull
closed
1 month ago
1
LLamaTokenizer with `use_fast=True` / and `use_fast=False` causing memory leak when used with multiprocessing / `dataset.map(num_proc)`
#1495
michaelfeil
opened
2 months ago
7
PyO3 0.21.
#1494
Narsil
closed
2 months ago
1
Add more support for tiktoken based tokenizers
#1493
ArthurZucker
closed
2 months ago
1
Fix unsoundness in `tokenizers::utils::parallelism`
#1492
albertsgarde
closed
3 weeks ago
4
Unsound use of unsafe in `src/utils/parallelism.rs`
#1491
albertsgarde
closed
1 month ago
1
Deepseeker model completely loses performance after using tokenizer.add_tokens(special_tokens)
#1490
bin123apple
closed
1 month ago
2
Discrepancy Between GitHub Release and NPM Package Version & Missing Dependencies
#1489
superBertBerg
opened
2 months ago
4
Fix data directory for test
#1488
atupone
closed
1 month ago
1
Is it possible to pass a tokenizer from Python into Rust?
#1487
albertsgarde
closed
1 month ago
2
Fix Strip decoder doc comment
#1486
jacklee1792
closed
1 month ago
0
error: casting `&T` to `&mut T` is undefined behavior
#1485
Jipok
closed
1 month ago
7
Candidate release
#1484
ArthurZucker
closed
2 months ago
1
fix: change var name from `vocab` to `vocab_file`
#1483
shenxiangzhuang
closed
1 month ago
0
fix: typo
#1482
shenxiangzhuang
closed
1 month ago
0
`BertWordPieceTokenizer` not saving with `sep_token` marked
#1481
AngledLuffa
closed
2 months ago
2
tokenizers-linux-x64-musl is not found when running inside node alpine docker
#1480
madhurjya-acko
closed
1 month ago
2
Bump express from 4.18.1 to 4.19.2 in /tokenizers/examples/unstable_wasm/www
#1479
dependabot[bot]
closed
1 month ago
2
Bump webpack-dev-middleware from 5.3.3 to 5.3.4 in /tokenizers/examples/unstable_wasm/www
#1478
dependabot[bot]
closed
2 months ago
2
`cargo build` fails for python bindings when `--locked` is passed for `v0.15.1` and `v0.15.2`
#1477
CobaltCause
closed
1 month ago
4
Refactor metaspace
#1476
ArthurZucker
closed
3 months ago
7
Issue merging across whitespaces
#1475
henrycharlesworth
closed
2 months ago
2
BPE Decoder cleanup option
#1474
w-zygmuntowicz
closed
2 months ago
2
Assign `<unusedXX>` tokens with `special_tokens` without growing vocab size
#1473
jacobwjs
opened
3 months ago
4
Bump follow-redirects from 1.15.4 to 1.15.6 in /tokenizers/examples/unstable_wasm/www
#1472
dependabot[bot]
closed
2 months ago
2
Train tokenizer on integer lists, not strings
#1471
rteehas
opened
3 months ago
6
Tokens display issues
#1470
jordane95
closed
2 months ago
2
How to load tokenizer trained by sentencepiece or tiktoken
#1469
jordane95
closed
2 months ago
5
How to convert tokenizers.tokenizer to XXTokenizerFast in transformers?
#1468
rangehow
opened
3 months ago
3
New Update causes add_special_tokens not recognized
#1466
sravell
closed
2 months ago
4
Update pyproject.toml
#1465
stonebig
closed
2 months ago
0
Tokenizer dataset is very slow
#1464
ManuSinghYadav
closed
2 months ago
2
different output of AutoTokenizer from that of T5tokenizer
#1463
sm745052
closed
4 months ago
1
feat: support custom regexes for GPT pre-tokenizer
#1462
gcampax
opened
4 months ago
5
BpeTrainer seems to ignore max_token_length=1
#1461
geajack
closed
2 months ago
2
Training a tokenizer with limited memory
#1460
arxyzan
closed
2 months ago
3
build(node): Include binaries in NPM packing
#1459
aaronclong
closed
2 months ago
10
Potential vulnerability: Control token injection through Jinja templates in apply_chat_template
#1458
pluiez
closed
2 months ago
2