huggingface / tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0 · 8.68k stars · 746 forks

Issues (sorted newest first)
| # | Title | Author | State | Updated | Comments |
|---|-------|--------|-------|---------|----------|
| #1406 | Release Candidate | ArthurZucker | closed | 5 months ago | 2 |
| #1405 | Tokenization is super slow when using XGLMTokenizer or XGLMTokenizerFast | deklesen | closed | 5 months ago | 7 |
| #1404 | Stale bot. | Narsil | closed | 7 months ago | 1 |
| #1403 | Use NodeJs: Cannot find module 'tokenizers-darwin-arm64' | guotingchao | closed | 2 months ago | 8 |
| #1402 | Installation error with pip install tokenizers==0.12.1 – Compatibility issue with Python 3.6.15 and Rust 1.72.0 | AhmetTasdemir | closed | 5 months ago | 10 |
| #1401 | Demonstrating Sentence Truncation in Tokenization | AliHaiderAhmad001 | closed | 6 months ago | 3 |
| #1400 | Another Implementation (faster and more effecient) of BPE Training Algorithm | Yikai-Liao | closed | 4 months ago | 29 |
| #1399 | A whitespace character not displaying at a specific position | scissorstail | closed | 7 months ago | 2 |
| #1398 | Rust tokenizer fails! | arunpatro | closed | 6 months ago | 2 |
| #1397 | Integration with google/oss-fuzz for continuous fuzzing | silvergasp | closed | 6 months ago | 1 |
| #1396 | fuzz: Add a BPE training fuzzer | silvergasp | closed | 6 months ago | 1 |
| #1395 | train_new_from_iterator fails in non-space separated languages | frotaur | closed | 5 months ago | 5 |
| #1394 | Fix: fixing the inconsistency in byte-level tokenization when using pre_tokenizer.sequence. | junrae6454 | closed | 6 months ago | 1 |
| #1393 | unable to install on python 3.12 via pip | binary-husky | closed | 5 months ago | 10 |
| #1392 | added_tokens with bytemap charaters in ByteLevel could not be decoded correctly | DOGEwbx | open | 7 months ago | 9 |
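Several issues above (#1400, #1396, #1395) concern BPE training. As background, here is a minimal pure-Python sketch of the core BPE merge loop: repeatedly count adjacent symbol pairs and merge the most frequent one. This is only an illustration of the algorithm, not the library's Rust implementation.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: learn merges from a {word: frequency} dict."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
# First "l"+"o" is merged (frequency 10), then "lo"+"w" (frequency 10).
```

The real trainer does the same conceptual work but with considerably more machinery (pair-count heaps, parallelism, alphabet and frequency thresholds), which is what the performance-focused proposals above target.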
| # | Title | Author | State | Updated | Comments |
|---|-------|--------|-------|---------|----------|
| #1391 | How to split special token in encode? | leizhao1234 | closed | 6 months ago | 5 |
| #1390 | udpate to version = "0.15.1-dev0" | ArthurZucker | closed | 7 months ago | 1 |
| #1389 | apply_chat_template() with tokenize=False returns incorrect string | Gnurro | closed | 6 months ago | 2 |
| #1388 | Release Candidate | ArthurZucker | closed | 7 months ago | 1 |
| #1387 | is there a javascript version for tokenizers | Zwe1 | closed | 6 months ago | 2 |
| #1386 | pyo3: update to 0.20 | mikelui | closed | 5 months ago | 6 |
| #1385 | Allow `huggingface_hub<1.0` | Wauplin | closed | 7 months ago | 6 |
| #1384 | Error: Cannot find module 'tokenizers-linux-x64-musl' | Madnex | closed | 5 months ago | 6 |
| #1383 | Allow hf_hub 0.18 | mariosasko | closed | 8 months ago | 4 |
| #1382 | Fix truncation length assertion | boyleconnor | closed | 6 months ago | 3 |
| #1381 | Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method | epwalsh | closed | 7 months ago | 2 |
| #1380 | Add tokens not impacted by training | StellaAthena | closed | 6 months ago | 6 |
| #1379 | Add C++ bindings by mlc-ai to README | ShukantPal | closed | 6 months ago | 0 |
| #1378 | Rename modeled `token_to_id` | chris-ha458 | closed | 6 months ago | 2 |
| #1377 | Allow tokenizers to use huggingface_hub 0.18.0 | clefourrier | closed | 8 months ago | 1 |
| #1376 | RobertaTokenizer : tokenizer.decode and tokenizer.tokenize do not generate the same output | BettyFabre | closed | 6 months ago | 4 |
| #1375 | Question: what is the add_special_tokens parameter of Tokenizer::encode? | EricLBuehler | closed | 8 months ago | 4 |
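Issue #1375 asks what the `add_special_tokens` parameter of `Tokenizer::encode` does, and #1391 and #1368 ask related questions about special-token handling. A toy sketch of the usual effect, assuming a BERT-style post-processing template that wraps the sequence in `[CLS]`/`[SEP]` (the function and token names here are illustrative, not the library's internals):

```python
def encode(tokens, add_special_tokens=True, cls="[CLS]", sep="[SEP]"):
    """Toy illustration: with a BERT-style template, the post-processor
    wraps the token sequence in [CLS] ... [SEP] only when
    add_special_tokens is True; otherwise the sequence is left bare."""
    if add_special_tokens:
        return [cls] + list(tokens) + [sep]
    return list(tokens)

encode(["hello", "world"])                            # ["[CLS]", "hello", "world", "[SEP]"]
encode(["hello", "world"], add_special_tokens=False)  # ["hello", "world"]
```

In the real library the flag controls whether the tokenizer's configured post-processor template is applied at all; which tokens get added depends entirely on that template.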
| # | Title | Author | State | Updated | Comments |
|---|-------|--------|-------|---------|----------|
| #1374 | add_tokens has no effect in llama fast tokenizer | tiandiweizun | closed | 8 months ago | 1 |
| #1373 | Can not load tokoenizer from_pretrained through http_proxy since 0.14.0 | jtsai-quid | closed | 5 months ago | 7 |
| #1372 | end_of_word_suffix = "</w>" no work?? | longday1102 | closed | 7 months ago | 3 |
| #1371 | fix: remove useless token | rtrompier | closed | 8 months ago | 1 |
| #1370 | Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node | dependabot[bot] | closed | 8 months ago | 0 |
| #1369 | `BPE` tokenization model does not respect custom `RegEx` via `Split` pre-tokenizer | hogru | closed | 5 months ago | 8 |
| #1368 | How can we ignore special tokens when encoding text | DOGEwbx | closed | 5 months ago | 8 |
| #1367 | Fix doc links in readme | Pierrci | closed | 6 months ago | 1 |
| #1366 | Warnings for added tokens not present in the vocab | jneuff | closed | 6 months ago | 7 |
| #1365 | cannot install with yarn & missing module in npm | MaelAbgrall | closed | 5 months ago | 6 |
| #1364 | Wrapping Tokenizer leads to version error | shivanraptor | closed | 8 months ago | 3 |
| #1363 | Difference between slow and fast GPT2 tokenizers | goerch | closed | 8 months ago | 9 |
| #1362 | When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces? | enze5088 | closed | 8 months ago | 4 |
| #1361 | Exception: Custom Normalizer cannot be serialized | shivanraptor | closed | 8 months ago | 1 |
| #1360 | Errors "Using sep_token, but it is not set yet." loading tokenizer trained from scratch | velocityCavalry | closed | 6 months ago | 3 |
| #1359 | bc3ec39d breaks the compilation (as noted in #1355) | baptisterajaut | closed | 4 months ago | 13 |
| #1358 | Different behaviour of BPE encoder after update to 0.14.1 | DOGEwbx | closed | 8 months ago | 14 |
| #1357 | [`pre_tokenizers`] Fix sentencepiece based Metaspace | ArthurZucker | closed | 7 months ago | 3 |
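Issue #1369 reports that the `BPE` model does not respect a custom `RegEx` passed via the `Split` pre-tokenizer. For context, here is a pure-Python sketch of what a `Split` pre-tokenizer with `behavior="isolated"` is expected to produce, using only the standard `re` module rather than the library itself:

```python
import re

def split_isolated(text, pattern):
    """Toy illustration of a Split pre-tokenizer with behavior='isolated':
    each regex match becomes a standalone piece, and the text between
    matches is kept as separate pieces (nothing is discarded)."""
    pieces, last = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > last:
            pieces.append(text[last:m.start()])  # text before the match
        pieces.append(m.group())                 # the match itself, isolated
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])               # trailing text
    return pieces

split_isolated("a1b22c", r"\d+")  # ["a", "1", "b", "22", "c"]
```

The library's `Split` pre-tokenizer supports other behaviors as well (e.g. `"removed"`, which drops the matches, and `"merged_with_previous"`/`"merged_with_next"`, which attach them to a neighbor); the sketch above covers only the isolated case.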