huggingface / tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0 · 8.68k stars · 746 forks

Issues (sorted newest first)
| # | Title | Author | State | Updated | Comments |
|---|-------|--------|-------|---------|----------|
| #1406 | Release Candidate | ArthurZucker | closed | 5 months ago | 2 |
| #1405 | Tokenization is super slow when using XGLMTokenizer or XGLMTokenizerFast | deklesen | closed | 5 months ago | 7 |
| #1404 | Stale bot. | Narsil | closed | 7 months ago | 1 |
| #1403 | Use NodeJs: Cannot find module 'tokenizers-darwin-arm64' | guotingchao | closed | 2 months ago | 8 |
| #1402 | Installation error with pip install tokenizers==0.12.1 – Compatibility issue with Python 3.6.15 and Rust 1.72.0 | AhmetTasdemir | closed | 5 months ago | 10 |
| #1401 | Demonstrating Sentence Truncation in Tokenization | AliHaiderAhmad001 | closed | 6 months ago | 3 |
| #1400 | Another Implementation (faster and more effecient) of BPE Training Algorithm | Yikai-Liao | closed | 4 months ago | 29 |
| #1399 | A whitespace character not displaying at a specific position | scissorstail | closed | 7 months ago | 2 |
| #1398 | Rust tokenizer fails! | arunpatro | closed | 6 months ago | 2 |
| #1397 | Integration with google/oss-fuzz for continuous fuzzing | silvergasp | closed | 6 months ago | 1 |
| #1396 | fuzz: Add a BPE training fuzzer | silvergasp | closed | 6 months ago | 1 |
| #1395 | train_new_from_iterator fails in non-space separated languages | frotaur | closed | 5 months ago | 5 |
| #1394 | Fix: fixing the inconsistency in byte-level tokenization when using pre_tokenizer.sequence. | junrae6454 | closed | 6 months ago | 1 |
| #1393 | unable to install on python 3.12 via pip | binary-husky | closed | 5 months ago | 10 |
| #1392 | added_tokens with bytemap charaters in ByteLevel could not be decoded correctly | DOGEwbx | open | 7 months ago | 9 |
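Several issues above (#1400, #1396, #1395) concern BPE training. As background, here is a minimal pure-Python sketch of the core BPE merge loop: repeatedly count adjacent symbol pairs and merge the most frequent one. This is only an illustration of the algorithm, not the library's Rust implementation.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: learn merges from a {word: frequency} dict."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
# First "l"+"o" is merged (frequency 10), then "lo"+"w" (frequency 10).
```

The real trainer does the same conceptual work but with considerably more machinery (pair-count heaps, parallelism, alphabet and frequency thresholds), which is what the performance-focused proposals above target.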
| # | Title | Author | State | Updated | Comments |
|---|-------|--------|-------|---------|----------|
| #1391 | How to split special token in encode? | leizhao1234 | closed | 6 months ago | 5 |
| #1390 | udpate to version = "0.15.1-dev0" | ArthurZucker | closed | 7 months ago | 1 |
| #1389 | apply_chat_template() with tokenize=False returns incorrect string | Gnurro | closed | 6 months ago | 2 |
| #1388 | Release Candidate | ArthurZucker | closed | 7 months ago | 1 |
| #1387 | is there a javascript version for tokenizers | Zwe1 | closed | 6 months ago | 2 |
| #1386 | pyo3: update to 0.20 | mikelui | closed | 5 months ago | 6 |
| #1385 | Allow `huggingface_hub<1.0` | Wauplin | closed | 7 months ago | 6 |
| #1384 | Error: Cannot find module 'tokenizers-linux-x64-musl' | Madnex | closed | 5 months ago | 6 |
| #1383 | Allow hf_hub 0.18 | mariosasko | closed | 8 months ago | 4 |
| #1382 | Fix truncation length assertion | boyleconnor | closed | 6 months ago | 3 |
| #1381 | Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method | epwalsh | closed | 7 months ago | 2 |
| #1380 | Add tokens not impacted by training | StellaAthena | closed | 6 months ago | 6 |
| #1379 | Add C++ bindings by mlc-ai to README | ShukantPal | closed | 6 months ago | 0 |
| #1378 | Rename modeled `token_to_id` | chris-ha458 | closed | 6 months ago | 2 |
| #1377 | Allow tokenizers to use huggingface_hub 0.18.0 | clefourrier | closed | 8 months ago | 1 |
| #1376 | RobertaTokenizer : tokenizer.decode and tokenizer.tokenize do not generate the same output | BettyFabre | closed | 6 months ago | 4 |
| #1375 | Question: what is the add_special_tokens parameter of Tokenizer::encode? | EricLBuehler | closed | 8 months ago | 4 |
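Issue #1375 asks what the `add_special_tokens` parameter of `Tokenizer::encode` does, and #1391 and #1368 ask related questions about special-token handling. A toy sketch of the usual effect, assuming a BERT-style post-processing template that wraps the sequence in `[CLS]`/`[SEP]` (the function and token names here are illustrative, not the library's internals):

```python
def encode(tokens, add_special_tokens=True, cls="[CLS]", sep="[SEP]"):
    """Toy illustration: with a BERT-style template, the post-processor
    wraps the token sequence in [CLS] ... [SEP] only when
    add_special_tokens is True; otherwise the sequence is left bare."""
    if add_special_tokens:
        return [cls] + list(tokens) + [sep]
    return list(tokens)

encode(["hello", "world"])                            # ["[CLS]", "hello", "world", "[SEP]"]
encode(["hello", "world"], add_special_tokens=False)  # ["hello", "world"]
```

In the real library the flag controls whether the tokenizer's configured post-processor template is applied at all; which tokens get added depends entirely on that template.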
| # | Title | Author | State | Updated | Comments |
|---|-------|--------|-------|---------|----------|
| #1374 | add_tokens has no effect in llama fast tokenizer | tiandiweizun | closed | 8 months ago | 1 |
| #1373 | Can not load tokoenizer from_pretrained through http_proxy since 0.14.0 | jtsai-quid | closed | 5 months ago | 7 |
| #1372 | end_of_word_suffix = "</w>" no work?? | longday1102 | closed | 7 months ago | 3 |
| #1371 | fix: remove useless token | rtrompier | closed | 8 months ago | 1 |
| #1370 | Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node | dependabot[bot] | closed | 8 months ago | 0 |
| #1369 | `BPE` tokenization model does not respect custom `RegEx` via `Split` pre-tokenizer | hogru | closed | 5 months ago | 8 |
| #1368 | How can we ignore special tokens when encoding text | DOGEwbx | closed | 5 months ago | 8 |
| #1367 | Fix doc links in readme | Pierrci | closed | 6 months ago | 1 |
| #1366 | Warnings for added tokens not present in the vocab | jneuff | closed | 6 months ago | 7 |
| #1365 | cannot install with yarn & missing module in npm | MaelAbgrall | closed | 5 months ago | 6 |
| #1364 | Wrapping Tokenizer leads to version error | shivanraptor | closed | 8 months ago | 3 |
| #1363 | Difference between slow and fast GPT2 tokenizers | goerch | closed | 8 months ago | 9 |
| #1362 | When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces? | enze5088 | closed | 8 months ago | 4 |
| #1361 | Exception: Custom Normalizer cannot be serialized | shivanraptor | closed | 8 months ago | 1 |
| #1360 | Errors "Using sep_token, but it is not set yet." loading tokenizer trained from scratch | velocityCavalry | closed | 6 months ago | 3 |
| #1359 | bc3ec39d breaks the compilation (as noted in #1355) | baptisterajaut | closed | 4 months ago | 13 |
| #1358 | Different behaviour of BPE encoder after update to 0.14.1 | DOGEwbx | closed | 8 months ago | 14 |
| #1357 | [`pre_tokenizers`] Fix sentencepiece based Metaspace | ArthurZucker | closed | 7 months ago | 3 |
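Issue #1369 reports that the `BPE` model does not respect a custom `RegEx` passed via the `Split` pre-tokenizer. For context, here is a pure-Python sketch of what a `Split` pre-tokenizer with `behavior="isolated"` is expected to produce, using only the standard `re` module rather than the library itself:

```python
import re

def split_isolated(text, pattern):
    """Toy illustration of a Split pre-tokenizer with behavior='isolated':
    each regex match becomes a standalone piece, and the text between
    matches is kept as separate pieces (nothing is discarded)."""
    pieces, last = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > last:
            pieces.append(text[last:m.start()])  # text before the match
        pieces.append(m.group())                 # the match itself, isolated
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])               # trailing text
    return pieces

split_isolated("a1b22c", r"\d+")  # ["a", "1", "b", "22", "c"]
```

The library's `Split` pre-tokenizer supports other behaviors as well (e.g. `"removed"`, which drops the matches, and `"merged_with_previous"`/`"merged_with_next"`, which attach them to a neighbor); the sketch above covers only the isolated case.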