huggingface / tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0 · 8.93k stars · 779 forks
Issues (sorted newest first)
#1440 · is there a guidance to adapt tokenizers to c++ project? · 7908459077d71a548753960a12b71146 · closed 7 months ago · 5 comments
#1439 · gabrielolympie · closed 8 months ago · 0 comments
#1438 · Update release for python3.12 windows · ArthurZucker · closed 8 months ago · 1 comment
#1437 · Encode special tokens · ArthurZucker · closed 8 months ago · 1 comment
#1436 · [`remove black`] And use ruff · ArthurZucker · closed 6 months ago · 1 comment
#1435 · Prepare RC 0 · Narsil · closed 8 months ago · 1 comment
#1434 · tokenizer.train_new_from_iterator() takes time · asphytheghoul · closed 7 months ago · 2 comments
#1433 · Convert word counts to u64 · stephenroller · closed 8 months ago · 8 comments
#1431 · ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported. Getting this error when I try to run the below code: · SharathK-Tiger · closed 8 months ago · 1 comment
#1430 · Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www · dependabot[bot] · closed 8 months ago · 1 comment
#1429 · Python3.12 build for Windows is not available · ghostplant · closed 8 months ago · 3 comments
#1428 · Fix make bench. · Narsil · closed 9 months ago · 1 comment
#1426 · Profile-Guided Optimization (PGO) benchmark results · zamazan4ik · closed 7 months ago · 4 comments
#1425 · "make bench" command does not download all required resources · zamazan4ik · closed 9 months ago · 0 comments
#1424 · Decoding Issue for Latin Characters in `added_tokens` · 44670 · closed 7 months ago · 2 comments
#1423 · Possible bug in case of prepending chars in a pretokenizer · ivankrylatskoe · closed 5 months ago · 9 comments
#1422 · loading `added_tokens.json` · kczimm · closed 9 months ago · 3 comments
#1421 · Memory Leak in encode_batch Function · Atakey · closed 7 months ago · 5 comments
#1420 · Add quick doc to byte_level.rs · steventrouble · closed 9 months ago · 1 comment
#1419 · add option to skip special tokens · ArthurZucker · closed 8 months ago · 5 comments
#1418 · Unsupported platform for tokenizers · KolbySisk · closed 8 months ago · 2 comments
#1417 · Questions re: Tokenizer pipeline composability · ahgraber · closed 9 months ago · 2 comments
#1416 · ModuleNotFoundError: No module named 'tokenizers.tokenizers' · supreetkt · closed 8 months ago · 7 comments
#1415 · Support PyArrow arrays as tokenizer input · mariosasko · closed 2 months ago · 12 comments
#1414 · Faster HF dataset iteration in docs · mariosasko · closed 9 months ago · 1 comment
#1413 · Efficient Replace normalizer · rlrs · closed 8 months ago · 9 comments
#1412 · Performance of tokenizer for CLIP text model · michael-p · closed 8 months ago · 2 comments
#1410 · How to create Tokenizer.json? · kenaii · closed 8 months ago · 2 comments
#1409 · Tokenizer **not saving/loading** correctly after adding tokens, then training · dinhanhx · closed 7 months ago · 8 comments
#1408 · Special tokens will be split when there is no space before them · leizhao1234 · closed 9 months ago · 1 comment
#1407 · How to add byte_fallback tokens? · dinhanhx · opened 10 months ago · 5 comments
#1406 · Release Candidate · ArthurZucker · closed 8 months ago · 2 comments
#1405 · Tokenization is super slow when using XGLMTokenizer or XGLMTokenizerFast · jonas-klesen · closed 8 months ago · 7 comments
#1404 · Stale bot. · Narsil · closed 10 months ago · 1 comment
#1403 · Use NodeJs: Cannot find module 'tokenizers-darwin-arm64' · guotingchao · closed 5 months ago · 8 comments
#1402 · Installation error with pip install tokenizers==0.12.1 – Compatibility issue with Python 3.6.15 and Rust 1.72.0 · AhmetTasdemir · closed 8 months ago · 11 comments
#1401 · Demonstrating Sentence Truncation in Tokenization · AliHaiderAhmad001 · closed 9 months ago · 3 comments
#1400 · Another Implementation (faster and more efficient) of BPE Training Algorithm · Yikai-Liao · closed 7 months ago · 39 comments
#1399 · A whitespace character not displaying at a specific position · scissorstail · closed 10 months ago · 2 comments
#1398 · Rust tokenizer fails! · arunpatro · closed 9 months ago · 2 comments
#1397 · Integration with google/oss-fuzz for continuous fuzzing · silvergasp · closed 9 months ago · 1 comment
#1396 · fuzz: Add a BPE training fuzzer · silvergasp · closed 9 months ago · 1 comment
#1395 · train_new_from_iterator fails in non-space separated languages · frotaur · closed 8 months ago · 5 comments
#1394 · Fix: fixing the inconsistency in byte-level tokenization when using pre_tokenizer.sequence. · junrae6454 · closed 9 months ago · 1 comment
#1393 · unable to install on python 3.12 via pip · binary-husky · closed 8 months ago · 10 comments
#1392 · added_tokens with bytemap characters in ByteLevel could not be decoded correctly · DOGEwbx · closed 2 months ago · 9 comments
#1391 · How to split special token in encode? · leizhao1234 · closed 9 months ago · 5 comments
#1390 · update to version = "0.15.1-dev0" · ArthurZucker · closed 10 months ago · 1 comment
#1389 · apply_chat_template() with tokenize=False returns incorrect string · Gnurro · closed 9 months ago · 2 comments
#1388 · Release Candidate · ArthurZucker · closed 10 months ago · 1 comment