huggingface tokenizers issues

huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

https://huggingface.co/docs/tokenizers

Apache License 2.0

8.68k stars, 745 forks

issues

Potential vulnerability: Control token injection through Jinja templates in apply_chat_template

#1458 pluiez closed 2 months ago
2
Support operating computer system

#1457 Southpika closed 3 months ago
2
Bump ip from 2.0.0 to 2.0.1 in /bindings/node

#1456 dependabot[bot] closed 3 months ago
1
wish of a Summary field in the METADATA for Python tokenizers-0.15.1 wheels

#1455 stonebig closed 3 months ago
1
when compile with tch-rs library , encounter static libcpmt.lib and dynamic msvcprt.lib conflict link error

#1454 devdoer3 closed 4 months ago
2
Release candidate

#1453 ArthurZucker closed 4 months ago
1
thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22: AddedVocabulary bad split

#1452 pseudotensor opened 4 months ago
11
Accessing Tokenizer's model_max_length config

#1451 dopc closed 4 months ago
1
bugs

#1450 projecthorizon993 closed 4 months ago
5
Normalizer "replace" is quadratic in sequence length (impacts Llama 2 tokenizer)

#1449 dlwh closed 3 months ago
6
[Potential Bug] Mistral Tokenizer Inconsistencies

#1448 komninoschatzipapas closed 2 months ago
6
tokenizers.cpython-311-darwin.so wrong architecture

#1447 Casper-Mars closed 3 months ago
2
Building a tokenizer for tokenizing Java code

#1446 nimanthadilz closed 3 months ago
2
Outputting many different tokenizer vocab sizes for comparisons

#1445 pierrj closed 4 months ago
2
Added support for building an `AddedVocabulary` based on a pre-existing `AddedVocabulary`.

#1444 eaplatanios closed 3 months ago
9
Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`.

#1443 eaplatanios closed 3 months ago
7
chore: Remove CLI - this was originally intended for local development

#1442 bryantbiggs closed 4 months ago
5
chore: Update dependencies to latest supported versions

#1441 bryantbiggs closed 5 months ago
1
is there a guidance to adapt tokenizers to c++ project?

#1440 7908459077d71a548753960a12b71146 closed 4 months ago
5
/

#1439 gabrielolympie closed 5 months ago
0
Update release for python3.12 windows

#1438 ArthurZucker closed 5 months ago
1
Encode special tokens

#1437 ArthurZucker closed 5 months ago
1
[`remove black`] And use ruff

#1436 ArthurZucker closed 3 months ago
1
Prepare RC 0

#1435 Narsil closed 5 months ago
1
tokenizer.train_new_from_iterator() takes time

#1434 asphytheghoul closed 4 months ago
2
Convert word counts to u64

#1433 stephenroller closed 4 months ago
8
ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported. Getting this error when I try to run the below code:

#1431 SharathK-Tiger closed 5 months ago
1
Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www

#1430 dependabot[bot] closed 5 months ago
1
Python3.12 build for Windows is not available

#1429 ghostplant closed 5 months ago
3
Fix make bench.

#1428 Narsil closed 5 months ago
1
Profile-Guided Optimization (PGO) benchmark results

#1426 zamazan4ik closed 4 months ago
4
"make bench" command does not download all required resources

#1425 zamazan4ik closed 5 months ago
0
Decoding Issue for Latin Characters in `added_tokens`

#1424 44670 closed 4 months ago
2
Possible bug in case of prepending chars in a pretokenizer

#1423 ivankrylatskoe closed 1 month ago
9
loading `added_tokens.json`

#1422 kczimm closed 6 months ago
3
Memory Leak in encode_batch Function

#1421 Atakey closed 4 months ago
5
Add quick doc to byte_level.rs

#1420 steventrouble closed 6 months ago
1
add option to skip special tokens

#1419 ArthurZucker closed 5 months ago
5
Unsupported platform for tokenizers

#1418 KolbySisk closed 5 months ago
2
Questions re: Tokenizer pipeline composability

#1417 ahgraber closed 6 months ago
2
ModuleNotFoundError: No module named 'tokenizers.tokenizers'

#1416 supreetkt closed 5 months ago
6
Support PyArrow arrays as tokenizer input

#1415 mariosasko opened 6 months ago
11
Faster HF dataset iteration in docs

#1414 mariosasko closed 6 months ago
1
Efficient Replace normalizer

#1413 rlrs closed 4 months ago
9
Performance of tokenizer for CLIP text model

#1412 michael-p closed 5 months ago
2
How to create Tokenizer.json?

#1410 kenaii closed 5 months ago
2
Tokenizer **not saving/loading** correctly after adding tokens, then training

#1409 dinhanhx closed 4 months ago
8
Special tokens will be split when there is no space before them

#1408 leizhao1234 closed 6 months ago
1
How to add byte_fallback tokens?

#1407 dinhanhx opened 7 months ago
4
Release Candidate

#1406 ArthurZucker closed 5 months ago
2
