huggingface tokenizers issues

huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

https://huggingface.co/docs/tokenizers

Apache License 2.0

8.67k stars 743 forks

issues

Add `.editorconfig` and `rustfmt.toml` for Consistent Code Formatting

#1508 tal7aouy closed 1 month ago
1
Treatment of hyphenated words

#1507 rattle99 closed 1 month ago
3
Python Binding: Tokenizer.from_file() cannot parse JSON file of tokens

#1506 dwash96 closed 2 months ago
2
Failing to build bindings with 0.19.1

#1505 bryteise closed 2 days ago
5
add serialization for `ignore_merges`

#1504 ArthurZucker closed 2 months ago
1
corrected typo in the documentations for pre-tokenizers

#1503 GorkaUrbizu closed 1 month ago
0
offline installation

#1502 HankLiu10 closed 1 month ago
3
Extended vocab tokenizer merging text into a single string without spaces while decoding

#1501 savanth14 closed 1 week ago
4
Issue in installing rudalle on google colab, !pip install rudalle

#1500 deepanshh786 closed 1 month ago
2
Fixing doc.

#1499 Narsil closed 2 months ago
1
Bumping all versions 3 times (ty transformers :) )

#1498 Narsil closed 2 months ago
1
Remove 3.13 (potential undefined behavior.)

#1497 Narsil closed 2 months ago
1
StripAccents doesn't work

#1496 NivinaNull closed 1 month ago
1
LLamaTokenizer with `use_fast=True` / and `use_fast=False` causing memory leak when used with multiprocessing / `dataset.map(num_proc)`

#1495 michaelfeil opened 2 months ago
7
PyO3 0.21.

#1494 Narsil closed 2 months ago
1
Add more support for tiktoken based tokenizers

#1493 ArthurZucker closed 2 months ago
1
Fix unsoundness in `tokenizers::utils::parallelism`

#1492 albertsgarde closed 3 weeks ago
4
Unsound use of unsafe in `src/utils/parallelism.rs`

#1491 albertsgarde closed 1 month ago
1
Deepseeker model completely loses performance after using tokenizer.add_tokens(special_tokens)

#1490 bin123apple closed 1 month ago
2
Discrepancy Between GitHub Release and NPM Package Version & Missing Dependencies

#1489 superBertBerg opened 2 months ago
4
Fix data directory for test

#1488 atupone closed 1 month ago
1
Is it possible to pass a tokenizer from Python into Rust?

#1487 albertsgarde closed 1 month ago
2
Fix Strip decoder doc comment

#1486 jacklee1792 closed 1 month ago
0
error: casting `&T` to `&mut T` is undefined behavior

#1485 Jipok closed 1 month ago
7
Candidate release

#1484 ArthurZucker closed 2 months ago
1
fix: change var name from `vocab` to `vocab_file`

#1483 shenxiangzhuang closed 1 month ago
0
fix: typo

#1482 shenxiangzhuang closed 1 month ago
0
`BertWordPieceTokenizer` not saving with `sep_token` marked

#1481 AngledLuffa closed 2 months ago
2
tokenizers-linux-x64-musl is not found when running inside node apline docker

#1480 madhurjya-acko closed 1 month ago
2
Bump express from 4.18.1 to 4.19.2 in /tokenizers/examples/unstable_wasm/www

#1479 dependabot[bot] closed 1 month ago
2
Bump webpack-dev-middleware from 5.3.3 to 5.3.4 in /tokenizers/examples/unstable_wasm/www

#1478 dependabot[bot] closed 2 months ago
2
`cargo build` fails for python bindings when `--locked` is passed for `v0.15.1` and `v0.15.2`

#1477 CobaltCause closed 1 month ago
4
Refactor metaspace

#1476 ArthurZucker closed 3 months ago
7
Issue merging across whitespaces

#1475 henrycharlesworth closed 2 months ago
2
BPE Decoder cleanup option

#1474 w-zygmuntowicz closed 2 months ago
2
Assign `<unusedXX>` tokens with `special_tokens` without growing vocab size

#1473 jacobwjs opened 3 months ago
4
Bump follow-redirects from 1.15.4 to 1.15.6 in /tokenizers/examples/unstable_wasm/www

#1472 dependabot[bot] closed 2 months ago
2
Train tokenizer on integer lists, not strings

#1471 rteehas opened 3 months ago
6
Tokens display issues

#1470 jordane95 closed 2 months ago
2
How to load tokenizer trained by sentencepiece or tiktoken

#1469 jordane95 closed 2 months ago
5
How to convert tokenizers.tokenizer to XXTokenizerFast in transformers?

#1468 rangehow opened 3 months ago
3
New Update causes add_special_tokens not recognized

#1466 sravell closed 2 months ago
4
Update pyproject.toml

#1465 stonebig closed 2 months ago
0
Tokenizer dataset is very slow

#1464 ManuSinghYadav closed 2 months ago
2
different output of AutoTokenizer from that of T5tokenizer

#1463 sm745052 closed 4 months ago
1
feat: support custom regexes for GPT pre-tokenizer

#1462 gcampax opened 4 months ago
5
BpeTrainer seems to ignore max_token_length=1

#1461 geajack closed 2 months ago
2
Training a tokenizer with limited memory

#1460 arxyzan closed 2 months ago
3
build(node): Include binaries in NPM packing

#1459 aaronclong closed 2 months ago
10
Potential vulnerability: Control token injection through Jinja templates in apply_chat_template

#1458 pluiez closed 2 months ago
2
