karpathy minbpe issues - Githubissues

karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

MIT License

9.19k stars, 866 forks

issues


One problem in the annotations of `test_wikipedia_example` in the `tests/test_tokenizer` file

#93 donglinkang2021 opened 2 days ago
0
LLM is worse at non-English languages

#92 7CD opened 1 week ago
0
Add pytest to requirements.txt

#91 shoumikhin opened 3 weeks ago
0
add lexicographic ordering for breaking ties to make the tokenizer deterministic

#90 dapopov-st opened 1 month ago
0
Add pyproject toml

#89 gianlucagiudice opened 2 months ago
0
updating stats across merge to reduce computation

#88 imdaredevil opened 2 months ago
0
Question about Encoder Logic

#87 JackxTong opened 4 months ago
3
Update README.md

#86 wenming-ma opened 4 months ago
0
Python API with C extensions for faster training and encoding

#85 benarnav opened 4 months ago
0
Optimal algorithm for _encode_chunk(): 20% faster encoding, with 0.5% better COMPRESSION

#84 Majdoddin opened 5 months ago
0
add running results

#83 copyrightly closed 5 months ago
0
Deduplication of text chunks with frequency count, training and encoding 5x speedup

#82 Majdoddin opened 5 months ago
0
LLM as calc

#81 michaelshekasta opened 5 months ago
0
OSS-Fuzz Integration

#80 ennamarie19 opened 5 months ago
0
BPE in Haskell

#79 BobMcDear opened 5 months ago
0
Add Optimized BatchTokenizer, Leave Others Unchanged

#78 alexandermorgan closed 4 months ago
7
What to support GPT-4O tokenizer？

#77 echo-valor opened 6 months ago
0
calling len(ids) in merge() function only once to increase performance

#76 crpatil1901 opened 6 months ago
0
Link to Mojo port added

#75 dorjeduck opened 6 months ago
0
Notebook Issue In Google Colab

#74 kelixirr opened 6 months ago
0
The regular expressions break all scripts with combining marks in the middle of the syllable

#73 ajaykg opened 6 months ago
3
Count only nonoverlapping occurrences of a pair

#72 Majdoddin opened 6 months ago
0
Update regex.py to correctly parse scripts with combining marks

#71 ajaykg opened 6 months ago
5
Amplifying your courses with my digital notes

#70 AayushSameerShah opened 6 months ago
0
Instead of finding the one pair with the highest frequency and merging it at each step, do the highest N pairs

#69 hippietrail opened 7 months ago
2
Using graph contraction for tokenization

#68 sramshetty closed 7 months ago
0
add `gnp/minbpe-rs` as community extensions in `README.md`

#67 shubham0204 closed 7 months ago
0
`minbpe-rs`: A pure Rust implementation of `minbpe`

#66 shubham0204 opened 7 months ago
2
Much faster Regex tokenization using c++ and ctypes

#65 JohannesVod opened 7 months ago
2
decode() method in GPT4Tokenizer does not handle special tokens

#64 Vakarva opened 7 months ago
0
Updated decode() method in GPT4Tokenizer so that it handles special t…

#63 Vakarva opened 7 months ago
0
[Test workflow] Add test workflow

#62 sayakpaul closed 3 months ago
2
Would using prompts that contain concatenated words to reduce token count negatively affect results

#61 hatgit opened 7 months ago
0
Implementation of LlamaTokenizer (without sentencepiece)

#60 MaveriQ opened 7 months ago
0
"regex.py" file name conflict

#59 mogomaa79 opened 7 months ago
0
Huggingface already has an efficient implementation of this?

#58 laurislopata opened 8 months ago
3
_

#57 momonga-ml closed 8 months ago
0
Using minBPE token encoded sentence vectors need to be padded

#56 elevateclub opened 8 months ago
1
Improved optimization strategy for 'Merge'

#55 hvaria closed 8 months ago
2
Handle error when running out of pairs to merge

#54 vinhdq842 opened 8 months ago
0
updated self.vocab initialization and reuse self._build_vocab()

#53 muerghq opened 8 months ago
0
add error handling for load method

#52 halannhile closed 7 months ago
0
counting pairs is inaccurate for repeating tokens?

#51 JohannesVod opened 8 months ago
5
Alternative to bpe

#50 marcov-dart opened 8 months ago
16
Video2Post Generation Workflow

#49 xihajun opened 8 months ago
3
Optimizing minbpe to also support video tokenization (extract low-dimensional latent patches from video frames)

#48 Jaykef closed 7 months ago
1
Need train_multi.py example to show use with multiple input files

#47 gnp closed 7 months ago
3
Minor improvement in the `base.py`

#46 konan009 closed 8 months ago
1
A thanks from self-learners community

#45 IamExperimenting opened 8 months ago
0
how to deal with special tokens for multiple files

#44 IamExperimenting opened 8 months ago
0