karpathy/minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
9.19k stars
866 forks
issues
One problem in the annotations of `test_wikipedia_example` in the `tests/test_tokenizer` file
#93
donglinkang2021
opened
2 days ago
0
LLM is worse at non-English languages
#92
7CD
opened
1 week ago
0
Add pytest to requirements.txt
#91
shoumikhin
opened
3 weeks ago
0
add lexicographic ordering for breaking ties to make the tokenizer deterministic
#90
dapopov-st
opened
1 month ago
0
Add pyproject.toml
#89
gianlucagiudice
opened
2 months ago
0
updating stats across merge to reduce computation
#88
imdaredevil
opened
2 months ago
0
Question about Encoder Logic
#87
JackxTong
opened
4 months ago
3
Update README.md
#86
wenming-ma
opened
4 months ago
0
Python API with C extensions for faster training and encoding
#85
benarnav
opened
4 months ago
0
Optimal algorithm for _encode_chunk(): 20% faster encoding, with 0.5% better COMPRESSION
#84
Majdoddin
opened
5 months ago
0
add running results
#83
copyrightly
closed
5 months ago
0
Deduplication of text chunks with frequency count, training and encoding 5x speedup
#82
Majdoddin
opened
5 months ago
0
LLM as calc
#81
michaelshekasta
opened
5 months ago
0
OSS-Fuzz Integration
#80
ennamarie19
opened
5 months ago
0
BPE in Haskell
#79
BobMcDear
opened
5 months ago
0
Add Optimized BatchTokenizer, Leave Others Unchanged
#78
alexandermorgan
closed
4 months ago
7
Want to support the GPT-4o tokenizer?
#77
echo-valor
opened
6 months ago
0
calling len(ids) in merge() function only once to increase performance
#76
crpatil1901
opened
6 months ago
0
Link to Mojo port added
#75
dorjeduck
opened
6 months ago
0
Notebook Issue In Google Colab
#74
kelixirr
opened
6 months ago
0
The regular expressions break all scripts with combining marks in the middle of the syllable
#73
ajaykg
opened
6 months ago
3
Count only nonoverlapping occurrences of a pair
#72
Majdoddin
opened
6 months ago
0
Update regex.py to correctly parse scripts with combining marks
#71
ajaykg
opened
6 months ago
5
Amplifying your courses with my digital notes
#70
AayushSameerShah
opened
6 months ago
0
Instead of finding the one pair with the highest frequency and merging it at each step, do the highest N pairs
#69
hippietrail
opened
7 months ago
2
Using graph contraction for tokenization
#68
sramshetty
closed
7 months ago
0
add `gnp/minbpe-rs` as community extensions in `README.md`
#67
shubham0204
closed
7 months ago
0
`minbpe-rs`: A pure Rust implementation of `minbpe`
#66
shubham0204
opened
7 months ago
2
Much faster Regex tokenization using c++ and ctypes
#65
JohannesVod
opened
7 months ago
2
decode() method in GPT4Tokenizer does not handle special tokens
#64
Vakarva
opened
7 months ago
0
Updated decode() method in GPT4Tokenizer so that it handles special t…
#63
Vakarva
opened
7 months ago
0
[Test workflow] Add test workflow
#62
sayakpaul
closed
3 months ago
2
Would using prompts that contain concatenated words to reduce token count negatively affect results
#61
hatgit
opened
7 months ago
0
Implementation of LlamaTokenizer (without sentencepiece)
#60
MaveriQ
opened
7 months ago
0
"regex.py" file name conflict
#59
mogomaa79
opened
7 months ago
0
Huggingface already has an efficient implementation of this?
#58
laurislopata
opened
8 months ago
3
_
#57
momonga-ml
closed
8 months ago
0
Using minBPE token encoded sentence vectors need to be padded
#56
elevateclub
opened
8 months ago
1
Improved optimization strategy for 'Merge'
#55
hvaria
closed
8 months ago
2
Handle error when running out of pairs to merge
#54
vinhdq842
opened
8 months ago
0
updated self.vocab initialization and reuse self._build_vocab()
#53
muerghq
opened
8 months ago
0
add error handling for load method
#52
halannhile
closed
7 months ago
0
counting pairs is inaccurate for repeating tokens?
#51
JohannesVod
opened
8 months ago
5
Alternative to bpe
#50
marcov-dart
opened
8 months ago
16
Video2Post Generation Workflow
#49
xihajun
opened
8 months ago
3
Optimizing minbpe to also support video tokenization (extract low-dimensional latent patches from video frames)
#48
Jaykef
closed
7 months ago
1
Need train_multi.py example to show use with multiple input files
#47
gnp
closed
7 months ago
3
Minor improvement in the `base.py`
#46
konan009
closed
8 months ago
1
A thanks from self-learners community
#45
IamExperimenting
opened
8 months ago
0
how to deal with special tokens for multiple files
#44
IamExperimenting
opened
8 months ago
0