shaigue/pmi_masking
This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.
MIT License
1 star, 0 forks
issues
added word-level tokenization support
#34
shaigue
closed
1 year ago
0
Add support for RedPajama
#33
shaigue
opened
1 year ago
0
is having a primary key in the table actually necessary / helpful?
#32
shaigue
opened
1 year ago
0
extract the specs of the remote environments from the logs
#31
shaigue
opened
1 year ago
0
Add support for word-level tokenization
#30
shaigue
opened
1 year ago
0
deal with Wikipedia bug
#29
shaigue
opened
1 year ago
1
write a blog post
#28
shaigue
opened
1 year ago
0
Try to optimize `count_ngrams_in_batches`
#27
shaigue
opened
1 year ago
0
autotune batch sizes according to system spec
#26
shaigue
opened
1 year ago
0
run medium bookcorpus on a Linux system to verify that everything works, recover logging, and analyze the output
#25
shaigue
opened
1 year ago
0
try to disentangle the dataset loading from my code, so that anyone can provide their own dataset
#24
shaigue
opened
1 year ago
0
set up and run tests on a remote Linux machine
#23
shaigue
opened
1 year ago
0
Shai/refactor main entrance script
#22
shaigue
closed
1 year ago
0
add checkpoints to save intermediate results in case of unexpected shutdown
#21
shaigue
opened
1 year ago
0
add random sampling support?
#20
shaigue
opened
1 year ago
0
Code review for create_pmi_masking_vocab.py
#19
shaigue
opened
1 year ago
0
write a descriptive README.md
#18
shaigue
opened
1 year ago
0
Try to optimize `aggregate_ngram_counts`
#17
shaigue
opened
1 year ago
2
document the performance results
#16
shaigue
opened
1 year ago
0
remove redundant files from the repo
#15
shaigue
opened
1 year ago
1
integration with LLM training code
#14
shaigue
opened
1 year ago
1
Visualize the flow of the program. This can help people understand what is going on
#13
shaigue
opened
1 year ago
0
pruning & sampling strategies to handle large amounts of data
#12
shaigue
opened
1 year ago
0
create pmi masking vocabulary for RedPajama
#11
shaigue
opened
1 year ago
0
reproduce results on wiki+bookcorpus
#10
shaigue
opened
1 year ago
1
add remote logging/monitoring
#9
shaigue
opened
1 year ago
0
add progress bar for the different stages
#8
shaigue
opened
1 year ago
0
find a way to test this code on Linux before handing it out
#7
shaigue
opened
1 year ago
0
add support for different dataset structures
#6
shaigue
opened
1 year ago
0
add description of logging messages to the README.md
#5
shaigue
opened
1 year ago
0
add description of the main entrance script to README.md
#4
shaigue
opened
1 year ago
0
figure out how to do code review
#3
shaigue
opened
1 year ago
0
main entrance script with all arguments from command line
#2
shaigue
opened
1 year ago
0
finer-grained control over which dataset to load
#1
shaigue
opened
1 year ago
0