togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.53k stars
346 forks
issues
ArXiv cleaning issue
#68
hicotton02
closed
11 months ago
1
Failed building wheel for cc-net
#67
hicotton02
closed
11 months ago
2
Unlock open science for dataset generation
#66
AbcSxyZ
opened
1 year ago
0
I got an issue when I use fasttext doing arxiv cleaning.
#65
tangtianyi1998
opened
1 year ago
1
We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet
#64
shawn0wang
closed
1 month ago
1
How do I prepare the data for visualisation?
#63
dittops
opened
1 year ago
2
Understanding the quality filter
#62
yonatanbitton
closed
1 month ago
5
Fine tuning RedPajama Model
#61
adjhawar
closed
1 month ago
1
Single machine download script and downloaded files check
#60
MIracleyin
opened
1 year ago
0
Add arXiv download script and downloaded file check
#59
MIracleyin
closed
1 year ago
0
result not contain raw content
#58
newbietuan
closed
1 year ago
0
What is the cutoff.csv file mentioned in data_prep/cc/cc_net/cc_net/mine.py?
#57
julienliang2740
closed
1 year ago
2
Fixed data preparation for stack_exchange
#56
Taishi-N324
opened
1 year ago
0
Memory requirement for book deduplication?
#55
HYLcool
closed
1 year ago
2
error while download from url
#54
guozhiyao
closed
1 month ago
1
No file named github-prepare-local-dedup.sh
#53
feverdreamy
closed
1 year ago
1
Drive space to store
#52
tstandley
closed
1 year ago
1
how to process arXiv tex files without downloading?
#51
irene622
closed
1 year ago
1
The left portion of the dataset after each process
#50
kimcando
opened
1 year ago
0
Any forecast for the release of v2 of the dataset?
#49
vince62s
closed
1 month ago
0
Overlap between Common Crawl and C4
#48
codesoap
closed
1 year ago
6
how much disk memory will be used?
#47
newbietuan
opened
1 year ago
3
Language diversity
#46
averkij
closed
1 year ago
3
Partially downloaded datasets
#45
soboleva-daria
closed
1 month ago
1
The training data for Quality Classifier
#44
Anery
closed
1 year ago
4
about download a small portion of cc
#43
newbietuan
closed
1 year ago
2
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 2556: invalid start byte
#42
Anery
closed
1 year ago
1
fix bugs in hash using getpy
#41
feifeibear
closed
1 year ago
0
Question about Size Difference of arXiv Data in RedPajama and AWS S3
#40
Mr-Philo
closed
1 year ago
2
Common Crawl metadata
#39
mauriceweber
closed
1 month ago
0
add missing files to run cc_net with a given config
#38
feifeibear
opened
1 year ago
0
Questions about the quality classifier in common crawl
#37
ladit
closed
1 month ago
8
fixed some errors in Makefile for lm preparation
#36
feifeibear
opened
1 year ago
2
Expected finish time for processing one single index of commoncrawl?
#35
kimcando
closed
1 month ago
4
EOFError: Compressed file ended before the end-of-stream marker was reached
#34
kimcando
closed
1 year ago
2
Memory and space requirements
#33
zetian1025
closed
1 year ago
2
Improve CC readme and add URL to download classifier weights
#32
Ivan-Zhou
closed
1 year ago
0
How can you map the common crawl source back to metadata?
#31
craigschmidt
closed
1 year ago
2
If the program exit with the outside cause
#30
1787648106
closed
1 year ago
1
fix clean copyright
#29
hust-nj
opened
1 year ago
2
[Errno 2] No such file or directory: 'cutoff.csv'
#28
Anery
closed
1 year ago
5
Add missing script or update README.md
#27
geoffreydstewart
closed
1 year ago
3
How the 5 dumps of Common Crawl are selected?
#26
Stanislas0
closed
1 year ago
2
Script fixes in data_prep/github
#25
geoffreydstewart
closed
1 year ago
1
Where is the FastText pretrained model to classify each CommonCrawl webpage?
#24
yuhai-china
closed
1 year ago
4
Got error while running `python -m cc_net -l my -l gu`
#23
tiendung
closed
1 year ago
9
Please consider adding a source of natural dialogue data
#22
wassname
closed
1 year ago
2
`No module named 'datasets'` in `data_prep/book/`
#21
danielpclark
closed
1 year ago
5
Fix typo in github/README.md
#20
eltociear
closed
1 year ago
1
will there be a trained model?
#19
rozek
closed
1 year ago
2