togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.59k stars
350 forks
issues
Number | Title | Author | Status | When | Comments
#120 | Estimated Cost for Arxiv Download | amy-hyunji | closed | 2 weeks ago | 0
#119 | Adding linear ticket requirement to PRs | TechnoFinch | closed | 1 month ago | 0
#118 | Filtering on Document Length | karan-dalal | opened | 3 months ago | 0
#117 | Update README.md | mauriceweber | closed | 3 months ago | 0
#116 | Inquiry About Character-Level Basis of Duplication Calculation | luc1fer3 | opened | 4 months ago | 1
#115 | Exact dedup details | jordane95 | opened | 5 months ago | 1
#114 | what does the prefix "rps_" mean? | bpwl0121 | closed | 7 months ago | 2
#113 | slow transfer speeds from URL sources | axelmagn | opened | 8 months ago | 5
#112 | Difference between RedPajama-Data-1T, RedPajama-Data-V2, RedPajama-Data-V1 | konradipipan | opened | 8 months ago | 1
#111 | Inconsistent IDs lead to distributed computing woes. | axelmagn | opened | 8 months ago | 1
#110 | Spanish artifact building error | hicotton02 | opened | 8 months ago | 2
#109 | Update README.md | mauriceweber | opened | 8 months ago | 0
#108 | Potential Language Contamination Inquiry | iBibek | opened | 8 months ago | 1
#107 | Step 2) "Invalid option: ---input_base_uri" | timpal0l | opened | 8 months ago | 1
#106 | What purpose cutoff.csv used in the cc_net pipeline? | kemalbastak | opened | 9 months ago | 2
#105 | About the final result | Jdemon233 | opened | 9 months ago | 2
#104 | Running the pipeline on cloud or a big data platform | zllai | opened | 9 months ago | 1
#103 | Running full pipeline on a small part of CC | zhentingqi | opened | 9 months ago | 0
#102 | Unavailable Parameters | zhentingqi | opened | 9 months ago | 0
#101 | Invalid uri: ParseResult(...) must be of the form s3://<bucket>/<key> or file://<path> | timpal0l | closed | 10 months ago | 0
#100 | Recommended way to load wget-downloaded data using HF datasets API? | zijwang | opened | 10 months ago | 1
#99 | what's the specific meaning of dsir? | BBetteroff | opened | 10 months ago | 4
#98 | Is there a specific meaning of the snapshot id? | zijwang | closed | 10 months ago | 2
#97 | possibly missing shard from host | sagnak | closed | 10 months ago | 2
#96 | What is the output of `run_lsh.py`? | virendrakabra14 | closed | 10 months ago | 8
#95 | Are shards randomly created? | virendrakabra14 | closed | 10 months ago | 1
#94 | Impossible unpack tail data... took time to download, but impossible to unpack dataset without quality signals with broken link. | RuslanKovalyov | closed | 10 months ago | 1
#93 | Other language data | Dzg0309 | opened | 11 months ago | 4
#92 | Thresholds for all quality signals | torshie | opened | 11 months ago | 2
#91 | Train a new wikiref model | torshie | closed | 10 months ago | 1
#90 | Local data | mauriceweber | closed | 12 months ago | 0
#89 | Low Data Downloading Speed | lipingtang17 | closed | 10 months ago | 1
#88 | Token counts | timsueberkrueb | opened | 1 year ago | 2
#87 | where should I go to get the file about "domain_to_category_id.json"? | suolyer | closed | 1 year ago | 0
#86 | regarding to quality classifier | kimcando | opened | 1 year ago | 2
#85 | Update README.md | mauriceweber | closed | 1 year ago | 0
#84 | Deduplicated version of RedPajama-v2 | joao-alves97 | closed | 10 months ago | 4
#83 | Request: Enable artifact prep on local data | hicotton02 | closed | 12 months ago | 1
#82 | Invalid argument when running cc_net | Practicinginhell | opened | 1 year ago | 2
#81 | How is the SHA1 digest computed? | RicardoDominguez | closed | 1 year ago | 2
#80 | Executing V2 issues | hicotton02 | opened | 1 year ago | 6
#79 | regarding to deduplication | kimcando | opened | 1 year ago | 6
#78 | cc_net processing local wet file | hicotton02 | closed | 1 year ago | 1
#77 | quality_signals, minhash and duplicates missing for tail | Sheshansh | closed | 1 year ago | 1
#76 | New Features | zhangce | opened | 1 year ago | 0
#75 | doc(README): remove typo | Deep145757 | closed | 1 year ago | 1
#74 | Issue on book datasets download | beccabai | opened | 1 year ago | 2
#73 | fixes minor typo in data prep README | jspeis | opened | 1 year ago | 0
#72 | cc-net failure on slurm cluster | hicotton02 | closed | 1 year ago | 0
#71 | Specifying arxiv dates | matthieumeeus | opened | 1 year ago | 1