togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.53k stars
346 forks
issues
Filtering on Document Length | #118 | karan-dalal | opened 1 month ago | 0
Update README.md | #117 | mauriceweber | closed 1 month ago | 0
Inquiry About Character-Level Basis of Duplication Calculation | #116 | luc1fer3 | opened 2 months ago | 1
Exact dedup details | #115 | jordane95 | opened 4 months ago | 1
what does the prefix "rps_" mean? | #114 | bpwl0121 | closed 5 months ago | 2
slow transfer speeds from URL sources | #113 | axelmagn | opened 6 months ago | 5
Difference between RedPajama-Data-1T, RedPajama-Data-V2, RedPajama-Data-V1 | #112 | konradipipan | opened 6 months ago | 1
Inconsistent IDs lead to distributed computing woes. | #111 | axelmagn | opened 6 months ago | 1
Spanish artifact building error | #110 | hicotton02 | opened 6 months ago | 2
Update README.md | #109 | mauriceweber | opened 6 months ago | 0
Potential Language Contamination Inquiry | #108 | iBibek | opened 6 months ago | 1
Step 2) "Invalid option: ---input_base_uri" | #107 | timpal0l | opened 6 months ago | 1
What purpose cutoff.csv used in the cc_net pipeline? | #106 | kemalbastak | opened 7 months ago | 2
About the final result | #105 | Jdemon233 | opened 7 months ago | 2
Running the pipeline on cloud or a big data platform | #104 | zllai | opened 7 months ago | 1
Running full pipeline on a small part of CC | #103 | zhentingqi | opened 7 months ago | 0
Unavailable Parameters | #102 | zhentingqi | opened 7 months ago | 0
Invalid uri: ParseResult(...) must be of the form s3://<bucket>/<key> or file://<path> | #101 | timpal0l | closed 8 months ago | 0
Recommended way to load wget-downloaded data using HF datasets API? | #100 | zijwang | opened 8 months ago | 1
what's the specific meaning of dsir? | #99 | BBetteroff | opened 8 months ago | 4
Is there a specific meaning of the snapshot id? | #98 | zijwang | closed 8 months ago | 2
possibly missing shard from host | #97 | sagnak | closed 8 months ago | 2
What is the output of `run_lsh.py`? | #96 | virendrakabra14 | closed 8 months ago | 8
Are shards randomly created? | #95 | virendrakabra14 | closed 8 months ago | 1
Impossible unpack tail data... took time to download, but impossible to unpack dataset without quality signals with broken link. | #94 | RuslanKovalyov | closed 8 months ago | 1
Other language data | #93 | Dzg0309 | opened 9 months ago | 4
Thresholds for all quality signals | #92 | torshie | opened 9 months ago | 2
Train a new wikiref model | #91 | torshie | closed 8 months ago | 1
Local data | #90 | mauriceweber | closed 10 months ago | 0
Low Data Downloading Speed | #89 | lipingtang17 | closed 8 months ago | 1
Token counts | #88 | timsueberkrueb | opened 10 months ago | 2
where should I go to get the file about "domain_to_category_id.json"? | #87 | suolyer | closed 10 months ago | 0
regarding to quality classifier | #86 | kimcando | opened 10 months ago | 2
Update README.md | #85 | mauriceweber | closed 10 months ago | 0
Deduplicated version of RedPajama-v2 | #84 | joao-alves97 | closed 8 months ago | 4
Request: Enable artifact prep on local data | #83 | hicotton02 | closed 10 months ago | 1
Invalid argument when running cc_net | #82 | Practicinginhell | opened 10 months ago | 2
How is the SHA1 digest computed? | #81 | RicardoDominguez | closed 10 months ago | 2
Executing V2 issues | #80 | hicotton02 | opened 11 months ago | 6
regarding to deduplication | #79 | kimcando | opened 11 months ago | 6
cc_net processing local wet file | #78 | hicotton02 | closed 11 months ago | 1
quality_signals, minhash and duplicates missing for tail | #77 | Sheshansh | closed 10 months ago | 1
New Features | #76 | zhangce | opened 11 months ago | 0
doc(README): remove typo | #75 | Deep145757 | closed 11 months ago | 1
Issue on book datasets download | #74 | beccabai | opened 11 months ago | 2
fixes minor typo in data prep README | #73 | jspeis | opened 1 year ago | 0
cc-net failure on slurm cluster | #72 | hicotton02 | closed 11 months ago | 0
Specifying arxiv dates | #71 | matthieumeeus | opened 1 year ago | 1
What does "default" do in `load_dataset('togethercomputer/RedPajama-Data-1T', "default")`? | #70 | brando90 | opened 1 year ago | 3
Q: Why does RePajama exist? what problem are you solving? | #69 | brando90 | opened 1 year ago | 1