togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars
335 forks
issues
Newest
Exact dedup details
#115
jordane95
opened
3 weeks ago
1
what does the prefix "rps_" mean?
#114
bpwl0121
closed
1 month ago
2
slow transfer speeds from URL sources
#113
axelmagn
opened
2 months ago
5
Difference between RedPajama-Data-1T, RedPajama-Data-V2, RedPajama-Data-V1
#112
konradipipan
opened
3 months ago
1
Inconsistent IDs lead to distributed computing woes.
#111
axelmagn
opened
3 months ago
1
Spanish artifact building error
#110
hicotton02
opened
3 months ago
2
Update README.md
#109
mauriceweber
opened
3 months ago
0
Potential Language Contamination Inquiry
#108
iBibek
opened
3 months ago
1
Step 2) "Invalid option: ---input_base_uri"
#107
timpal0l
opened
3 months ago
1
What purpose cutoff.csv used in the cc_net pipeline?
#106
kemalbastak
opened
4 months ago
2
About the final result
#105
Jdemon233
opened
4 months ago
2
Running the pipeline on cloud or a big data platform
#104
zllai
opened
4 months ago
1
Running full pipeline on a small part of CC
#103
zhentingqi
opened
4 months ago
0
Unavailable Parameters
#102
zhentingqi
opened
4 months ago
0
Invalid uri: ParseResult(...) must be of the form s3://<bucket>/<key> or file://<path>
#101
timpal0l
closed
5 months ago
0
Recommended way to load wget-downloaded data using HF datasets API?
#100
zijwang
opened
5 months ago
1
what's the specific meaning of dsir?
#99
BBetteroff
opened
5 months ago
4
Is there a specific meaning of the snapshot id?
#98
zijwang
closed
5 months ago
2
possibly missing shard from host
#97
sagnak
closed
5 months ago
2
What is the output of `run_lsh.py`?
#96
virendrakabra14
closed
5 months ago
8
Are shards randomly created?
#95
virendrakabra14
closed
5 months ago
1
Impossible unpack tail data... took time to download, but impossible to unpack dataset without quality signals with broken link.
#94
RuslanKovalyov
closed
5 months ago
1
Other language data
#93
Dzg0309
opened
6 months ago
4
Thresholds for all quality signals
#92
torshie
opened
6 months ago
2
Train a new wikiref model
#91
torshie
closed
5 months ago
1
Local data
#90
mauriceweber
closed
6 months ago
0
Low Data Downloading Speed
#89
lipingtang17
closed
5 months ago
1
Token counts
#88
timsueberkrueb
opened
7 months ago
2
where should I go to get the file about "domain_to_category_id.json"?
#87
suolyer
closed
7 months ago
0
regarding to quality classifier
#86
kimcando
opened
7 months ago
2
Update README.md
#85
mauriceweber
closed
7 months ago
0
Deduplicated version of RedPajama-v2
#84
joao-alves97
closed
5 months ago
4
Request: Enable artifact prep on local data
#83
hicotton02
closed
6 months ago
1
Invalid argument when running cc_net
#82
Practicinginhell
opened
7 months ago
2
How is the SHA1 digest computed?
#81
RicardoDominguez
closed
7 months ago
2
Executing V2 issues
#80
hicotton02
opened
7 months ago
6
regarding to deduplication
#79
kimcando
opened
7 months ago
6
cc_net processing local wet file
#78
hicotton02
closed
7 months ago
1
quality_signals, minhash and duplicates missing for tail
#77
Sheshansh
closed
7 months ago
1
New Features
#76
zhangce
opened
7 months ago
0
doc(README): remove typo
#75
Deep145757
closed
7 months ago
1
Issue on book datasets download
#74
beccabai
opened
8 months ago
2
fixes minor typo in data prep README
#73
jspeis
opened
9 months ago
0
cc-net failure on slurm cluster
#72
hicotton02
closed
7 months ago
0
Specifying arxiv dates
#71
matthieumeeus
opened
10 months ago
1
What does "default" do in `load_dataset('togethercomputer/RedPajama-Data-1T', "default")`?
#70
brando90
opened
10 months ago
3
Q: Why does RePajama exist? what problem are you solving?
#69
brando90
opened
10 months ago
1
ArXiv cleaning issue
#68
hicotton02
closed
7 months ago
1
Failed building wheel for cc-net
#67
hicotton02
closed
7 months ago
2
Unlock open science for dataset generation
#66
AbcSxyZ
opened
11 months ago
0