hicotton02 opened this issue 8 months ago
Since the new version came out, I have been trying to get things working. Here are a couple of issues that I ran into and resolved:
- needed s5cmd, so I had to install conda and then s5cmd
- installed rootless docker, although networking is unavailable, so for now I am running docker as root
- default.conf is missing lines for the AWS Secret and ID. Added them, no problem.
When running the command, which I modified to run in my environment (Ubuntu 22.04 WSL2), I get the following error:
Is my listings parameter correct, or is there some other issue?
I can verify that len(inputs_by_snapsh) is 0
edit: I seem to not have the listings or the S3 bucket info correct. Is it possible to get an example of a listings.txt and config? Here is mine:
S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com" S3_BUCKET="red-pajama" S3_CCNET_PREFIX="/rs_cc_net" S3_PROFILE="default"
DOCKER_S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com" DOCKER_MNT_DIR="/mnt/data" DOCKER_REPO="theskaz/red-pajama"
Does this look right?
Hi @hicotton02, thanks for your question!
Regarding "default.conf is missing lines for the AWS Secret and ID": we deliberately left these out so that users specify them via
export AWS_SECRET...
(this reduces the risk of uploading access keys to GitHub).
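For example (a minimal sketch; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS credential environment variables, shown here with placeholder values):
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"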
Once the environment variables are specified, you can get the listings via
s5cmd --profile "$S3_PROFILE" --endpoint-url "$S3_ENDPOINT_URL" \
ls "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}/*" | grep "\.json\.gz$" | awk '{print $NF}' >"${LISTINGS_FILE}"
which should produce a file with contents of the form:
2014-15/0000/en_head.json.gz
2014-15/0000/en_middle.json.gz
2014-15/0001/en_head.json.gz
2014-15/0001/en_middle.json.gz
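As a quick sanity check (using the same variable names as above), you can verify that the listings file is non-empty before running the pipeline. Note that for the concatenation in the ls command to form a valid S3 URI, S3_BUCKET presumably needs to include the s3:// scheme (e.g. S3_BUCKET="s3://red-pajama") -- this is an assumption based on the command above, not something stated elsewhere in this thread.
wc -l "${LISTINGS_FILE}"      # should print a non-zero line count
head -n 3 "${LISTINGS_FILE}"  # should show paths like 2014-15/0000/en_head.json.gz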
Let me know if this helps! :)
Thank you so much for the response.
Is the s5cmd command supposed to point to my own s3 bucket or someone else's?
I created a bucket, but it is blank at this time. I remember that in V1 we were downloading data from (I think) Arxiv's bucket.
edit: As part of this workstream, do we download the ccnet data separately (I see their repo has been archived)?
There is no data that needs to be pulled from an external S3 bucket -- only your own bucket where you have the ccnet output stored. It is also only required for creating the artifacts. Are you creating your own artifacts for a custom dataset, or are you trying to reproduce the quality signals we have provided?
You can download the ccnet output from the public URLs (https://data.together.xyz/redpajama-data-v2/v1.0.0/) and then upload it to your own S3 bucket. Also check out the huggingface repo here, which contains instructions on how to download the data.
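For instance, a minimal sketch of that workflow for a single file (the snapshot and shard names are illustrative, and the destination reuses the S3_* variables from the config above):
# fetch one document shard from the public mirror
wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/documents/2014-15/0000/en_head.json.gz"
# upload it to your own bucket under the ccnet prefix
s5cmd --profile "$S3_PROFILE" --endpoint-url "$S3_ENDPOINT_URL" \
    cp "en_head.json.gz" "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}/2014-15/0000/en_head.json.gz"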
I am going to end up doing both. Right now I am just learning how you did all this. Once I have that done and have some understanding of what is going on, I want to add and remove data to see how that affects everything. I am in my master's program for AI/ML and am using this to learn in addition to what I am learning in school.
I wrote a Python script to download all the ccnet data based on your links above. It does this in parallel and is basic; it saturated my connection and server to get the most efficient process going.
import os
import subprocess
import multiprocessing as mp
CC_SNAPSHOT_IDS = [
"2014-15",
"2014-23",
"2014-35",
"2014-41",
"2014-42",
"2014-49",
"2014-52",
"2015-14",
"2015-22",
"2015-27",
"2015-32",
"2015-35",
"2015-40",
"2015-48",
"2016-07",
"2016-18",
"2016-22",
"2016-26",
"2016-30",
"2016-36",
"2016-40",
"2016-44",
"2016-50",
"2017-04",
"2017-09",
"2017-17",
"2017-22",
"2017-26",
"2017-30",
"2017-34",
"2017-39",
"2017-43",
"2017-47",
"2017-51",
"2018-05",
"2018-09",
"2018-13",
"2018-17",
"2018-22",
"2018-26",
"2018-30",
"2018-34",
"2018-39",
"2018-43",
"2018-47",
"2018-51",
"2019-04",
"2019-09",
"2019-13",
"2019-18",
"2019-22",
"2019-26",
"2019-30",
"2019-35",
"2019-39",
"2019-43",
"2019-47",
"2019-51",
"2020-05",
"2020-10",
"2020-16",
"2020-24",
"2020-29",
"2020-34",
"2020-40",
"2020-45",
"2020-50",
"2021-04",
"2021-10",
"2021-17",
"2021-21",
"2021-25",
"2021-31",
"2021-39",
"2021-43",
"2021-49",
"2022-05",
"2022-21",
"2022-27",
"2022-33",
"2022-40",
"2022-49",
"2023-06",
"2023-14"
]
def download_snapshot(snapshot_id, semaphore):
    with semaphore:
        LANG = "en"
        BASE_URL = "https://data.together.xyz/redpajama-data-v2/v1.0.0"
        PARTITION = "head_middle"
        listings_tag = f"{LANG}-{snapshot_id}-{PARTITION}"
        os.makedirs("listings", exist_ok=True)
        # fetch the listings file for this snapshot
        subprocess.run(["wget", f"{BASE_URL}/listings/{listings_tag}.txt", "-O", f"listings/{listings_tag}.txt"])
        with open(f"listings/{listings_tag}.txt", "r") as listings_file:
            # download the documents and quality signals for every listed shard
            for line in listings_file:
                line = line.strip()
                url = f"{BASE_URL}/documents/{line}.json.gz"
                dest = f"documents/{line}.json.gz"
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                subprocess.run(["wget", url, "-O", dest])
                url = f"{BASE_URL}/quality_signals/{line}.signals.json.gz"
                dest = f"quality_signals/{line}.signals.json.gz"
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                subprocess.run(["wget", url, "-O", dest])
            # download the minhash and duplicates parquet files, re-reading the listings
            COMPS = ["minhash", "duplicates"]
            for comp in COMPS:
                listings_file.seek(0)
                for line in listings_file:
                    line = line.strip()
                    url = f"{BASE_URL}/{comp}/{line}.{comp}.parquet"
                    dest = f"{comp}/{line}.{comp}.parquet"
                    os.makedirs(os.path.dirname(dest), exist_ok=True)
                    subprocess.run(["wget", url, "-O", dest])

if __name__ == "__main__":
    os.chdir("/nfs/slow/data/ccnet")
    # cap the number of snapshots downloaded concurrently at the CPU count
    semaphore = mp.Semaphore(mp.cpu_count())
    processes = [mp.Process(target=download_snapshot, args=(i, semaphore)) for i in CC_SNAPSHOT_IDS]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    print("All Threads Completed", flush=True)