togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Executing V2 issues #80

Open hicotton02 opened 8 months ago

hicotton02 commented 8 months ago

Since the new version came out, I have been trying to get things working. Here are a couple of issues I ran into and resolved:

- Needed s5cmd, so I had to install conda and then s5cmd.
- Installed rootless Docker, but networking is unavailable there, so for now I am running Docker as root.
- default.conf is missing the lines for the AWS secret and key ID; I added them, no problem.

When running the following command:

bash scripts/run_prep_artifacts.sh \
  --config configs/rp_v2.0.conf \
  --listings /path/to/listings/file.txt\
  --max_workers 32

which I modified as follows for my environment (Ubuntu 22.04 on WSL2):

sudo bash scripts/run_prep_artifacts.sh \
  --config configs/default.conf \
  --listings ../data/listings/listing.txt \
  --max_workers 32

I get the following error:

Created run id: 7f15c068
Writing run id to file /nfs/slow/data/artifacts-7f15c068/_RUN_ID
copied listings file from ../data/listings/listing.txt to /nfs/slow/data/artifacts-7f15c068/listings/listings.txt
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-15
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-23
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-35
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-41
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-42
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-49
__SNAPSHOT_LISTINGS_SUCCESS__ 2014-52
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-14
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-27
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-32
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-35
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2015-48
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-07
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-18
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-36
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-44
__SNAPSHOT_LISTINGS_SUCCESS__ 2016-50
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-04
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-09
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-17
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-34
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-47
__SNAPSHOT_LISTINGS_SUCCESS__ 2017-51
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-05
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-09
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-13
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-17
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-34
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-47
__SNAPSHOT_LISTINGS_SUCCESS__ 2018-51
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-04
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-09
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-13
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-18
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-22
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-26
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-30
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-35
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-47
__SNAPSHOT_LISTINGS_SUCCESS__ 2019-51
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-05
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-10
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-16
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-24
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-29
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-34
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-45
__SNAPSHOT_LISTINGS_SUCCESS__ 2020-50
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-04
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-10
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-17
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-21
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-25
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-31
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-39
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-43
__SNAPSHOT_LISTINGS_SUCCESS__ 2021-49
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-05
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-21
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-27
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-33
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-40
__SNAPSHOT_LISTINGS_SUCCESS__ 2022-49
__SNAPSHOT_LISTINGS_SUCCESS__ 2023-06
__SNAPSHOT_LISTINGS_SUCCESS__ 2023-14
Toal number of listings: 83
__LANG_PREP_START__ en @ Wed Nov  1 12:50:35 MDT 2023
[sudo] password for theskaz:
[2023-11-01 18:50:41,592]::(PID 1)::INFO::Start preparing artifacts for en
[2023-11-01 18:50:41,592]::(PID 1)::INFO::num_samples: 500000
[2023-11-01 18:50:41,592]::(PID 1)::INFO::PYTHONHASHSEED: 42
[2023-11-01 18:50:41,596]::(PID 1)::INFO::CCNetDownloader(en) Start loading input listings...
[2023-11-01 18:50:41,597]::(PID 1)::INFO::CCNetDownloader(en) Partitioning inputs by snapshot...
Traceback (most recent call last):
  File "/usr/app/src/prep_artifacts.py", line 186, in <module>
    main(artifacts_dir=args.artifacts_dir,
  File "/usr/app/src/prep_artifacts.py", line 114, in main
    ccnet.run(logger=logger)
  File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 95, in run
    1, self._num_samples // len(inputs_by_snapsh)
       ~~~~~~~~~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~
ZeroDivisionError: integer division or modulo by zero
Error: scripts/run_prep_artifacts.sh:7: command `sudo docker run --env AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" --env AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" -v "${DATA_ROOT%/}":"${DOCKER_MNT_DIR%/}" -t "${DOCKER_REPO}" python3 src/prep_artifacts.py --artifacts_dir "${ARTIFACTS_DIR%/}" --cc_input "${ARTIFACTS_DIR%/}/listings/listings.txt" --cc_input_base_uri "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}" --cache_dir "${DOCKER_MNT_DIR%/}/.hf_cache" --lang "${lang}" --max_workers "${MAX_WORKERS}" --endpoint_url "$DOCKER_S3_ENDPOINT_URL" --dsir_num_samples "${DSIR_NUM_SAMPLES}" --dsir_feature_dim "${DSIR_FEATURE_DIM}" --classifiers_num_samples "${CLASSIFIERS_NUM_SAMPLES}" --max_paragraphs_per_book_sample "${MAX_PARAGRAPHS_PER_BOOK_SAMPLE}" --max_samples_per_book "${MAX_SAMPLES_PER_BOOK}"` failed with exit code 1

Is my listings parameter correct, or is there some other issue?

hicotton02 commented 8 months ago

I can verify that len(inputs_by_snapsh) is 0

Edit: I don't seem to have the listings correct, or the S3 bucket info correct. Is it possible to get an example of a listings.txt and of the S3 settings? Here is what I have:

S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com"
S3_BUCKET="red-pajama"
S3_CCNET_PREFIX="/rs_cc_net"
S3_PROFILE="default"

DOCKER_S3_ENDPOINT_URL="https://red-pajama.s3.us-east-1.amazonaws.com"
DOCKER_MNT_DIR="/mnt/data"
DOCKER_REPO="theskaz/red-pajama"

Does this look right?

mauriceweber commented 7 months ago

Hi @hicotton02 , thanks for your question!

Regarding "default.conf is missing lines for the AWS Secret and ID": we deliberately left these out so that users specify them via export AWS_SECRET_ACCESS_KEY=... (this reduces the risk of uploading access keys to GitHub).
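
For example (the values below are placeholders; the variable names are the ones scripts/run_prep_artifacts.sh passes into the docker container):

export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"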

Once the environment variables are specified, you can get the listings via

s5cmd --profile "$S3_PROFILE" --endpoint-url "$S3_ENDPOINT_URL" \
    ls "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}/*" | grep "\.json\.gz$" | awk '{print $NF}' >"${LISTINGS_FILE}"

which should produce a file with contents of the form:

2014-15/0000/en_head.json.gz
2014-15/0000/en_middle.json.gz
2014-15/0001/en_head.json.gz
2014-15/0001/en_middle.json.gz
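
If your listings entries look different from this, the downloader can end up with zero snapshots after partitioning the inputs, which would explain the ZeroDivisionError above. As a quick sanity check (the regex simply encodes the example format, assuming head/middle buckets; adjust it if your ccnet output also contains tail files):

grep -vcE '^[0-9]{4}-[0-9]{2}/[0-9]{4}/[a-z]{2}_(head|middle)\.json\.gz$' "${LISTINGS_FILE}"

This should print 0; a non-zero count points at malformed listing entries.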

Let me know if this helps!:)

hicotton02 commented 7 months ago

Thank you so much for the response.

Is the s5cmd command supposed to point to my own s3 bucket or someone else's?

I created a bucket, but it is blank at this time. I remember that in V1 we were downloading data from, I think, Arxiv's bucket.

Edit: As part of this workstream, do we download the ccnet data separately? (I see their repo has been archived.)

mauriceweber commented 7 months ago

There is no data that needs to be pulled from an external S3 bucket, only from your own bucket where you have the ccnet output stored -- and that is only required for creating the artifacts. Are you creating your own artifacts for a custom dataset, or are you trying to reproduce the quality signals we have provided?

You can download the ccnet output from the public URLs (https://data.together.xyz/redpajama-data-v2/v1.0.0/) and then upload it to your own S3 bucket. Also check out the Hugging Face repo, which contains instructions on how to download the data.
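
As a rough sketch (assuming the downloaded files sit in a local documents/ directory laid out as <snapshot>/<shard>/<lang>_<bucket>.json.gz, and that S3_PROFILE, S3_ENDPOINT_URL, S3_BUCKET and S3_CCNET_PREFIX are set as in default.conf), the upload could look like this:

s5cmd --profile "$S3_PROFILE" --endpoint-url "$S3_ENDPOINT_URL" \
    cp documents/ "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}/"

Once the files are in place, the listings command above should pick them up.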

hicotton02 commented 7 months ago

I am going to end up doing both. Right now I am just learning how you did all this. Once I have that done and have some understanding of what is going on, I want to add and remove data to see how that affects everything. I am in my master's program for AI/ML and am using this project to learn in addition to what I am learning in school.

hicotton02 commented 7 months ago

I wrote a Python script to download all the ccnet data based on your links above. It downloads in parallel and is basic, but it saturates my connection and server, so the process is about as efficient as it is going to get.

import os
import subprocess
import multiprocessing as mp

CC_SNAPSHOT_IDS = [
  "2014-15",
  "2014-23",
  "2014-35",
  "2014-41",
  "2014-42",
  "2014-49",
  "2014-52",
  "2015-14",
  "2015-22",
  "2015-27",
  "2015-32",
  "2015-35",
  "2015-40",
  "2015-48",
  "2016-07",
  "2016-18",
  "2016-22",
  "2016-26",
  "2016-30",
  "2016-36",
  "2016-40",
  "2016-44",
  "2016-50",
  "2017-04",
  "2017-09",
  "2017-17",
  "2017-22",
  "2017-26",
  "2017-30",
  "2017-34",
  "2017-39",
  "2017-43",
  "2017-47",
  "2017-51",
  "2018-05",
  "2018-09",
  "2018-13",
  "2018-17",
  "2018-22",
  "2018-26",
  "2018-30",
  "2018-34",
  "2018-39",
  "2018-43",
  "2018-47",
  "2018-51",
  "2019-04",
  "2019-09",
  "2019-13",
  "2019-18",
  "2019-22",
  "2019-26",
  "2019-30",
  "2019-35",
  "2019-39",
  "2019-43",
  "2019-47",
  "2019-51",
  "2020-05",
  "2020-10",
  "2020-16",
  "2020-24",
  "2020-29",
  "2020-34",
  "2020-40",
  "2020-45",
  "2020-50",
  "2021-04",
  "2021-10",
  "2021-17",
  "2021-21",
  "2021-25",
  "2021-31",
  "2021-39",
  "2021-43",
  "2021-49",
  "2022-05",
  "2022-21",
  "2022-27",
  "2022-33",
  "2022-40",
  "2022-49",
  "2023-06",
  "2023-14"
]

def download_snapshot(snapshot_id, semaphore):
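    # Download the listings, documents, quality signals, minhash and duplicates files for a single snapshot.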
    with semaphore:
        LANG = "en"
        BASE_URL = "https://data.together.xyz/redpajama-data-v2/v1.0.0"
        PARTITION = "head_middle"
        listings_tag = f"{LANG}-{snapshot_id}-{PARTITION}"
        os.makedirs("listings", exist_ok=True)
        subprocess.run(["wget", f"{BASE_URL}/listings/{listings_tag}.txt", "-O", f"listings/{listings_tag}.txt"])

        with open(f"listings/{listings_tag}.txt", "r") as listings_file:
            for line in listings_file:
                line = line.strip()
                url = f"{BASE_URL}/documents/{line}.json.gz"
                dest = f"documents/{line}.json.gz"
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                subprocess.run(["wget", url, "-O", dest])

                url = f"{BASE_URL}/quality_signals/{line}.signals.json.gz"
                dest = f"quality_signals/{line}.signals.json.gz"
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                subprocess.run(["wget", url, "-O", dest])

            COMPS = ["minhash", "duplicates"]
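            # Rewind the listings file and fetch the dedup artifacts (minhash / duplicates parquet files) for each shard.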
            for comp in COMPS:
                listings_file.seek(0)
                for line in listings_file:
                    line = line.strip()
                    url = f"{BASE_URL}/{comp}/{line}.{comp}.parquet"
                    dest = f"{comp}/{line}.{comp}.parquet"
                    os.makedirs(os.path.dirname(dest), exist_ok=True)
                    subprocess.run(["wget", url, "-O", dest])

if __name__ == "__main__":
    os.chdir("/nfs/slow/data/ccnet")
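    # Limit concurrent snapshot downloads to the number of CPU cores.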
    semaphore = mp.Semaphore(mp.cpu_count())
    processes = [mp.Process(target=download_snapshot, args=(i, semaphore)) for i in CC_SNAPSHOT_IDS]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    print("All Threads Completed", flush=True)