togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Invalid uri: ParseResult(...) must be of the form s3://<bucket>/<key> or file://<path> #101

Closed: timpal0l closed this issue 5 months ago

timpal0l commented 5 months ago

I am trying to run:

. configs/default.conf
cd app
docker build -t "redpajama_pipeline:1.0" .

bash scripts/run_prep_artifacts.sh \
  --config configs/default.conf \
  --listings listings.txt \
  --max_workers 32

And this is my configs/default.conf:

# run parameters
DATA_ROOT="/mnt/nfs_mount/cc_net/mined_split/"
ARTIFACTS_ID="nordic_pile_v2_rpv2"
INPUT_BASE_URI="file://${DATA_ROOT}/"
OUTPUT_BASE_URI="file://${DATA_ROOT}/nordic_pile_v2_rpv2_output"
MAX_DOCS=-1
LANGUAGES=("sv" "no" "da" "is")

# filename keep filters
FILENAME_KEEP_PATTERNS=(
".*/[a-z]{2}_middle\.json\.gz"
".*/[a-z]{2}_head\.json\.gz"
)

# General parameters used across steps
S3_ENDPOINT_URL=""
S3_BUCKET=""
S3_CCNET_PREFIX=""
S3_PROFILE=""

# Docker
DOCKER_S3_ENDPOINT_URL=""
DOCKER_MNT_DIR="/mnt/data"
DOCKER_REPO="redpajama_pipeline:1.0"

# Dedupe
MINHASH_NGRAM_SIZE="13"
MINHASH_NUM_PERMUTATIONS="128"
MINHASH_SIMILARITIES=(1.0 0.9 0.8 0.7)

# DSIR
DSIR_NUM_SAMPLES=500000
DSIR_FEATURE_DIM=10000

# Classifiers
CLASSIFIERS_NUM_SAMPLES=75000

# sampling for books artifacts
MAX_SAMPLES_PER_BOOK=1000
MAX_PARAGRAPHS_PER_BOOK_SAMPLE=250

# Others
INPUTS_PER_PROCESS=20 # the number of files processed by one process at a time

# domain blacklist categories
DOMAIN_BLACKLIST_CATEGORIES=(
"adult"
"porn"
)

# CC snapshot ids to process
CC_SNAPSHOT_IDS=(
"2023-50"
)
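
As the failing command at the end of the error output below shows, run_prep_artifacts.sh passes --cc_input_base_uri "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}" into the container, so with the values above the base URI presumably comes out empty:

# With S3_BUCKET="" and S3_CCNET_PREFIX="" as set above, the base URI
# handed to the container evaluates to the empty string:
echo "cc_input_base_uri='${S3_BUCKET%/}${S3_CCNET_PREFIX%/}'"
# prints: cc_input_base_uri=''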

Error message:

bash scripts/run_prep_artifacts.sh   --config configs/default.conf   --listings listings.txt  --max_workers 32
Created run id: 992409f7
Writing run id to file /mnt/nfs_mount/cc_net/mined_split/artifacts-992409f7/_RUN_ID
copied listings file from listings.txt to /mnt/nfs_mount/cc_net/mined_split/artifacts-992409f7/listings/listings.txt
__SNAPSHOT_LISTINGS_SUCCESS__ 2023-50
Toal number of listings: 1
__LANG_PREP_START__ sv @ Wed Jan 24 16:46:26 UTC 2024
[2024-01-24 16:46:28,581]::(PID 1)::INFO::Start preparing artifacts for sv
[2024-01-24 16:46:28,581]::(PID 1)::INFO::num_samples: 500000
[2024-01-24 16:46:28,581]::(PID 1)::INFO::PYTHONHASHSEED: 42
[2024-01-24 16:46:28,582]::(PID 1)::INFO::CCNetDownloader(sv) Start loading input listings...
[2024-01-24 16:46:28,583]::(PID 1)::INFO::CCNetDownloader(sv) Partitioning inputs by snapshot...
[2024-01-24 16:46:28,628]::(PID 1)::INFO::CCNetDownloader(sv) Start loading 500000 samples from 1 snapshots
writing progress: 0it [00:00, ?it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/app/src/utilities/io/reader.py", line 48, in read
    with self.__get_filehandle(uri) as fh:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/app/src/utilities/io/reader.py", line 82, in __get_filehandle
    raise ValueError(f"Invalid uri: {uri}; must be of the form "
ValueError: Invalid uri: ParseResult(scheme='', netloc='', path='2023-50/0000/sv_head.json.gz', params='', query='', fragment=''); must be of the form s3://<bucket>/<key> or file://<path>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 183, in _load_snapshot
    for idx, record in reader.read(
  File "/usr/app/src/utilities/io/reader.py", line 70, in read
    raise UnknownReadError(f"unknown __URI_READ_ERROR__ {uri}: "
core.exceptions.UnknownReadError: unknown __URI_READ_ERROR__ 2023-50/0000/sv_head.json.gz: ValueError: Invalid uri: ParseResult(scheme='', netloc='', path='2023-50/0000/sv_head.json.gz', params='', query='', fragment=''); must be of the form s3://<bucket>/<key> or file://<path>
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/app/src/prep_artifacts.py", line 189, in <module>
    main(artifacts_dir=args.artifacts_dir,
  File "/usr/app/src/prep_artifacts.py", line 117, in main
    ccnet.run(logger=logger)
  File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 115, in run
    counts_per_snapsh = pool.starmap(
                        ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
core.exceptions.UnknownReadError: unknown __URI_READ_ERROR__ 2023-50/0000/sv_head.json.gz: ValueError: Invalid uri: ParseResult(scheme='', netloc='', path='2023-50/0000/sv_head.json.gz', params='', query='', fragment=''); must be of the form s3://<bucket>/<key> or file://<path>
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 210, in _writer_worker
    data = data_queue.get()
           ^^^^^^^^^^^^^^^^
  File "<string>", line 2, in get
  File "/usr/local/lib/python3.11/multiprocessing/managers.py", line 822, in _callmethod
    kind, result = conn.recv()
                   ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
    raise EOFError
EOFError
writing progress: 0it [00:00, ?it/s]
Error: scripts/run_prep_artifacts.sh:7: command `docker run --env AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" --env AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" -v "${DATA_ROOT%/}":"${DOCKER_MNT_DIR%/}" -t "${DOCKER_REPO}" python3 src/prep_artifacts.py --artifacts_dir "${ARTIFACTS_DIR%/}" --cc_input "${ARTIFACTS_DIR%/}/listings/listings.txt" --cc_input_base_uri "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}" --cache_dir "${DOCKER_MNT_DIR%/}/.hf_cache" --lang "${lang}" --max_workers "${MAX_WORKERS}" --endpoint_url "$DOCKER_S3_ENDPOINT_URL" --dsir_num_samples "${DSIR_NUM_SAMPLES}" --dsir_feature_dim "${DSIR_FEATURE_DIM}" --classifiers_num_samples "${CLASSIFIERS_NUM_SAMPLES}" --max_paragraphs_per_book_sample "${MAX_PARAGRAPHS_PER_BOOK_SAMPLE}" --max_samples_per_book "${MAX_SAMPLES_PER_BOOK}"` failed with exit code 1
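
For what it's worth, the ParseResult in the traceback is exactly what urllib.parse.urlparse returns for a bare listing line, i.e. a URI with no scheme; a file:// URI parses with scheme='file' (a quick check outside the pipeline):

python3 -c 'from urllib.parse import urlparse; print(urlparse("2023-50/0000/sv_head.json.gz"))'
# ParseResult(scheme='', netloc='', path='2023-50/0000/sv_head.json.gz', params='', query='', fragment='')
python3 -c 'from urllib.parse import urlparse; print(urlparse("file:///mnt/data/2023-50/0000/sv_head.json.gz"))'
# ParseResult(scheme='file', netloc='', path='/mnt/data/2023-50/0000/sv_head.json.gz', params='', query='', fragment='')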

However, it cannot find the file "2023-50/0000/sv_head.json.gz". I tried adding ${DATA_ROOT} to S3_CCNET_PREFIX, since that prefix seems to be used when looking up the file, but for now S3_CCNET_PREFIX is set to "" because all my data is stored locally.

The file 2023-50/0000/sv_head.json.gz is stored under /mnt/nfs_mount/cc_net/mined_split/, which is the path specified in ${DATA_ROOT}...
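
If the reader simply joins the base URI and each listings line, one possible workaround (untested; this assumes a plain base-URI + "/" + key join) would be to make the combined S3_BUCKET/S3_CCNET_PREFIX value a file:// URI pointing at the data as it is mounted inside the container, i.e. at ${DOCKER_MNT_DIR} rather than at the host path in ${DATA_ROOT}:

# Untested sketch: point the composed base URI at the container-side mount.
# DATA_ROOT is bind-mounted at DOCKER_MNT_DIR (/mnt/data) inside the
# container, so the host path in DATA_ROOT would not resolve there.
S3_BUCKET="file://${DOCKER_MNT_DIR}"
S3_CCNET_PREFIX=""
# expected composed URI for the first listing:
#   file:///mnt/data/2023-50/0000/sv_head.json.gz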

And the listings.txt looks like this:

2023-50/0000/sv_head.json.gz
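
A quick way to check that the file is actually visible inside the container at the mounted path (using the same -v mount as the failing command above):

docker run -v "${DATA_ROOT%/}":"${DOCKER_MNT_DIR%/}" -t "${DOCKER_REPO}" \
  ls -l "${DOCKER_MNT_DIR%/}/2023-50/0000/sv_head.json.gz"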