This is my config/default.conf:

# run parameters
DATA_ROOT="/mnt/nfs_mount/cc_net/mined_split/"
ARTIFACTS_ID="nordic_pile_v2_rpv2"
INPUT_BASE_URI="file://${DATA_ROOT}/"
OUTPUT_BASE_URI="file://${DATA_ROOT}/nordic_pile_v2_rpv2_output"
MAX_DOCS=-1
LANGUAGES=("sv" "no" "da" "is")
# filename keep filters
FILENAME_KEEP_PATTERNS=(
".*/[a-z]{2}_middle\.json\.gz"
".*/[a-z]{2}_head\.json\.gz"
)
# General parameters used across steps
S3_ENDPOINT_URL=""
S3_BUCKET=""
S3_CCNET_PREFIX=""
S3_PROFILE=""
# Docker
DOCKER_S3_ENDPOINT_URL=""
DOCKER_MNT_DIR="/mnt/data"
DOCKER_REPO="redpajama_pipeline:1.0"
# Dedupe
MINHASH_NGRAM_SIZE="13"
MINHASH_NUM_PERMUTATIONS="128"
MINHASH_SIMILARITIES=(1.0 0.9 0.8 0.7)
# DSIR
DSIR_NUM_SAMPLES=500000
DSIR_FEATURE_DIM=10000
# Classifiers
CLASSIFIERS_NUM_SAMPLES=75000
# sampling for books artifacts
MAX_SAMPLES_PER_BOOK=1000
MAX_PARAGRAPHS_PER_BOOK_SAMPLE=250
# Others
INPUTS_PER_PROCESS=20 # the number of files processed by one process at a time
# domain blacklist categories
DOMAIN_BLACKLIST_CATEGORIES=(
"adult"
"porn"
)
# CC snapshot ids to process
CC_SNAPSHOT_IDS=(
"2023-50"
)
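Note that the failing command (see the error output below) builds the base URI as `"${S3_BUCKET%/}${S3_CCNET_PREFIX%/}"`. With both variables empty, as in my config, that expands to an empty string, which would explain why the reader only ever sees the bare relative path. A quick check of the expansion:

```shell
# Reproducing the expansion from run_prep_artifacts.sh with my (empty) values.
# ${var%/} strips a single trailing slash, if present.
S3_BUCKET=""
S3_CCNET_PREFIX=""
base_uri="${S3_BUCKET%/}${S3_CCNET_PREFIX%/}"
echo "cc_input_base_uri=[${base_uri}]"   # prints cc_input_base_uri=[]
```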
I am trying to run:

bash scripts/run_prep_artifacts.sh --config configs/default.conf --listings listings.txt --max_workers 32

Error message:
Created run id: 992409f7
Writing run id to file /mnt/nfs_mount/cc_net/mined_split/artifacts-992409f7/_RUN_ID
copied listings file from listings.txt to /mnt/nfs_mount/cc_net/mined_split/artifacts-992409f7/listings/listings.txt
__SNAPSHOT_LISTINGS_SUCCESS__ 2023-50
Toal number of listings: 1
__LANG_PREP_START__ sv @ Wed Jan 24 16:46:26 UTC 2024
[2024-01-24 16:46:28,581]::(PID 1)::INFO::Start preparing artifacts for sv
[2024-01-24 16:46:28,581]::(PID 1)::INFO::num_samples: 500000
[2024-01-24 16:46:28,581]::(PID 1)::INFO::PYTHONHASHSEED: 42
[2024-01-24 16:46:28,582]::(PID 1)::INFO::CCNetDownloader(sv) Start loading input listings...
[2024-01-24 16:46:28,583]::(PID 1)::INFO::CCNetDownloader(sv) Partitioning inputs by snapshot...
[2024-01-24 16:46:28,628]::(PID 1)::INFO::CCNetDownloader(sv) Start loading 500000 samples from 1 snapshots
writing progress: 0it [00:00, ?it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/app/src/utilities/io/reader.py", line 48, in read
with self.__get_filehandle(uri) as fh:
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/app/src/utilities/io/reader.py", line 82, in __get_filehandle
raise ValueError(f"Invalid uri: {uri}; must be of the form "
ValueError: Invalid uri: ParseResult(scheme='', netloc='', path='2023-50/0000/sv_head.json.gz', params='', query='', fragment=''); must be of the form s3://<bucket>/<key> or file://<path>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 183, in _load_snapshot
for idx, record in reader.read(
File "/usr/app/src/utilities/io/reader.py", line 70, in read
raise UnknownReadError(f"unknown __URI_READ_ERROR__ {uri}: "
core.exceptions.UnknownReadError: unknown __URI_READ_ERROR__ 2023-50/0000/sv_head.json.gz: ValueError: Invalid uri: ParseResult(scheme='', netloc='', path='2023-50/0000/sv_head.json.gz', params='', query='', fragment=''); must be of the form s3://<bucket>/<key> or file://<path>
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/app/src/prep_artifacts.py", line 189, in <module>
main(artifacts_dir=args.artifacts_dir,
File "/usr/app/src/prep_artifacts.py", line 117, in main
ccnet.run(logger=logger)
File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 115, in run
counts_per_snapsh = pool.starmap(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
core.exceptions.UnknownReadError: unknown __URI_READ_ERROR__ 2023-50/0000/sv_head.json.gz: ValueError: Invalid uri: ParseResult(scheme='', netloc='', path='2023-50/0000/sv_head.json.gz', params='', query='', fragment=''); must be of the form s3://<bucket>/<key> or file://<path>
Process Process-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/app/src/artifacts/downloaders/ccnet_downloader.py", line 210, in _writer_worker
data = data_queue.get()
^^^^^^^^^^^^^^^^
File "<string>", line 2, in get
File "/usr/local/lib/python3.11/multiprocessing/managers.py", line 822, in _callmethod
kind, result = conn.recv()
^^^^^^^^^^^
File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 249, in recv
buf = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
raise EOFError
EOFError
writing progress: 0it [00:00, ?it/s]
Error: scripts/run_prep_artifacts.sh:7: command `docker run --env AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" --env AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" -v "${DATA_ROOT%/}":"${DOCKER_MNT_DIR%/}" -t "${DOCKER_REPO}" python3 src/prep_artifacts.py --artifacts_dir "${ARTIFACTS_DIR%/}" --cc_input "${ARTIFACTS_DIR%/}/listings/listings.txt" --cc_input_base_uri "${S3_BUCKET%/}${S3_CCNET_PREFIX%/}" --cache_dir "${DOCKER_MNT_DIR%/}/.hf_cache" --lang "${lang}" --max_workers "${MAX_WORKERS}" --endpoint_url "$DOCKER_S3_ENDPOINT_URL" --dsir_num_samples "${DSIR_NUM_SAMPLES}" --dsir_feature_dim "${DSIR_FEATURE_DIM}" --classifiers_num_samples "${CLASSIFIERS_NUM_SAMPLES}" --max_paragraphs_per_book_sample "${MAX_PARAGRAPHS_PER_BOOK_SAMPLE}" --max_samples_per_book "${MAX_SAMPLES_PER_BOOK}"` failed with exit code 1
However, it cannot find the file "2023-50/0000/sv_head.json.gz". I tried adding ${DATA_ROOT} to S3_CCNET_PREFIX, since that prefix seems to be used when looking for the file, but for now S3_CCNET_PREFIX is set to "" because all my data is stored locally.
The file 2023-50/0000/sv_head.json.gz is stored under /mnt/nfs_mount/cc_net/mined_split/, which is what ${DATA_ROOT} points to...
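As far as I can tell from the traceback, the reader just runs the input path through urlparse and rejects anything without a scheme. A minimal reproduction of what I think is happening (the file:// prefix below is only illustrative; presumably it would need to point at the container-side mount, ${DOCKER_MNT_DIR}):

```python
from urllib.parse import urlparse

# With the base URI empty, the reader receives the bare relative listing path.
# urlparse finds no scheme, matching the ParseResult in my traceback.
uri = "" + "2023-50/0000/sv_head.json.gz"
print(urlparse(uri))  # scheme='' -> "Invalid uri" ValueError in reader.py

# With a file:// base URI prepended, urlparse finds a scheme the reader accepts.
fixed = "file:///mnt/data/" + "2023-50/0000/sv_head.json.gz"
print(urlparse(fixed).scheme)  # 'file'
```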
And the listings.txt looks like this: