wwarriner opened 1 year ago
Single large file parallelization test:

```bash
#!/bin/bash
start_time="$(date -u +%s.%N)"
s5cmd --stat \
  --numworkers=$SLURM_CPUS_ON_NODE \
  --endpoint-url=https://s3.lts.rc.uab.edu/ \
  cp \
  --concurrency $SLURM_CPUS_ON_NODE \
  SOURCE_PATH \
  s3://DESTINATION_PATH/
end_time="$(date -u +%s.%N)"
elapsed="$(bc <<<"$end_time-$start_time")"
echo "Total of $elapsed seconds elapsed for process"
```
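If sub-second precision isn't needed, the same timing can be done with bash's built-in `SECONDS` counter instead of `bc`. This is a sketch, not from the thread; `sleep 1` stands in for the actual `s5cmd cp` transfer:

```bash
#!/bin/bash
# Time a command with the bash builtin $SECONDS (whole-second resolution,
# no dependency on bc).
SECONDS=0
sleep 1  # placeholder for the actual s5cmd cp command
elapsed=$SECONDS
echo "Total of $elapsed seconds elapsed for process"
```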
Checksum verification with rclone:

1. Set up rclone to work with LTS using `rclone config`, creating an `lts` endpoint pointing to `s3.lts.rc.uab.edu`. See: https://docs.rc.uab.edu/data_management/transfer/rclone/#setting-up-an-s3-lts-remote. The name in the docs may be `Ceph` instead of `lts`.
2. `mkdir ~/rclone-check-test`
3. `rclone copy lts:site-test ~/rclone-check-test`
4. `rclone check ~/rclone-check-test lts:site-test`
You should see lines like the following after step 4.

```
2023/06/29 15:34:51 NOTICE: S3 bucket site-test: 0 differences found
2023/06/29 15:34:51 NOTICE: S3 bucket site-test: 2 matching files
```
Note that the bucket `site-test` is publicly available, containing an example static website. Go to https://s3.lts.rc.uab.edu/site-test/index.html to visit the page. It is possible to mimic this use case using #566
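The rclone steps above can be combined into one script, sketched below under the assumption that an `lts` remote has already been configured with `rclone config`; the bucket name is a placeholder:

```bash
#!/bin/bash
# Hedged sketch: download a bucket to a scratch directory, then verify
# checksums against the remote with rclone check.
BUCKET="site-test"
CHECK_DIR="$HOME/rclone-check-test"
if command -v rclone >/dev/null 2>&1; then
  mkdir -p "$CHECK_DIR"
  rclone copy "lts:$BUCKET" "$CHECK_DIR"
  rclone check "$CHECK_DIR" "lts:$BUCKET"
  check_status="ran"
else
  check_status="rclone not installed; commands shown for reference only"
fi
echo "$check_status"
```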
We should show examples with sync too. That was really easy to use to move a whole tree.
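A hedged sketch of what a `sync` example could look like (untested against LTS; `SOURCE_DIR` and the destination path are placeholders):

```bash
#!/bin/bash
# s5cmd sync mirrors a local tree into a bucket, transferring only
# files that differ from the destination.
if command -v s5cmd >/dev/null 2>&1; then
  s5cmd \
    --endpoint-url=https://s3.lts.rc.uab.edu/ \
    sync \
    SOURCE_DIR/ \
    s3://DESTINATION_PATH/ || true
  sync_status="attempted"
else
  sync_status="s5cmd not installed; command shown for reference only"
fi
echo "$sync_status"
```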
How did you set up credentials? Env vars? ~/.aws/credentials ?
Great question!
For all S3-related activities, I put the credentials in env vars only usable within that session. It can be a bit of a pain but it's more secure than storing them in plaintext. The Secret Access Key should be treated with the same level of security you would give to any other password, because that is its functional purpose.
I also put an env var for the endpoint URL for convenience, so I don't have to pass `--endpoint-url` on every command. Both methods are valid alternatives.
```bash
# module load awscli  # not nearly as fast as s5cmd
#
# _OR_
#
# module load Anaconda3
# conda activate s5cmd  # which you've already created separately
export AWS_ACCESS_KEY_ID=$your_access_key
export AWS_SECRET_ACCESS_KEY=$your_secret_access_key
export AWS_ENDPOINT_URL=https://s3.lts.rc.uab.edu/  # haven't tested this with s5cmd
# do what you need to do here
```
Docs for s5cmd here: https://github.com/peak/s5cmd#specifying-credentials
Detailed info here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
And here: https://docs.aws.amazon.com/sdkref/latest/guide/feature-ss-endpoints.html
Just to add a comment here: putting your access keys in a job script actually makes them somewhat less secure than a credentials file, because job scripts are saved in the job script archive, and that archive is accessible to everyone in RC. So setting your keys as environment variables would only be more secure for interactive transfers, not batch jobs, and they could end up in your bash history anyway. There is probably a better answer for this somewhere, but I'm not sure saving them as plain text in a credentials file is much less secure than the other options here.
Great point. I'm not sure what the best option would be here.
Here is one potential option: https://docs.aws.amazon.com/secretsmanager/latest/userguide/security_cli-exposure-risks.html
Related bash history configuration:
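One common history setting (an assumption on my part, not from the thread): with `ignorespace` (included in `ignoreboth`), any command typed with a leading space is kept out of bash history, so an `export` of a secret can be hidden from it.

```bash
# Keep space-prefixed commands (and duplicates) out of bash history.
export HISTCONTROL=ignoreboth
#  export AWS_SECRET_ACCESS_KEY=...   # note the leading space
echo "$HISTCONTROL"
```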
Noticed that the currently available `awscli` modules on Cheaha are outdated and do not recognize the environment variable `AWS_ENDPOINT_URL`:

```bash
export AWS_ENDPOINT_URL=https://s3.lts.rc.uab.edu/
```

Installing the latest `awscli` within a conda environment recognized the variable `AWS_ENDPOINT_URL`. This was tested with s5cmd and Boto3, a Python library used to manage AWS services like S3 (see the Boto3 documentation).
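Since older boto3/awscli versions may not honor `AWS_ENDPOINT_URL`, a portable approach is to read the endpoint yourself and pass `endpoint_url` explicitly when creating the client. A hedged sketch (the helper name is mine, and the boto3 call is shown commented out for reference):

```python
import os

def s3_client_kwargs():
    """Build S3 client kwargs, preferring AWS_ENDPOINT_URL when set."""
    endpoint = os.environ.get("AWS_ENDPOINT_URL", "https://s3.lts.rc.uab.edu/")
    return {"service_name": "s3", "endpoint_url": endpoint}

# With boto3 installed (e.g. in the conda environment mentioned above):
# import boto3
# s3 = boto3.client(**s3_client_kwargs())
# print(s3.list_buckets())
```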
`awscli` is installable on an individual basis. The module should be removed and replaced with instructions in our docs on how to install it if someone needs it.
I see the conda part now, sorry I should read the whole message before responding :)
What would you like to see added?
Caveat!
Our understanding is that s5cmd uses md5 hashes to verify binary content integrity during uploads only, not downloads. For more intricate verification (e.g. of metadata, or using another hash), another tool will be required. A later post in this issue documents how to use `rclone check`.

Notes

- `--stat` shows totals of transferred, failed, and successful files at the end of the job
- `--numworkers=$SLURM_CPUS_ON_NODE` is perfect for a single-node job
- `--endpoint-url=https://s3.lts.rc.uab.edu/` is required for our S3 endpoint
- `mv` will remove the file from the source! `cp` is what we want until we've verified the files on the destination

Tests
Tests with 8 cpus and 8 GB memory on c0168:
Tests with 100 cpus and 200 GB memory on c0202 (amd-hdr100)
Example
Sample commands to get timing and s5cmd cp (in a script):
Other thoughts
We don't fully understand the `cp` flag `--concurrency`. There are also open questions about the Rados Gateway frontend configuration in `ceph.conf`: https://docs.ceph.com/en/latest/radosgw/config-ref/#ceph-object-gateway-config-reference