This came about from ticket INC0628305. Basic overview:
A user is uploading many TB of data from Cheaha to LTS, using s5cmd for the upload because it is faster than rclone, s3cmd, and the AWS CLI.
s5cmd silently checks md5 checksums during upload, but doesn't have an option to print them verbosely or to recheck after upload for the researcher's peace of mind.
rclone has an option to just check checksums and report files with differences between the source and target.
Data have the following properties:
Bucket is owned by the lab group
Files are uploaded by the researcher's personal account using s5cmd
The researcher began using rclone to check test files but got an error saying that no files were found.
The credentials in the rclone profile are for the lab group account, not the researcher's personal account.
This error was not reproducible by RC staff, who were given full permissions on all files in the bucket.
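For reference, the checksum comparison the researcher was attempting can be sketched as below. This is a minimal illustration, not the exact command from the ticket: the remote name `lts-lab`, the bucket path, and the local directory are hypothetical placeholders, and the helper only wraps `rclone check` with its `--one-way` flag (source files must exist on the destination; extra destination files are ignored).

```python
import shlex
import subprocess


def build_rclone_check(local_dir, remote, bucket_path, one_way=True):
    """Build the argv for an `rclone check` of a local directory against a bucket.

    `remote` is the name of an rclone remote (profile); all values passed in
    here are placeholders chosen for illustration.
    """
    cmd = ["rclone", "check", local_dir, f"{remote}:{bucket_path}"]
    if one_way:
        # Only verify that source files exist and match on the destination.
        cmd.append("--one-way")
    return cmd


def run_check(cmd):
    """Run the command and return its exit code (requires rclone on PATH)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical paths and remote name; print the command rather than run it.
    cmd = build_rclone_check("/data/project/upload", "lts-lab", "lab-bucket/upload")
    print(shlex.join(cmd))
```

Whether the check succeeds depends on which credentials the named remote carries, which is the crux of the problem described here.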
Eventually, it was noticed that the researcher's rclone profile used the lab account's credentials, as mentioned above. The lab account owns the bucket but was not named in the policy file. It had been assumed that the bucket owner retains admin-like (read, write, delete) permissions on all files in the bucket regardless of which account uploaded them. This seems like it may be the case for s5cmd, but rclone appears to assume a more POSIX-like set of permissions, where the owner of a folder cannot interact with files in that folder that they don't own without explicit permissions. This should be expanded upon somewhere.
Possible suggestions for action:
Clarify rclone's assumption about owner permissions on files in the owner's bucket compared to other tools. s3cmd, s5cmd, and the AWS CLI need to be tested first.
Add a suggestion to the policy file documentation to set permissions for the bucket owner explicitly. All permissions granted to other users would be kept separate from the owner permissions.
Alternatively, move away from rclone in the S3 documentation. s5cmd is much faster for large datasets and files. Having multiple tools gives researchers some choice, but facilitators then need to keep track of the details of each tool when issues arise. Globus should most likely be the first option for general transfers. s5cmd can be used for faster transfers of very large datasets and where automation in scripts is desired.
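The suggestion to grant the bucket owner explicit permissions in a separate statement could look roughly like the sketch below. It only generates the policy JSON. The bucket name, account names, the `arn:aws:iam:::user/<name>` principal format, and the set of actions given to other users are all assumptions for illustration and would need to be checked against how LTS identifies accounts.

```python
import json

# Read, write, and delete, matching the "admin-like" permissions described above.
OWNER_ACTIONS = ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
# Hypothetical default for non-owner users; adjust per lab.
USER_ACTIONS = ["s3:ListBucket", "s3:GetObject", "s3:PutObject"]


def build_bucket_policy(bucket, owner, other_users):
    """Return a policy dict with the owner's permissions in their own statement."""
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]

    def principal(name):
        # Assumed principal format; not confirmed by the ticket.
        return f"arn:aws:iam:::user/{name}"

    owner_statement = {
        "Sid": "owner-full-access",
        "Effect": "Allow",
        "Principal": {"AWS": [principal(owner)]},
        "Action": OWNER_ACTIONS,
        "Resource": resources,
    }
    user_statement = {
        "Sid": "lab-member-access",
        "Effect": "Allow",
        "Principal": {"AWS": [principal(u) for u in other_users]},
        "Action": USER_ACTIONS,
        "Resource": resources,
    }
    return {"Version": "2012-10-17", "Statement": [owner_statement, user_statement]}


if __name__ == "__main__":
    # Hypothetical names: a lab-owned bucket and one researcher account.
    policy = build_bucket_policy("lab-bucket", "lab-group", ["researcher"])
    print(json.dumps(policy, indent=2))
```

Keeping the owner in a dedicated statement means later changes to other users' access do not accidentally remove the owner's permissions.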