uabrc / uabrc.github.io

UAB Research Computing Documentation
https://docs.rc.uab.edu

S3 permission clarifications #594

Closed mdefende closed 10 months ago

mdefende commented 1 year ago

What would you like to see added?

This came about from ticket INC0628305. Basic overview:

  1. A user is uploading many TB of data from Cheaha to LTS, using s5cmd because it is faster than rclone, s3cmd, and the AWS CLI.
  2. s5cmd silently verifies MD5 checksums during upload, but it has no option to print them verbosely or to recheck them after upload for the researcher's peace of mind.
  3. rclone has an option to only compare checksums and report files that differ between the source and target.
  4. Data have the following properties:
    1. Bucket is owned by the lab group
    2. Files are uploaded by the researcher's personal account using s5cmd
  5. The researcher begins using rclone to check test files but gets an error saying that no files were found.
    1. The credentials in the rclone profile are for the lab group, not the researcher's personal account
  6. This error was not reproducible by RC staff, who had been given full permissions on all files in the bucket.
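The post-upload check described in steps 3 and 5 can be sketched as a small helper that builds an `rclone check` command. This is only an illustration: the remote name `lts-personal`, the bucket path, and the local directory are hypothetical, and the remote is assumed to be configured with the credentials of the account that uploaded the files (or another account the policy grants read access to). The helper only assembles the command and does not run it.

```python
import shlex


def rclone_check_command(local_dir, remote, bucket_path, report_file=None):
    """Build an `rclone check` command comparing a local directory to a bucket path.

    `remote` is the name of an rclone remote (hypothetical here). It should use
    credentials that are allowed to read the uploaded objects.
    """
    cmd = [
        "rclone", "check",
        local_dir,
        f"{remote}:{bucket_path}",
        # Only verify that source files exist and match on the destination.
        "--one-way",
    ]
    if report_file:
        # Write the paths of non-matching files to this file.
        cmd += ["--differ", report_file]
    return cmd


# Hypothetical names, for illustration only.
cmd = rclone_check_command("/data/project", "lts-personal", "lab-bucket/project")
print(shlex.join(cmd))
```

Pointing the same command at a remote configured with the lab group's credentials would be a quick way to confirm whether the "no files found" error follows the credentials rather than the data.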

Eventually, it was noticed that the researcher was using the lab account's credentials for their rclone profile, as mentioned above. The lab account owns the bucket but was not mentioned in the policy file. It was assumed that the bucket owner retained admin-like (read, write, delete) permissions on all files in the bucket regardless of which account uploaded them. This seems like it may be the case for s5cmd, but rclone assumes a more POSIX-like set of permissions, where the owner of a folder cannot interact with files in that folder that belong to another account without explicit permissions. This should be expanded upon somewhere. Possible suggestions for action:

  1. Clarify rclone's assumption about owner permissions on files in their bucket compared to other tools. s3cmd, s5cmd, and the AWS CLI need to be tested first.
  2. Add a suggestion to the policy file documentation to set permissions for the bucket owner explicitly. All permissions granted to other users would be kept separate from the owner permissions.
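As a rough sketch of suggestion 2, a policy file could carry one statement for the bucket owner and a separate statement for everyone else. The example below generates such a policy as JSON. Everything specific in it is an assumption for illustration: the bucket and user names are hypothetical, the principal format assumes Ceph-style user ARNs (`arn:aws:iam:::user/<name>`), and the action lists are just one reading of "read, write, delete" rather than a recommended set.

```python
import json


def build_bucket_policy(bucket, owner, other_users):
    """Return a bucket policy dict with an explicit statement for the bucket owner."""

    def principal(user):
        # Assumption: Ceph-style user ARN; adjust for the storage system in use.
        return f"arn:aws:iam:::user/{user}"

    # The bucket itself (needed for listing) and every object inside it.
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Owner permissions are stated explicitly instead of assumed.
                "Sid": "owner-read-write-delete",
                "Effect": "Allow",
                "Principal": {"AWS": [principal(owner)]},
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                ],
                "Resource": resources,
            },
            {
                # Permissions for other users stay in their own statement.
                "Sid": "other-users-read-write",
                "Effect": "Allow",
                "Principal": {"AWS": [principal(u) for u in other_users]},
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
                "Resource": resources,
            },
        ],
    }


# Hypothetical bucket and account names.
policy = build_bucket_policy("lab-bucket", "lab-group", ["researcher"])
print(json.dumps(policy, indent=2))
```

Keeping the owner in a dedicated statement means later changes to other users' access cannot accidentally remove the owner's access to files uploaded by someone else.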

Alternatively, move away from rclone in the S3 documentation. s5cmd is much faster for large datasets and files. Offering multiple tools gives researchers some choice, but facilitators then need to keep track of the details of each tool when issues arise. Globus should most likely be the first option for general transfers. s5cmd can be used for faster transfers of very large datasets and where automation in scripts is desired.

mdefende commented 10 months ago

This was addressed in the LTS FAQ https://docs.rc.uab.edu/data_management/lts/lts_faq/#why-can-i-not-interact-with-a-file-in-my-bucket