`tsd-s3cmd sync --preserve` does not preserve permissions

espenhgn commented 3 months ago

I need some transferred files to retain their "executable" permission from the source location:

$ ls -l GenoPred/pipeline/resources/software/ldsc/
total 164
-rw-rw-r-- 1 <user> <project>  3605 Jun  5 09:48 CHANGELOG
drwxrwsr-x 2 <user> <project>  4096 Jun  5 09:46 ContinuousAnnotations
-rw-rw-r-- 1 <user> <project> 35142 Jun  5 09:46 LICENSE
-rw-rw-r-- 1 <user> <project>  5428 Jun  5 09:46 README.md
-rw-rw-r-- 1 <user> <project>   164 Jun  5 09:48 environment.yml
-rwxrwxr-x 1 <user> <project> 30790 Jun  5 09:46 ldsc.py
-rw-rw-r-- 1 <user> <project> 27306 Jun  5 16:04 ldsc.pyc
drwxrwsr-x 2 <user> <project>  4096 Jun  5 16:04 ldscore
-rwxrwxr-x 1 <user> <project>  3008 Jun  5 09:46 make_annot.py
-rwxrwxr-x 1 <user> <project> 30281 Jun  5 09:48 munge_sumstats.py
-rw-rw-r-- 1 <user> <project>   110 Jun  5 09:46 requirements.txt
-rw-rw-r-- 1 <user> <project>   577 Jun  5 09:46 setup.py
drwxrwsr-x 8 <user> <project>  4096 Jun  5 09:48 test

But on TSD the executable bit is gone, even though all files were transferred with `tsd-s3cmd sync --recursive --preserve GenoPred s3://espehage-nird/`.

According to tsd-s3cmd --s3cmd-help:

-p, --preserve        Preserve filesystem attributes (mode, ownership,
                        timestamps). Default for [sync] command.

haatveit commented 2 months ago

The --s3cmd-help output comes from s3cmd itself; this project is only a simple wrapper around it.

I understand that this parameter is confusing, so I will go a bit into the background here. Object storage is very different from POSIX file systems, and does not try to implement their semantics or permission schemes. When told to preserve permissions, s3cmd builds a tool-specific header containing these attributes and sends it along with the request to the S3 server, so that they are stored as object metadata. When data from that bucket is later copied out again via the S3 protocol, using s3cmd with --preserve, that metadata is used to set file/directory attributes on the destination POSIX file system.

There is no way to specify on-disk file permissions for data transferred via the S3 protocol, because out-of-band access to the stored data (that is, reading it directly from the underlying file system rather than through S3) was not among the goals Amazon had in mind when designing the protocol.
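To illustrate the round trip described above, here is a minimal Python sketch. It is not s3cmd's actual code: the function names `encode_attrs` and `apply_attrs` are hypothetical, and the `key:value/key:value` string is a simplified stand-in for the metadata header s3cmd writes (to my knowledge `x-amz-meta-s3cmd-attrs`, which carries more fields such as ownership). The point is only that the mode travels as text next to the object and takes effect when a client applies it again on download.

```python
import os
import stat
import tempfile


def encode_attrs(path):
    """Serialize a file's mode and mtime as 'key:value/key:value' text.

    Simplified, assumed format; the real s3cmd header has more fields.
    """
    st = os.stat(path)
    return f"mode:{st.st_mode}/mtime:{int(st.st_mtime)}"


def apply_attrs(path, attrs):
    """Restore mode and mtime on a downloaded file from the metadata text."""
    fields = dict(item.split(":", 1) for item in attrs.split("/"))
    # Keep only the permission bits; the file type bits are not settable.
    os.chmod(path, stat.S_IMODE(int(fields["mode"])))
    mtime = int(fields["mtime"])
    os.utime(path, (mtime, mtime))


def demo():
    """Simulate upload (encode) and download (apply) on two local files."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "ldsc.py")
        dst = os.path.join(tmp, "ldsc_downloaded.py")
        for p in (src, dst):
            with open(p, "w") as fh:
                fh.write("print('hello')\n")
        os.chmod(src, 0o775)  # executable at the source
        os.chmod(dst, 0o664)  # the bytes alone carry no permissions
        header = encode_attrs(src)  # would be stored as object metadata
        apply_attrs(dst, header)  # what a --preserve download would do
        return stat.S_IMODE(os.stat(dst).st_mode)


if __name__ == "__main__":
    print(oct(demo()))
```

Until something like `apply_attrs` runs on the receiving side, the stored mode is inert text attached to the object, which is why files read directly from the storage backend do not show it.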

espenhgn commented 2 months ago

Thanks for the clarification. But do you have alternative suggestions for incrementally syncing large sets of files within a directory while preserving the POSIX file attributes (other than tarring/zipping all files)? To provide a bit of background, GenoPred is a Snakemake project that consists of nearly 100 GB of files once all download rules have been applied on a system with internet access (e.g., SAGA/NIRD).

leondutoit commented 2 months ago

Two options (maybe) for the medium term:

- We're working on providing S3 via the IBM storage system; maybe that will make this work.
- Elixir is working on setting up CVMFS to be able to distribute pipelines across infrastructures efficiently (to me this sounds like the better option).
