liormizr / s3path

s3path is a pathlib extension for AWS S3 Service
Apache License 2.0

Add example of buffered copy to docs #177

Open maresb opened 1 month ago

maresb commented 1 month ago

Copying a file to, from, or between S3 buckets is an extremely common use case. The dst.write_text(src.read_text()) solution is simple, but not viable for large files since it reads the entire contents into memory.

Luckily, I came across this gem by @gabrieldemarmiesse using shutil.copyfileobj. This works well for all combinations of pathlib.Path and s3path.S3Path as source and destination.

That comment is very difficult to discover, especially when searching for issues. Some more prominent discussions are https://github.com/liormizr/s3path/issues/98 (no solution) and https://github.com/liormizr/s3path/issues/44 (overly-complicated solution). Thus I think we should codify the solution in the documentation to make it easily discoverable.
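For reference, the approach boils down to opening both file handles ourselves and streaming between them with shutil.copyfileobj. Here is a minimal sketch (the bucket and file names below are placeholders; either side can be a pathlib.Path or an s3path.S3Path):

    import shutil
    from pathlib import Path
    from s3path import S3Path

    # Placeholder paths: either side can be a local Path or an S3Path.
    src = S3Path("/source-bucket/large-file.bin")
    dst = Path("/tmp/large-file.bin")

    # Open both handles ourselves and stream the contents in chunks,
    # so the whole file is never held in memory at once.
    with src.open("rb") as fsrc, dst.open("wb") as fdst:
        shutil.copyfileobj(fsrc, fdst)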


I tried a few things to optimize this code:

  1. There is also a function called shutil.copyfile that works with pathlib.Path objects. Unfortunately this function calls

    open(p, 'rb')

    which fails on s3path.S3Path objects with

    FileNotFoundError: [Errno 2] No such file or directory: '/bucket-name/filename'

    Therefore, it's necessary to open the file handles ourselves.

  2. Note that shutil.copyfileobj has an optional length argument for the buffer size. Its default value is defined as

    COPY_BUFSIZE = 1024 * 1024 if _WINDOWS else 64 * 1024

    A few quick experiments with my setup (Linux with fiber internet) show that the copy duration is insensitive to length until it drops to the ~1024 range, so I don't think we should suggest modifying this parameter (a snippet showing how it would be passed is below, purely for reference).
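For completeness, this is how an explicit buffer size would be passed if someone wanted to experiment with it (the paths are placeholders and the value shown is just shutil's non-Windows default):

    import shutil
    from pathlib import Path
    from s3path import S3Path

    src = S3Path("/source-bucket/large-file.bin")  # placeholder
    dst = Path("/tmp/large-file.bin")              # placeholder

    with src.open("rb") as fsrc, dst.open("wb") as fdst:
        # length sets the chunk size for each read/write cycle;
        # 64 KiB matches shutil's default on non-Windows platforms.
        shutil.copyfileobj(fsrc, fdst, length=64 * 1024)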