lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0
317 stars, 105 forks

[Enhancement] Allow download_files for AWS to use multipart download #1208

Closed by abourramouss 10 months ago

abourramouss commented 11 months ago

There is a small inconvenience when using download_file on AWS: I would like to use its multipart download feature, as described in the documentation.

download_file(bucket, key, file_name=None, extra_args={})
Download a file from the storage backend. (Multipart download)

Parameters
bucket (str) – Name of the bucket

key (str) – Key of the object

file_name (Optional[str]) – Name of the file to save the object data

extra_args (Optional[Dict]) – Extra get arguments to be passed to the underlying backend implementation (dict).

Returns
Object, as a binary array or as a file-like stream if parameter stream is enabled

Return type
Union[str, bytes, TextIO, BinaryIO]

However, boto3 itself lets us pass a transfer config to control multipart downloads. I think a default transfer config is already applied, but I would like to be able to modify it if possible:

import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
Similar behavior as S3Transfer’s download_file() method, except that parameters are capitalized. Detailed examples can be found at S3Transfer’s Usage.

PARAMETERS:
Bucket (str) – The name of the bucket to download from.

Key (str) – The name of the key to download from.

Filename (str) – The path to the file to download to.

ExtraArgs (dict) – Extra arguments that may be passed to the client operation. For allowed download arguments see boto3.s3.transfer.S3Transfer.ALLOWED_DOWNLOAD_ARGS.

Callback (function) – A method which takes a number of bytes transferred to be periodically called during the download.

Config ([boto3.s3.transfer.TransferConfig](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig)) – The transfer configuration to be used when performing the transfer.
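For reference, here is a minimal sketch of calling boto3 directly with a custom Config, following the parameter list above. The helper name download_with_config and the values in TRANSFER_SETTINGS are illustrative assumptions, not recommendations. The keys are existing TransferConfig keyword arguments.

```python
from typing import Any, Dict, Optional

MB = 1024 * 1024

# Illustrative values only; each key is a keyword argument of
# boto3.s3.transfer.TransferConfig.
TRANSFER_SETTINGS: Dict[str, Any] = {
    "multipart_threshold": 16 * MB,  # objects at or above this size are split
    "multipart_chunksize": 16 * MB,  # size of each ranged request
    "max_concurrency": 20,           # worker threads used for the transfer
    "use_threads": True,
}


def download_with_config(bucket: str, key: str, filename: str,
                         settings: Optional[Dict[str, Any]] = None) -> None:
    """Hypothetical helper: download one object with a custom TransferConfig."""
    # Imported here so the settings above can be inspected without boto3 installed.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(**(settings or TRANSFER_SETTINGS))
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, filename, Config=config)
```

Calling download_with_config('mybucket', 'hello.txt', '/tmp/hello.txt') needs valid AWS credentials and an existing object, so it is not run here.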

I would like to send the Config object to the download_file wrapper, to allow multipart download:


import os
from pathlib import PosixPath

from boto3.s3.transfer import TransferConfig

# S3Path and s3_to_local_path are helpers from my own project (not shown).


def download_file(
    self, read_path: S3Path, base_path: PosixPath = PosixPath("/tmp")
) -> PosixPath:
    """Download a file from S3 and return the local path."""
    if not isinstance(read_path, S3Path):
        raise TypeError(f"Expected an S3Path, got {type(read_path).__name__}")
    try:
        local_path = s3_to_local_path(read_path, base_local_dir=str(base_path))
        os.makedirs(local_path.parent, exist_ok=True)
        transfer_config = TransferConfig(
            multipart_threshold=8 * 1024 * 1024,  # 8 MB
            max_concurrency=10,
            multipart_chunksize=8 * 1024 * 1024,  # 8 MB
            num_download_attempts=5,
            max_io_queue=100,
            io_chunksize=262144,  # 256 KB
            use_threads=True,
        )

        # Proposed usage: pass the TransferConfig object through the Lithops wrapper
        self.storage.download_file(
            read_path.bucket,
            read_path.key,
            str(local_path),
            Config=transfer_config,
        )
        return PosixPath(local_path)
    except Exception as e:
        print(f"Failed to download file {read_path.key}: {e}")
        raise  # re-raise so the caller never receives None instead of a path
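To make the effect of multipart_threshold and multipart_chunksize concrete, here is a rough sketch of how many GET requests a download would be split into. It assumes boto3 switches to ranged requests once the object size reaches the threshold and then fetches fixed-size chunks. The function name planned_requests is hypothetical and this is not boto3 code.

```python
import math

MB = 1024 * 1024


def planned_requests(object_size: int,
                     multipart_threshold: int = 8 * MB,
                     multipart_chunksize: int = 8 * MB) -> int:
    """Estimate the number of GET requests for an object of the given size."""
    if object_size < multipart_threshold:
        # Below the threshold the object is fetched in a single request.
        return 1
    # At or above the threshold, one ranged request per chunk (last may be short).
    return math.ceil(object_size / multipart_chunksize)
```

With the 8 MB settings from the snippet above, a 100 MB object would be fetched in 13 chunks, which max_concurrency=10 would download up to ten at a time.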

JosepSampe commented 10 months ago

I included the "config" parameter in #1209
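As a rough illustration of what forwarding such a parameter can look like, here is a minimal sketch of a wrapper that passes config straight to the boto3 client. The class name S3StorageSketch is hypothetical and this is not the code merged in #1209. The only real API it relies on is the boto3 client's download_file(Bucket, Key, Filename, ExtraArgs, Callback, Config) signature.

```python
from typing import Any, Dict, Optional


class S3StorageSketch:
    """Hypothetical storage wrapper around a boto3 S3 client."""

    def __init__(self, s3_client: Any) -> None:
        # Any object exposing boto3's client.download_file signature works here.
        self.s3_client = s3_client

    def download_file(self, bucket: str, key: str,
                      file_name: Optional[str] = None,
                      extra_args: Optional[Dict[str, Any]] = None,
                      config: Any = None) -> str:
        """Download an object and return the local file name that was used."""
        # Default the local name to the last component of the key.
        target = file_name or key.rsplit("/", 1)[-1]
        # config=None lets boto3 fall back to its default TransferConfig.
        self.s3_client.download_file(
            Bucket=bucket,
            Key=key,
            Filename=target,
            ExtraArgs=extra_args,
            Config=config,
        )
        return target
```

With a wrapper along these lines, a caller would build a TransferConfig once and pass it as config=transfer_config on each call.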

abourramouss commented 10 months ago

Thanks @JosepSampe, closing!