A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
There is a small inconvenience when using download_file for AWS: I would like to use the multipart download feature of download_file, as specified by the documentation.
download_file(bucket, key, file_name=None, extra_args={})
Download a file from the storage backend. (Multipart download)
Parameters
bucket (str) – Name of the bucket
key (str) – Key of the object
file_name (Optional[str]) – Name of the file to save the object data
extra_args (Optional[Dict]) – Extra get arguments to be passed to the underlying backend implementation (dict).
Returns
Object, as a binary array or as a file-like stream if parameter stream is enabled
Return type
Union[str, bytes, TextIO, BinaryIO]
But the official boto3 client lets us specify a transfer config to control multipart downloads. I think a transfer config is already applied by default, but I would like to be able to modify it if possible.
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
Similar behavior as S3Transfer’s download_file() method, except that parameters are capitalized. Detailed examples can be found at S3Transfer’s Usage.
PARAMETERS:
Bucket (str) – The name of the bucket to download from.
Key (str) – The name of the key to download from.
Filename (str) – The path to the file to download to.
ExtraArgs (dict) – Extra arguments that may be passed to the client operation. For allowed download arguments see boto3.s3.transfer.S3Transfer.ALLOWED_DOWNLOAD_ARGS.
Callback (function) – A method which takes a number of bytes transferred to be periodically called during the download.
Config ([boto3.s3.transfer.TransferConfig](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html#boto3.s3.transfer.TransferConfig)) – The transfer configuration to be used when performing the transfer.
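To make it clearer why I care about tuning the Config, here is a rough sketch of how I understand the threshold and chunk size to interact: an object smaller than multipart_threshold is fetched in one request, and a larger one is split into ranged requests of multipart_chunksize bytes. The plan_ranges helper is hypothetical and only illustrates the arithmetic; it is not part of boto3 or lithops, and the 8 MB values are the ones I use in my snippet below.

```python
import math

MB = 1024 * 1024


def plan_ranges(object_size, multipart_threshold=8 * MB, multipart_chunksize=8 * MB):
    """Return inclusive (start, end) byte ranges a ranged download might request.

    Hypothetical helper: it mirrors my understanding of the threshold and
    chunk size semantics, not the actual boto3 internals.
    """
    if object_size <= 0:
        return []
    # Below the threshold the whole object is fetched in a single request.
    if object_size < multipart_threshold:
        return [(0, object_size - 1)]
    # Otherwise the object is split into fixed-size parts; the last may be shorter.
    num_parts = math.ceil(object_size / multipart_chunksize)
    return [
        (i * multipart_chunksize, min((i + 1) * multipart_chunksize, object_size) - 1)
        for i in range(num_parts)
    ]


if __name__ == "__main__":
    for start, end in plan_ranges(20 * MB):
        print(f"bytes={start}-{end}")
```

With these values a 20 MB object would be split into three parts, so lowering multipart_chunksize or raising max_concurrency is what I would want to experiment with.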
I would like to pass the Config object to the download_file wrapper to allow multipart download:
import os
from pathlib import PosixPath

from boto3.s3.transfer import TransferConfig

# S3Path and s3_to_local_path are helpers from my own project, defined elsewhere.

def download_file(
    self, read_path: S3Path, base_path: PosixPath = PosixPath("/tmp")
) -> PosixPath:
    """Download a file from S3 and return the local path."""
    if isinstance(read_path, S3Path):
        try:
            local_path = s3_to_local_path(read_path, base_local_dir=str(base_path))
            os.makedirs(local_path.parent, exist_ok=True)
            transfer_config = TransferConfig(
                multipart_threshold=8 * 1024 * 1024,  # 8 MB
                max_concurrency=10,
                multipart_chunksize=8 * 1024 * 1024,  # 8 MB
                num_download_attempts=5,
                max_io_queue=100,
                io_chunksize=262144,  # 256 KB
                use_threads=True,
            )
            # Pass the TransferConfig object to the download_file method using the lithops wrapper
            # (Config is the argument I am requesting; the wrapper does not accept it today)
            self.storage.download_file(
                read_path.bucket,
                read_path.key,
                str(local_path),
                Config=transfer_config,
            )
            return PosixPath(local_path)
        except Exception as e:
            print(f"Failed to download file {read_path.key}: {e}")
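For illustration, this is roughly how I imagine the wrapper could forward the object to the backend. It is only a sketch under assumptions: StorageSketch and RecordingClient are hypothetical names, the config keyword is my suggestion, and none of this is the actual lithops implementation. The only real API it relies on is the boto3 client signature quoted above (Bucket, Key, Filename, ExtraArgs, Callback, Config).

```python
from typing import Any, Dict, Optional


class StorageSketch:
    """Hypothetical stand-in for the storage wrapper (not the real lithops class)."""

    def __init__(self, s3_client: Any):
        # s3_client is expected to expose a boto3-style download_file method.
        self.s3_client = s3_client

    def download_file(
        self,
        bucket: str,
        key: str,
        file_name: Optional[str] = None,
        extra_args: Optional[Dict] = None,
        config: Any = None,
    ) -> str:
        # Assumption: default the local name to the last component of the key.
        file_name = file_name or key.split("/")[-1]
        kwargs = {"Bucket": bucket, "Key": key, "Filename": file_name}
        if extra_args:
            kwargs["ExtraArgs"] = extra_args
        # Forward the transfer config only when the caller supplied one,
        # so the backend keeps its own default otherwise.
        if config is not None:
            kwargs["Config"] = config
        self.s3_client.download_file(**kwargs)
        return file_name


class RecordingClient:
    """Fake client that records the arguments it receives (no network access)."""

    def __init__(self):
        self.calls = []

    def download_file(self, Bucket, Key, Filename,
                      ExtraArgs=None, Callback=None, Config=None):
        self.calls.append(
            {"Bucket": Bucket, "Key": Key, "Filename": Filename,
             "ExtraArgs": ExtraArgs, "Config": Config}
        )


if __name__ == "__main__":
    client = RecordingClient()
    storage = StorageSketch(client)
    storage.download_file("mybucket", "hello.txt", "/tmp/hello.txt", config="my-config")
    print(client.calls[0])
```

In real use the client would be a boto3 S3 client and config a TransferConfig instance; the fake client is there only so the forwarding can be checked without credentials.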