uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

The boto config used by s3fs is not parameterizable when instantiated by Petastorm #593

Closed dmcguire81 closed 4 years ago

dmcguire81 commented 4 years ago

The problem is that the stock s3fs configuration uses boto's legacy retry mode, which is no longer what AWS recommends. Because Petastorm instantiates the s3fs client itself inside both make_reader and make_batch_reader, there is no opportunity to configure it correctly, and a brute-force change to the ~/.aws/config file affects every piece of code using boto3, even unrelated code, so there is no fine-grained, per-use-case control over the number of retries.
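For context, the coarse-grained workaround mentioned above is a global change to the shared AWS config file. A sketch of what that looks like (the profile name and values are illustrative, not taken from this thread):

```ini
# ~/.aws/config — applies to *every* boto3 client on the machine,
# which is exactly why it is too blunt for per-reader tuning.
[default]
retry_mode = standard
max_attempts = 5
```

Any process using boto3 under this profile picks these values up, which is the "no fine-grained control" problem being described.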

selitvin commented 4 years ago

Got it. I assume our idea of parametrizing make_reader/make_batch_reader with a file-system instance would solve this. Right?

dmcguire81 commented 4 years ago

The merged code solved this, but parameterizing with a FileSystem might also let me work around the deadlock in the interaction between pyarrow and s3fs. As mentioned, I'll pull you into that ongoing conversation.