The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
Reading Parquet files stored on s3 using petastorm generates connection warnings #376
I have a TensorFlow model that I would like to feed with Parquet files stored on S3. I'm using petastorm to query these files from S3, and the result of the query is stored as a TensorFlow dataset thanks to petastorm.tf_utils.make_petastorm_dataset.
import s3fs
from pyarrow.filesystem import S3FSWrapper
from petastorm.reader import Reader
from petastorm.tf_utils import make_petastorm_dataset
dataset_url = "analytics.xxx.xxx"  # S3 bucket name
fs = s3fs.S3FileSystem()
wrapped_fs = S3FSWrapper(fs)
with Reader(pyarrow_filesystem=wrapped_fs, dataset_path=dataset_url) as reader:
    dataset = make_petastorm_dataset(reader)
This works pretty well, except that it generates 20+ lines of connection warnings:
W0514 18:56:42.779965 140231344908032 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.782773 140231311337216 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.854569 140232468973312 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.868761 140231328122624 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.885518 140230816429824 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
...
According to this thread https://stackoverflow.com/questions/53765366/urllib3-connectionpool-connection-pool-is-full-discarding-connection, it's certainly related to urllib3, but I can't figure out a way to get rid of these warnings.
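One thing that might be worth trying is to raise the log level of the logger that emits these messages. The sketch below is not a confirmed fix. It assumes the warnings come from the standard-library logger named "urllib3.connectionpool", which can differ if an older botocore with a vendored urllib3 is installed. The helper name silence_pool_warnings is hypothetical. Note that this only hides the message: the pool still discards connections when more threads than pool slots hit S3 at once.

```python
import logging

# Assumption: the warnings are emitted through the logger named
# "urllib3.connectionpool". Older botocore releases vendored urllib3 under a
# different logger name, so adjust this constant if the warnings persist.
URLLIB3_POOL_LOGGER = "urllib3.connectionpool"


def silence_pool_warnings(logger_name=URLLIB3_POOL_LOGGER):
    """Raise the logger's threshold so WARNING records are dropped.

    Hypothetical helper for illustration; ERROR and above still get through.
    """
    pool_logger = logging.getLogger(logger_name)
    pool_logger.setLevel(logging.ERROR)
    return pool_logger


# Call this once, before creating the petastorm Reader.
pool_logger = silence_pool_warnings()
```

If the goal is to avoid the discarded connections rather than hide the message, the botocore client configuration has a max_pool_connections setting that could be raised instead. Whether that resolves the warnings in this setup is untested.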