uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Multithreaded metadata discovery in ParquetDataset may cause deadlock #604

Open dmcguire81 opened 4 years ago

dmcguire81 commented 4 years ago

I've documented two cases of pyarrow.parquet.ParquetDataset deadlocking: #590, when using S3 and s3fs (later refined to remove the pyarrow.filesystem.S3FSWrapper), and ARROW-10029, when using local storage and pyarrow.filesystem.LocalFileSystem.

The remaining GCS and HDFS use cases should probably be checked against pyarrow==1.0.1. If the problem turns out to be pervasive, a workaround should be considered: either downgrading the pyarrow version or disabling multithreading during metadata discovery.
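
To make the check above repeatable, here is a minimal sketch of a timeout harness that flags a call as a suspected hang. The helper name `completes_within`, the 60-second timeout, and the dataset path are illustrative assumptions, not part of petastorm or pyarrow. The commented usage relies on the `metadata_nthreads` argument of `pyarrow.parquet.ParquetDataset` as it exists in pyarrow 1.0.1; setting it to 1 is one way to try the "disable multithreading" workaround.

```python
import threading


def completes_within(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread and wait up to timeout_s.

    Returns (finished, result, error). finished is False when the call is
    still running after the timeout, which is treated as a suspected hang.
    The worker is a daemon thread, so a stuck call will not block
    interpreter exit.
    """
    outcome = {}

    def target():
        try:
            outcome["result"] = fn(*args, **kwargs)
        except BaseException as exc:  # report the failure instead of losing it
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    finished = not worker.is_alive()
    return finished, outcome.get("result"), outcome.get("error")


# Hypothetical usage (needs pyarrow==1.0.1; the path is a placeholder):
#
#   import pyarrow.parquet as pq
#
#   # Multithreaded metadata discovery, the configuration under suspicion.
#   ok, dataset, err = completes_within(
#       pq.ParquetDataset, 60, "/path/to/dataset", metadata_nthreads=10)
#
#   # Candidate workaround: single-threaded metadata discovery.
#   ok, dataset, err = completes_within(
#       pq.ParquetDataset, 60, "/path/to/dataset", metadata_nthreads=1)
```

A timeout only suggests a deadlock; a slow filesystem can also trip it, so the threshold should be chosen generously and the run repeated, since the hang may be intermittent.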