The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0
Multithreaded metadata discovery in ParquetDataset may cause deadlock #604
I've documented two cases of pyarrow.parquet.ParquetDataset exhibiting deadlock: #590, when using S3 and s3fs (later refined to remove the pyarrow.filesystem.S3FSWrapper), and ARROW-10029, when using local storage and pyarrow.filesystem.LocalFileSystem.
The remaining gcs and hdfs use cases should probably be checked against pyarrow==1.0.1, and a workaround (downgrading pyarrow or disabling multithreaded metadata discovery) should be considered if the problem is pervasive.
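One way to check a given filesystem for the hang without freezing the calling process is to run the ParquetDataset construction in a child interpreter with a timeout. The following is only a rough sketch: `run_probe`, `check_metadata_discovery`, and the dataset path passed in are hypothetical names for illustration, and it assumes the `metadata_nthreads` argument of `pyarrow.parquet.ParquetDataset` as it exists in pyarrow 1.0.1.

```python
import subprocess
import sys

# Snippet executed in the child interpreter. metadata_nthreads controls how many
# threads ParquetDataset uses for metadata discovery (assumed pyarrow 1.0.1 API).
PROBE_TEMPLATE = (
    "import pyarrow.parquet as pq\n"
    "pq.ParquetDataset({path!r}, metadata_nthreads={nthreads})\n"
)


def run_probe(snippet, timeout_s):
    """Run `snippet` in a fresh interpreter and classify the outcome.

    Returns "ok" if it exits cleanly, "error" if it exits with a non-zero
    status, and "hang" if it is still running after `timeout_s` seconds
    (subprocess.run kills the child in that case).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "error"


def check_metadata_discovery(path, nthreads, timeout_s=60.0):
    """Probe ParquetDataset metadata discovery for `path` (hypothetical helper)."""
    snippet = PROBE_TEMPLATE.format(path=path, nthreads=nthreads)
    return run_probe(snippet, timeout_s)
```

Comparing, say, `check_metadata_discovery(path, nthreads=1)` with a multithreaded run on the same dataset would indicate whether disabling multithreading is a viable workaround for that filesystem. A "hang" result only shows that the call did not finish within the timeout, so the timeout should be generous for large datasets.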