uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Deadlock in multithreaded Parquet metadata discovery #590

Closed by dmcguire81 4 years ago

dmcguire81 commented 4 years ago

Summary

The interaction between Petastorm and s3fs appears to be unusable: basic operations (make_reader, make_batch_reader) do not work at all. It is unclear how much testing and exercise this path has had, either within the Petastorm project or in the wider community. The breakdown could lie anywhere among this project, pyarrow (for S3FSWrapper), and s3fs, but we consume Petastorm directly, so we are starting here.

Tested Versions

This had to be tested on an earlier version of s3fs (0.4.2), because the more current versions (>=0.5.0) have a different problem with the aiobotocore wrapper leaking async coroutines into Petastorm. There will be a separate defect for that, and we'll take both to that project.

Repro Test Case

Setup

pip install petastorm==0.9.5
pip install s3fs==0.4.2

Test

import pyarrow.parquet as pq
from petastorm.fs_utils import get_filesystem_and_path_or_paths, normalize_dir_url

dataset_url = 's3://<redacted>'

# Repeat the basic steps that make_reader and make_batch_reader normally perform
dataset_url = normalize_dir_url(dataset_url)
fs, path = get_filesystem_and_path_or_paths(dataset_url)

# Single metadata thread: finished in seconds
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=1)
# Ten metadata threads: hung all night
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, metadata_nthreads=10)
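To keep a repro like the one above from blocking indefinitely, the call can be wrapped in a timeout. This is a minimal sketch, not part of Petastorm or pyarrow: `run_with_timeout` and `probe_metadata_discovery` are hypothetical helper names, and the 60-second default is an arbitrary assumption. The call runs on a daemon thread, so a deadlocked call is abandoned rather than cancelled.

```python
import threading


def run_with_timeout(fn, timeout_s):
    """Run fn() on a daemon thread and wait at most timeout_s seconds.

    Returns (True, result) if fn finished in time, (False, None) otherwise.
    A hung call is not cancelled; its daemon thread is simply left behind.
    """
    outcome = {}

    def target():
        try:
            outcome["result"] = fn()
        except Exception as exc:  # stored here, re-raised in the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return False, None
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["result"]


def probe_metadata_discovery(path, fs, nthreads, timeout_s=60.0):
    """Hypothetical helper: True if ParquetDataset returns within timeout_s."""
    import pyarrow.parquet as pq  # imported lazily; only the probe needs it

    finished, _ = run_with_timeout(
        lambda: pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=nthreads),
        timeout_s,
    )
    return finished
```

With `fs` and `path` from the repro, `probe_metadata_discovery(path, fs, nthreads=10)` would return `False` if the hang occurs, instead of blocking. The `metadata_nthreads` argument is the one used in the repro. Whether other pyarrow releases accept it is not verified here.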

selitvin commented 4 years ago

Which pyarrow version are you using that exhibits this hang? Does using a different pyarrow version solve the issue?

dmcguire81 commented 4 years ago

I didn't have to specify the pyarrow version - I'm assuming it's whatever is installed as a dependency of petastorm==0.9.5, but I'll check.

dmcguire81 commented 4 years ago

Looks like pyarrow==1.0.1.
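For anyone comparing environments, the installed versions of the packages involved can be read from the standard library. This is a small sketch: `installed_versions` is a hypothetical helper name, and the package list is just the three projects named in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["petastorm", "pyarrow", "s3fs"]).items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```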

dmcguire81 commented 4 years ago

I was able to get a repro test case that uses only pyarrow, so I'll close this. However, I would expect the impact to be fairly pervasive if it affects all storage protocols, so it would be a good idea to track a workaround separately (perhaps downgrading the version of pyarrow) in case anyone else sees similar problems.

selitvin commented 4 years ago

Thank you for the investigation. It's a nasty one. I would guess 0.15.1 is also impacted, given that our CI, which uses pyarrow 0.15.1, hangs from time to time with similar symptoms.