uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.8k stars 284 forks source link

Bugfix: multithreaded metadata deadlock #592

Closed dmcguire81 closed 4 years ago

dmcguire81 commented 4 years ago

Fixes #590 and #591

codecov[bot] commented 4 years ago

Codecov Report

Merging #592 into master will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #592   +/-   ##
=======================================
  Coverage   85.69%   85.69%           
=======================================
  Files          87       87           
  Lines        4993     4993           
  Branches      794      794           
=======================================
  Hits         4279     4279           
  Misses        578      578           
  Partials      136      136           
Impacted Files Coverage Δ
petastorm/etl/dataset_metadata.py 88.00% <100.00%> (ø)
petastorm/reader.py 90.24% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4b21a70...9229022. Read the comment docs.

dmcguire81 commented 4 years ago

@selitvin how is this PR failing for code coverage in files that it did not change?

selitvin commented 4 years ago

@selitvin how is this PR failing for code coverage in files that it did not change?

Not sure. Some glitch that has to do probably with one of the source files have a line removed.

dmcguire81 commented 4 years ago

Should we make the decision here automatically based on the filesystem used so we don't impact users of other file systems?

We could, but the pyarrow.Filesystem abstraction is basically broken, and it's up to you how much you want to have to maintain different code paths for different filesystems to work around that. Happy to change this or pursue a refactoring.

dmcguire81 commented 4 years ago

Would it look like a defaultdict with 1 entry for the broken filesystem, and a default of 10?

selitvin commented 4 years ago

but the pyarrow.Filesystem abstraction is basically broken

Can you clarify what are you referring to?

Another thought: would introducing serialization in S3FSWrapper guard underlying non-thread-safe code? That would be nicer than having to choose 1 vs 10 (which I agree is a maintenance headache)

dmcguire81 commented 4 years ago

but the pyarrow.Filesystem abstraction is basically broken

Can you clarify what are you referring to?

Sure, just that pyarrow assumes the s3fs code is threadsafe, when it clearly is not. That's they're problem to deal with, but doing the instantiation in petastorm makes it also your problem.

Another thought: would introducing serialization in S3FSWrapper guard underlying non-thread-safe code? That would be nicer than having to choose 1 vs 10 (which I agree is a maintenance headache)

It could, and I think fixing the problem within pyarrow is even better than fixing it within petastorm, but probably not as good as fixing it with s3fs. Unfortunately, I'm having trouble navigating the defect reporting process with Apache Arrow, since it's based on Jira, not, GitHub. It's unclear if I'll need an additional login (and associated legal approval on my side), so if you can navigate that process and see your way through it, that would be great!

dmcguire81 commented 4 years ago

@selitvin the conversation with s3fs isn't getting a lot of traction, so I'm wary of trying to thread piecemeal fixes through petastorm, instead of just isolating the problem by instantiating the pyarrow.FileSystem as a client. I'm going to close this out and open a PR to pass in the FS as an argument, instead.

dmcguire81 commented 4 years ago

Turns out trying to make metadata discovery more efficient by adding filters is what caused the defect to crop up, in the first place, as the deadlock doesn't exist without the filters argument. @selitvin there was no way that petastorm could have found or anticipated this defect.