Closed: dmcguire81 closed this 4 years ago
Merging #592 into master will not change coverage. The diff coverage is 100.00%.
@@           Coverage Diff           @@
##           master     #592   +/-   ##
=======================================
  Coverage   85.69%   85.69%
=======================================
  Files          87       87
  Lines        4993     4993
  Branches      794      794
=======================================
  Hits         4279     4279
  Misses        578      578
  Partials      136      136
| Impacted Files | Coverage Δ |
|---|---|
| petastorm/etl/dataset_metadata.py | 88.00% <100.00%> (ø) |
| petastorm/reader.py | 90.24% <100.00%> (ø) |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 4b21a70...9229022.
@selitvin how is this PR failing for code coverage in files that it did not change?
Not sure. Probably some glitch related to one of the source files having a line removed.
Should we make the decision here automatically based on the filesystem used so we don't impact users of other file systems?
We could, but the pyarrow.Filesystem abstraction is basically broken, and it's up to you how many different code paths for different filesystems you want to maintain to work around that. Happy to change this or pursue a refactoring.
Would it look like a defaultdict with 1 entry for the broken filesystem, and a default of 10?
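For what it's worth, a minimal sketch of what that could look like. The names here (`WORKERS_BY_FILESYSTEM`, `pool_size_for`) are hypothetical, not petastorm's actual API, and pyarrow's legacy `S3FSWrapper` is assumed to be the broken filesystem in question:

```python
from collections import defaultdict

from pyarrow.filesystem import S3FSWrapper

# Default every filesystem to 10 reader workers; the one known
# non-thread-safe filesystem is pinned to a single worker.
WORKERS_BY_FILESYSTEM = defaultdict(lambda: 10, {S3FSWrapper: 1})

def pool_size_for(filesystem):
    # Illustrative helper: look up the pool size by filesystem type.
    return WORKERS_BY_FILESYSTEM[type(filesystem)]
```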
> but the pyarrow.Filesystem abstraction is basically broken

Can you clarify what you are referring to?
Another thought: would introducing serialization in S3FSWrapper guard the underlying non-thread-safe code? That would be nicer than having to choose between 1 and 10 (which I agree is a maintenance headache). See the sketch below.
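As a rough illustration of that idea, a proxy that funnels every call into the wrapped filesystem through a single lock might look like this. This is a sketch only, not pyarrow's actual S3FSWrapper implementation, and `SerializedFilesystem` is a hypothetical name:

```python
import threading

class SerializedFilesystem:
    """Proxy that serializes all method calls on the wrapped filesystem."""

    def __init__(self, fs):
        self._fs = fs
        self._lock = threading.Lock()

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself,
        # so it transparently forwards to the wrapped filesystem.
        attr = getattr(self._fs, name)
        if not callable(attr):
            return attr

        def locked(*args, **kwargs):
            # All callers contend on one lock, so the non-thread-safe
            # filesystem only ever sees one call at a time.
            with self._lock:
                return attr(*args, **kwargs)

        return locked
```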
> but the pyarrow.Filesystem abstraction is basically broken
>
> Can you clarify what you are referring to?
Sure, just that pyarrow assumes the s3fs code is thread-safe, when it clearly is not. That's their problem to deal with, but doing the instantiation in petastorm makes it your problem, too.
> Another thought: would introducing serialization in S3FSWrapper guard the underlying non-thread-safe code? That would be nicer than having to choose between 1 and 10 (which I agree is a maintenance headache)
It could, and I think fixing the problem within pyarrow is even better than fixing it within petastorm, but probably not as good as fixing it within s3fs. Unfortunately, I'm having trouble navigating the defect reporting process with Apache Arrow, since it's based on Jira, not GitHub. It's unclear if I'll need an additional login (and associated legal approval on my side), so if you can navigate that process and see your way through it, that would be great!
@selitvin the conversation with s3fs isn't getting a lot of traction, so I'm wary of trying to thread piecemeal fixes through petastorm instead of just isolating the problem by having the client instantiate the pyarrow.FileSystem. I'm going to close this out and open a PR to pass the filesystem in as an argument instead.
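From the caller's side, that follow-up approach would look roughly like this. This is a sketch under stated assumptions: it assumes Reader accepts a pre-built `pyarrow_filesystem`, and the bucket path is hypothetical:

```python
import s3fs
from pyarrow.filesystem import S3FSWrapper
from petastorm.reader import Reader

# The client builds the filesystem once and passes it in, so petastorm
# never has to instantiate (or work around) s3fs itself.
fs = S3FSWrapper(s3fs.S3FileSystem())
with Reader(pyarrow_filesystem=fs, dataset_path='bucket/path/to/dataset') as reader:
    for row in reader:
        print(row)
```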
Turns out that trying to make metadata discovery more efficient by adding filters is what caused the defect to crop up in the first place, as the deadlock doesn't exist without the filters argument. @selitvin there was no way that petastorm could have found or anticipated this defect.
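For context, the filters in question are the partition filters passed through to pyarrow's ParquetDataset during metadata discovery; the triggering call looks roughly like this (path and filter values are hypothetical):

```python
import pyarrow.parquet as pq

# Partition filters prune row groups at dataset-discovery time; it is
# this argument whose presence surfaced the deadlock described above.
dataset = pq.ParquetDataset(
    '/path/to/dataset',
    filters=[('partition_key', '=', 'some_value')],
)
```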
Fixes #590 and #591