dmcguire81 closed this issue 4 years ago
Yep. Agreed. Can you please elaborate on what the issue is?
There's nothing new to report: better throughput, in exchange for less predictable memory consumption, is a good trade-off for us, and probably for others, so it should be available in the factory methods.
While we're talking, though, it seems like the factory methods only work seamlessly on whatever storage Uber happens to use (guessing HDFS, given `libhdfs` is the only config given consistently elevated visibility). The experience on S3 is pretty bad.
We're considering dropping down to the `petastorm.reader.Reader` interface to deal with having to patch `s3fs` and `pyarrow` for their respective shortcomings (or this code base to work around them), but wanted to get your take on refactoring the factory methods themselves, since it seems like a losing battle to offer all the flexibility users will need while still doing all the instantiation yourself.
Your diagnosis is correct. We did not use petastorm with S3 internally, hence this is not a polished feature. I can help if you can point me to the issues you run into. I did verify basic functionality, but not more than that.
I saw that you created #594. If this is not sufficient (curious what else we would need), perhaps we can have `make_reader`/`make_batch_reader` take an instance of an already-constructed filesystem object?
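As a minimal sketch of what that could look like (parameter and function names here are hypothetical, not the real petastorm signatures), the factory would skip URL-based discovery whenever a filesystem is injected:

```python
# Hypothetical sketch (not the real petastorm API): a `filesystem`
# parameter on the factory method that bypasses URL-based discovery.

def resolve_filesystem(dataset_url):
    # Stand-in for scheme-based discovery (hdfs://, s3://, file://, ...).
    scheme = dataset_url.split("://", 1)[0]
    return "discovered-%s-filesystem" % scheme

def make_batch_reader(dataset_url, filesystem=None, **reader_kwargs):
    # If the caller injects a filesystem, use it as-is and skip discovery;
    # defects in the injected object then belong to its own project.
    if filesystem is None:
        filesystem = resolve_filesystem(dataset_url)
    # Real code would construct and return a Reader over `filesystem`.
    return filesystem

print(make_batch_reader("s3://bucket/dataset"))
# -> discovered-s3-filesystem
print(make_batch_reader("s3://bucket/dataset", filesystem="injected"))
# -> injected
```

The caller could then hand in a `pyarrow` or `s3fs` filesystem object configured however they need, and the factory stays out of the storage-configuration business.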
We built up a bit of a backlog while waiting on legal to approve submitting PRs back, but they're all in there now, so you can see everything that's been needed so far (except for `s3fs==0.5.0` throwing `TypeError: 'coroutine' object is not iterable` from `make_batch_reader`, which I just found today, by accident, while trying to build you a repro case).
I think passing the `pyarrow.FileSystem` is a great option, since the rest of the code that's hidden (e.g., instances of `petastorm.worker_pool.*`) is under your direct control, so it won't be as much of a maintenance burden to keep "under the hood".
For your part, having it be parameterized instead of discovered makes it so that you can route defects to the projects (`pyarrow` and `s3fs`) where they belong. Without making `s3fs` at least an optional extra, it probably shouldn't be referenced in the code at all, just advertised as "compatible".
So, if I understand correctly: if we land all the PRs you have put up, that would solve your outstanding problems, correct?
Yes, but we have a private fork we're working off of, so we're not blocked. I'm really interested in the future-proofing idea of adding a `filesystem` parameter, but wanted to document the evidence for it with illustrative PRs.
If you want to merge these and then discuss a refactoring, great, but I'm assuming you don't want to add parameters to the factory methods and then remove them.
I would imagine that `s3_config_kwargs` would solve 90% of user problems (speculating, since I have not worked with S3 myself).
If that's the case, it may be a good idea to land it even if we add a full-blown filesystem object argument later for other use cases (say, filesystems that are currently not supported by petastorm).
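To make the intent concrete, here is a hedged sketch of such a pass-through (the function, defaults, and key names are illustrative only; real code would forward these to the `s3fs`/`botocore` constructors rather than return a dict):

```python
# Hypothetical sketch: forwarding an `s3_config_kwargs` dict from the
# factory method down to the S3 filesystem constructor, instead of
# hard-coding connection settings inside petastorm.

def make_s3_filesystem(s3_config_kwargs=None):
    # Merge user overrides over conservative defaults; the defaults and
    # key names here are invented for illustration.
    defaults = {"retries": 3, "connect_timeout": 60}
    config = dict(defaults, **(s3_config_kwargs or {}))
    # Real code might do: s3fs.S3FileSystem(config_kwargs=config)
    return config

# A user hitting throttling could then tune retries without patching:
print(make_s3_filesystem({"retries": 10, "region_name": "us-east-1"}))
```

That keeps petastorm agnostic to S3 specifics while still covering the common "I just need to tweak one connection setting" cases.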
Makes sense. I'll update that PR with release notes for you to merge.
One of our engineers found an optimization involving disabling ZeroMQ copy buffers in the `ProcessWorker`, but this is not exposed in the top-level factory methods, `make_reader` and `make_batch_reader`. It's useful, and probably should be.
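A minimal sketch of the kind of pass-through meant here (the parameter and class names are hypothetical, not the real petastorm signatures):

```python
# Hypothetical sketch: exposing the ZeroMQ zero-copy toggle through the
# top-level factory method instead of hard-coding it in the worker pool.

class ProcessWorkerStub:
    def __init__(self, zmq_copy_buffers=True):
        # With pyzmq, passing copy=False to send()/recv() avoids an extra
        # buffer copy for large messages: better throughput, in exchange
        # for less predictable memory consumption.
        self.zmq_copy_buffers = zmq_copy_buffers

def make_reader_stub(dataset_url, zmq_copy_buffers=True, **reader_kwargs):
    # Thread the flag from the factory down to the worker.
    return ProcessWorkerStub(zmq_copy_buffers=zmq_copy_buffers)

worker = make_reader_stub("hdfs://host/dataset", zmq_copy_buffers=False)
print(worker.zmq_copy_buffers)
# -> False
```

The default preserves today's behavior; users who accept the memory trade-off opt in explicitly.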