uber / petastorm

Petastorm is a library that enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Expose the flag to disable ØMQ copy buffers #595

Closed dmcguire81 closed 4 years ago

dmcguire81 commented 4 years ago

One of our engineers found an optimization involving disabling ZeroMQ copy buffers in the ProcessWorker, but this is not exposed via the top-level factory methods, make_reader and make_batch_reader. It's useful, and it probably should be.
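A minimal sketch of the requested change, assuming the pool-level flag works as described in the thread; the factory signature and the stand-in classes here are illustrative, not petastorm's actual API:

```python
# Sketch: thread a zmq_copy_buffers flag from the top-level factory
# down to the worker pool. Names mirror the thread's description but
# are hypothetical.

class ProcessPool:
    """Stand-in for petastorm.workers_pool.process_pool.ProcessPool."""

    def __init__(self, workers_count=10, zmq_copy_buffers=True):
        # When False, workers would send payloads with ZeroMQ zero-copy,
        # trading predictable memory consumption for better throughput.
        self.workers_count = workers_count
        self.zmq_copy_buffers = zmq_copy_buffers


def make_reader(dataset_url, workers_count=10, zmq_copy_buffers=True):
    """Top-level factory that forwards the flag instead of relying on
    the pool's hard-coded default."""
    pool = ProcessPool(workers_count, zmq_copy_buffers=zmq_copy_buffers)
    return pool  # the real factory would wrap this pool in a Reader


# Opt out of copy buffers for higher throughput:
reader_pool = make_reader('hdfs:///some/dataset', zmq_copy_buffers=False)
```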

selitvin commented 4 years ago

Yep, agreed. Can you please elaborate on what the issue is?

dmcguire81 commented 4 years ago

There's nothing new to report: better throughput, in exchange for less predictable memory consumption, is a good trade-off for us, and probably for others, so it should be available in the factory methods.

dmcguire81 commented 4 years ago

While we're talking, though, it seems like the factory methods only work seamlessly on whatever storage Uber happens to use internally (guessing HDFS, given that libhdfs is the only configuration given consistently elevated visibility). The experience on S3 is pretty bad.

dmcguire81 commented 4 years ago

We're considering dropping down to the petastorm.reader.Reader interface to deal with having to patch s3fs and pyarrow for their respective shortcomings (or patch this code base to work around them). Before doing that, though, we wanted your take on refactoring the factory methods themselves, since it seems like a losing battle to offer all the flexibility users will need while still doing all the instantiation yourself.

selitvin commented 4 years ago

Your diagnosis is correct. We did not use petastorm with S3 internally, hence this is not a polished feature. I can help if you can point me to the issues you ran into. I did verify basic functionality, but not more than that.

selitvin commented 4 years ago

I saw that you created #594. If this is not sufficient (curious what else we would need), perhaps we can have make_reader/make_batch_reader take an instance of an already constructed filesystem object?

dmcguire81 commented 4 years ago

We built up a bit of a backlog while waiting on legal approval to submit PRs back, but they're all in now, so you can see everything that's been needed so far (except for s3fs==0.5.0 throwing TypeError: 'coroutine' object is not iterable from make_batch_reader, which I just found today, by accident, while trying to build you a repro case).

I think passing the pyarrow.Filesystem is a great option, since the rest of the hidden code (e.g., instances of petastorm.worker_pool.*) is under your direct control, so it won't be as much of a maintenance burden to keep "under the hood".
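A hedged sketch of the proposed refactoring: the factory accepts an already constructed filesystem object and only falls back to URL-based discovery when none is given. The `filesystem` parameter, the `_filesystem_from_url` helper, and the stand-in Reader are all hypothetical names, not petastorm's actual signatures:

```python
class Reader:
    """Stand-in for petastorm.reader.Reader."""

    def __init__(self, pyarrow_filesystem, dataset_path):
        self.filesystem = pyarrow_filesystem
        self.dataset_path = dataset_path


def _filesystem_from_url(dataset_url):
    # Today's behavior: discover a filesystem from the URL scheme
    # (hdfs://, s3://, file://, ...); elided here.
    raise NotImplementedError


def make_batch_reader(dataset_url, filesystem=None):
    """Factory that prefers a caller-supplied filesystem object."""
    if filesystem is None:
        filesystem = _filesystem_from_url(dataset_url)
    return Reader(filesystem, dataset_url)
```

With this shape, callers construct and configure (or patch) the filesystem themselves, e.g. an s3fs-backed pyarrow filesystem, and the factory never needs to grow storage-specific knobs.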

dmcguire81 commented 4 years ago

#590 was actually the real killer, but together these address every problem we've seen so far, except this one, so we could probably get it all done in two PRs.

dmcguire81 commented 4 years ago

For your part, having the filesystem be parameterized instead of discovered makes it possible to route defects to the projects (pyarrow and s3fs) where they belong. Without making s3fs at least an optional extra, it probably shouldn't be referenced in the code base at all, just advertised as "compatible".

selitvin commented 4 years ago

So, if I understand correctly: if we land all the PRs you have put up, your outstanding problems are solved, correct?

dmcguire81 commented 4 years ago

Yes, but we have a private fork we're working from, so we're not blocked. I'm really interested in the future-proofing idea of adding a filesystem parameter, but I wanted to document the evidence for it with illustrative PRs first.

If you want to merge these and then discuss a refactoring, great, but I'm assuming you don't want to add parameters to the factory methods and then remove them.

selitvin commented 4 years ago

I would imagine that s3_config_kwargs would solve 90% of user problems (speculating, since I have not worked with S3 myself). If that's the case, it may be a good idea to land it even if we later add a full-blown filesystem object argument for other use cases (say, filesystems that are currently not supported by petastorm).

dmcguire81 commented 4 years ago

Makes sense. I'll update that PR with release notes for you to merge.