richardliaw opened this issue 4 years ago
cc @clarkzinzow
Thanks for raising this issue @richardliaw.
For more context @clarkzinzow, I'm finding that even with a dataset 1/12 the size (157M), this error is occurring when attempting to write to Parquet (but not, interestingly, when calling compute() with the Ray scheduler). So it appears to be something related to the serialization / Parquet writing procedure when using the Ray scheduler.
Let me know if you need more info from me. This is currently a blocker for me, so happy to help in any way I can.
I take that back, I'm actually seeing this for compute() calls with Ray as well. Will keep digging into this to find a simpler repro for you.
In my attempt to reproduce, I'm getting a lot of segfaults, not sure what the cause is yet. I'll keep investigating.
Let me know if you are able to create a simpler reproduction, that might help me see through a lot of the noise here.
Hmm interesting, computing this without using the Dask-on-Ray scheduler crashed my dev VM! Maybe there's an underlying Dask or Parquet issue at work here. Are you able to run this workload to completion with a non-Ray scheduler? How large of a machine are you running this on?
Yes, I can run without the Ray scheduler. I'm using a MacBook Pro with 16GB of RAM.
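For reference, switching the same Dask graph between the default scheduler and the Dask-on-Ray scheduler is only a configuration change. A minimal sketch; the dataframe below is a stand-in, not the actual workload from this issue:
import dask.dataframe as dd
import pandas as pd
import ray
from ray.util.dask import ray_dask_get

ddf = dd.from_pandas(pd.DataFrame({"x": range(10_000)}), npartitions=8)

# Default Dask scheduler: can use the whole machine's memory.
local_result = ddf.x.sum().compute()

# Dask-on-Ray scheduler: tasks run as Ray tasks, and intermediate results
# are held in Ray's object store, which has its own memory limit.
ray.init()
ray_result = ddf.x.sum().compute(scheduler=ray_dask_get)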
I'm able to run this script on an MBP with 64 GB of RAM. I can reliably reproduce the failure with:
GB = 1024 ** 3
ray.init(_memory=4*GB, object_store_memory=2*GB)
This produces a ray.cluster_resources() of:
{'CPU': 16.0, 'memory': 81.0, 'object_store_memory': 28.0, 'node:192.168.1.115': 1.0}
Which is similar to Travis' output of:
{'object_store_memory': 33.0, 'CPU': 12.0, 'node:192.168.4.54': 1.0, 'memory': 97.0}
@tgaddair @richardliaw, which versions of Ray, Dask, and Pandas are y'all on?
ray==1.0.0
dask==2.28.0
pandas==1.1.2
Thanks @clarkzinzow for continuing to look into this!
ray==master
dask==2.28
pandas==1.0.5
thanks a bunch @clarkzinzow !
I'm still working on getting to the bottom of the segfaults on Parquet table writing that I keep hitting, which is concerning in and of itself. I'll try to poke at this a bit more tomorrow.
I'm assuming that both of y'all have fastparquet installed, and that Dask is using fastparquet and not pyarrow as the Parquet table writer? I get consistent segfaults with pyarrow as the Parquet I/O implementation, and the object store fills up with fastparquet.
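For anyone following along: Dask auto-detects a Parquet engine based on what's installed, and you can pin one explicitly. A minimal sketch, with a toy dataframe and made-up output paths standing in for the real dataset:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"a": range(1000)}), npartitions=4)

# Force the fastparquet writer instead of letting Dask auto-detect one.
ddf.to_parquet("out_fastparquet/", engine="fastparquet")

# Or force pyarrow, the engine that was segfaulting above.
ddf.to_parquet("out_pyarrow/", engine="pyarrow")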
@clarkzinzow nope didn't know that was a thing! trying it out now.
@richardliaw Interesting, not sure why the pyarrow Parquet implementation is segfaulting for me and not you. What version of pyarrow are you using?
pyarrow 1.0.1
though maybe we bundle it internally?
Not anymore! 😁
ah nice! Let me know what other information you need to help debug your segfault.
I'll probably end up opening a separate issue for the segfaults since it seems to be specific to using the pyarrow Parquet writer on Linux. Given that I have a workaround (using the fastparquet writer instead), I'm going to ignore the segfaults and keep investigating the object store OOMs so we can hopefully unblock @tgaddair's work.
Hey @clarkzinzow, sorry for the delayed response. I'm actually using pyarrow; I don't have fastparquet installed. In my case, we use pyarrow to read the data after Dask writes it, so we may need to stick with pyarrow. I'm using pyarrow 2.0.0, by the way.
@tgaddair Sweet, thanks! And if pyarrow isn't segfaulting for you, then there's no need to switch to fastparquet! Given that the OOMs are happening even without writing to Parquet files (just computing dataset), I'm considering it safe for me to debug the bloated object store while using fastparquet as the writer, with the hope that the solution will transfer.
So, despite the source dataset being only 2GiB, I think that the working set for this load + repartition + merge is fundamentally large due to Dask's partitioned dataframe implementation (not due to the implementation of the Dask-on-Ray scheduler), and that this results in a failure when using the Ray scheduler because the working set is limited to the allocated object store memory instead of the memory capacity of the entire machine. Running this sans the Ray scheduler yields a peak memory utilization of 20GiB, a lot of which is a working set of intermediate task outputs that must be materialized in memory concurrently. If you have allocated less than 6GiB to the object store, then you can expect to see OOMs since that object store memory budget can't meet the peak demand.
Until the underlying bloated memory utilization is fixed, you could try allocating e.g. 14GiB of memory to Ray and 7GiB to the object store in particular via
GB = 1024 ** 3
ray.init(_memory=14*GB, object_store_memory=7*GB)
to see if you have better luck, or try running this on a cloud instance or a set of cloud instances with more memory available.
If I'm able to confirm that there are some unnecessary dataframe copies happening in Dask's dataframe implementation, then I'll open an issue on the Dask repo, and I'll double-check to make sure that the object store is freeing unused intermediate task outputs at the expected times.
@clarkzinzow thanks a bunch for this thorough evaluation!
So from what I understand, the same memory utilization occurs under both schedulers, but the difference is that Dask does not enforce memory limits while Ray does. Is that correct?
@richardliaw That's my current theory, yes! I do want to find out why the working set is so large during repartitioning and merging, and whether intermediate objects are being garbage collected in a timely manner (which I currently think they are, just want to double-check). Hopefully I'll have some time tomorrow to dig further and open up some issues.
What is the problem?
We get an OOM even though the working set is much smaller than expected.
Reproduction (REQUIRED)
Using the Netflix Prize dataset: https://www.kaggle.com/netflix-inc/netflix-prize-data
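The exact repro script isn't included here; below is a minimal sketch of the kind of load + repartition + merge workload discussed above, run on the Dask-on-Ray scheduler (file names, partition counts, and the merge key are assumptions):
import dask
import dask.dataframe as dd
import ray
from ray.util.dask import ray_dask_get

GB = 1024 ** 3
ray.init(_memory=14 * GB, object_store_memory=7 * GB)
dask.config.set(scheduler=ray_dask_get)  # route all Dask tasks through Ray

# Hypothetical input layout for the Netflix Prize data.
ratings = dd.read_csv("ratings-*.csv")      # e.g. user_id, movie_id, rating, date
movies = dd.read_csv("movie_titles.csv")    # e.g. movie_id, year, title

dataset = ratings.repartition(npartitions=64).merge(movies, on="movie_id")
dataset.to_parquet("netflix_parquet/", engine="fastparquet")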
cc @tgaddair