Open mvashishtha opened 2 years ago
When disk spilling happens (due to the 2 GB limit), how slow is it compared to Dask and Ray on Windows when the spilling is invoked due to high memory usage? Do you have any numbers?
@rkooo567 to answer your question, I did some performance testing of 3 common kinds of embarrassingly parallel operations in pandas and Modin. I don't have access to a Windows machine, so I tested on my Mac and on Linux (specs for both machines below). Both machines had eight cores, though the Linux one had twice as much RAM. Still, I think the difference in RAM didn't matter, because these benchmarks weren't maxing out memory usage on the Mac, and there was no spilling on Ray even when using a default object store size of 10 GB.
In each case, spilling to disk on Mac is either much better or roughly the same as not spilling to disk, so we should accept the Ray limit on the object store size (#4713). However, Ray with the smaller object store is slightly slower than Dask on the `applymap` (~1.3x as long), about 1.4x as long as Dask on `apply`, and about 1.5x as long as Dask on `read_csv`. Linux is normally fastest, sometimes by a lot (e.g. 50% in `read_csv`).
It would be really nice if Ray could work around the macOS mmap bug and have a faster object store instead of limiting the object store size and spilling to disk. Otherwise, depending on how Dask does on a wider variety of benchmarks, Modin might need to make Dask the default engine on Macs.
Please let me know whether that assessment makes sense. Detailed results are below.
Mac: 87.25 sec Linux: 98.91 sec
11.98 sec
(10.26, 10.22) sec
11.98
The benchmark script just added `import ray` and `ray.init()` before creating the dataframes.
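For anyone who wants to reproduce this kind of measurement, here is a rough sketch of such a script. It is not the script used for the numbers in this comment: the helper names `time_runs` and `run_applymap_benchmark`, the data shape, the mapped function, and the `.to_numpy()` call used to force the result to be computed are all assumptions made for illustration.

```python
import time


def time_runs(fn, repeats=3):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def run_applymap_benchmark(n_rows=1_000, n_cols=10, repeats=2):
    """Hypothetical applymap benchmark; needs numpy, ray and modin installed."""
    # Local imports so that time_runs() above stays usable without Ray or Modin.
    import numpy as np
    import ray
    import modin.pandas as pd

    # Start Ray explicitly before creating the dataframes.
    ray.init()
    df = pd.DataFrame(np.random.randint(0, 100, size=(n_rows, n_cols)))

    # Assumed workload: the function and data size are placeholders.
    # .to_numpy() is used here only to make sure the result is materialized
    # inside the timed region.
    return time_runs(lambda: df.applymap(lambda x: x + 1).to_numpy(), repeats)
```

Calling `run_applymap_benchmark()` returns one duration per repeat, which is the shape of the tuples reported in the results (for example two or three timings for the same configuration).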
I got spilling to disk and the `applymap` was 12.92 sec.
Mac: (5.957 sec, 5.706 sec) Linux: 3.488 sec
1.655 sec
(2.247, 2.174, 2.642) sec
17.97 sec
spill to disk and get (3.354 sec, 3.167 sec, 3.484 sec)
mac: (31.37, 33.36) sec linux: 40.57
9.371 sec (with spilling to disk)
(14.90, 13.11) sec
29.35 sec
(21.93, 20.64, 20.42) sec
Mac specs:
`modin.__version__`: cc713c5c9055f717ac891cd1fccbd08987e169cd
The Ubuntu EC2 instance:
In https://github.com/ray-project/ray/issues/20388, it turned out that the Ray object store performs very poorly on Mac when it's storing more than about 2 GiB. Ray's solution, https://github.com/ray-project/ray/pull/21224, was to limit the size of the object store to 2 GiB. The result is that ray spills to disk for data in the range of 2 GiB to 10 GiB, where Modin is supposed to perform much better than pandas.
In #4335, we overrode the 2 GiB limit to start Ray with Modin's usual object store size, but it seems that the slow object store is even worse than spilling to disk (see #4713).
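As a minimal sketch of what choosing the object store size looks like from user code, the snippet below passes an explicit size to Ray through the `object_store_memory` argument of `ray.init()`, which takes a value in bytes. The helpers `gib_to_bytes` and `start_ray_with_object_store` are hypothetical names for this example, and this is not how Modin itself configures Ray internally.

```python
def gib_to_bytes(gib):
    """Convert a size in GiB to bytes."""
    return int(gib * 1024 ** 3)


def start_ray_with_object_store(gib):
    """Start Ray with an explicit object store size (hypothetical helper)."""
    # Local import so gib_to_bytes() can be used without Ray installed.
    import ray

    # object_store_memory is given in bytes.
    return ray.init(object_store_memory=gib_to_bytes(gib))
```

With this, `start_ray_with_object_store(2)` requests the 2 GiB store discussed above, while a larger value such as `start_ray_with_object_store(10)` corresponds to overriding the limit, which is the configuration that turned out to be slow on Mac.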
For now I think we should do the following:
cc @modin-project/modin-ray