modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

PERF: Deal with Ray's poor performance on data larger than 2 GiB on macOS #4714

Open mvashishtha opened 2 years ago

mvashishtha commented 2 years ago

In https://github.com/ray-project/ray/issues/20388, it turned out that the Ray object store performs very poorly on macOS when it's storing more than about 2 GiB. Ray's solution, https://github.com/ray-project/ray/pull/21224, was to limit the size of the object store to 2 GiB on macOS. The result is that Ray spills to disk for data in the range of 2 GiB to 10 GiB, where Modin is supposed to perform much better than pandas.
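For scale, a rough footprint check shows when a frame crosses that cap. This is an illustrative helper (the function name and the int64-only estimate are mine, not Ray's API; real frames carry extra pandas/Arrow overhead):

```python
# 2 GiB, the macOS object-store cap introduced in ray-project/ray#21224.
MACOS_OBJECT_STORE_CAP = 2 * 1024**3

def exceeds_macos_cap(n_rows: int, n_cols: int, itemsize: int = 8) -> bool:
    """Rough footprint check for a table of fixed-width (int64) values."""
    return n_rows * n_cols * itemsize > MACOS_OBJECT_STORE_CAP

# The benchmark frames below are 2**20 x 2**8 int64 values = exactly 2 GiB
# (right at the cap), and the read_csv frame (2**21 x 2**8) is 4 GiB.
print(exceeds_macos_cap(2**20, 2**8))  # False: 2 GiB is not strictly greater
print(exceeds_macos_cap(2**21, 2**8))  # True: 4 GiB exceeds the cap
```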

In #4335, we overrode the 2 GiB limit to start Ray with Modin's usual object store size, but it seems that the slow object store is even worse than spilling to disk (see #4713).
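For anyone reproducing this, the override amounts to asking Ray for a larger store at startup. One way to experiment from the outside is Modin's memory knob (I'm assuming the `MODIN_MEMORY` environment variable here, which takes a size in bytes; treat the exact name as an assumption):

```shell
# Hypothetical override: ask Modin to start Ray with a 10 GiB object store.
export MODIN_MEMORY=$((10 * 1024 * 1024 * 1024))
python benchmark.py
```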

For now I think we should do the following:

cc @modin-project/modin-ray

rkooo567 commented 2 years ago

When disk spilling happens (due to the 2 GiB limit), how slow is it compared to Dask, and to Ray on Windows when spilling is invoked due to high memory usage? Do you have any numbers?

mvashishtha commented 2 years ago

how slow is it compared to Dask, and to Ray on Windows when spilling is invoked due to high memory usage?

@rkooo567 to answer your question, I did some performance testing of 3 common kinds of embarrassingly parallel operations in pandas and Modin. I don't have access to a Windows machine, so I tested on my Mac and on Linux (specs for both machines are in the appendix below). Both machines had eight cores, though the Linux one had twice as much RAM. Still, I think the difference in RAM didn't matter, because these benchmarks weren't maxing out memory usage on the Mac (and there was no spilling on Ray even when using a default object store size of 10 GB).

In each case, spilling to disk on the Mac is either much better than or roughly the same as not spilling, so we should accept Ray's limit on the object store size (#4713). However, Ray with the smaller object store is still slower than Dask: ~1.3x as long on applymap, ~1.4x as long on apply, and ~1.5x as long on read_csv. Linux is normally fastest, sometimes by a lot (e.g. ~50% on read_csv).
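As a sanity check on those multipliers, here are the Mac numbers from the detailed results below, averaging the repeated runs (the pairing of the spilling Ray runs against Dask is my reading of the data):

```python
def ratio(ray_times, dask_times):
    """Average Ray-on-Mac runtime divided by average Dask-on-Mac runtime."""
    return (sum(ray_times) / len(ray_times)) / (sum(dask_times) / len(dask_times))

# applymap: default Ray store (spilled to disk) vs. Dask
print(round(ratio([12.92], [10.26, 10.22]), 1))                        # 1.3
# apply: default Ray store (spilled to disk) vs. Dask
print(round(ratio([3.354, 3.167, 3.484], [2.247, 2.174, 2.642]), 1))   # 1.4
# read_csv: default Ray store vs. Dask
print(round(ratio([21.93, 20.64, 20.42], [14.90, 13.11]), 1))          # 1.5
```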

It would be really nice if Ray could work around the macOS mmap bug and keep a fast object store instead of limiting the object store size and spilling to disk. Otherwise, depending on how Dask does on a wider variety of benchmarks, Modin might need to make Dask the default engine on Macs.
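If Dask does become the preferred engine on Macs, switching is a one-line config change via Modin's documented engine selector:

```shell
# Select Modin's Dask engine instead of Ray for this session.
export MODIN_ENGINE=dask
python benchmark.py
```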

Please let me know whether that assessment makes sense. Detailed results are below.
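The repeated timings reported below (tuples of seconds) come from re-running the scripts; a generic harness for collecting such repeats looks like this (a sketch, not the exact script I used):

```python
import time

def time_repeats(fn, repeats=3):
    """Run fn() `repeats` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Cheap stand-in workload; swap in the applymap/apply/read_csv call.
times = time_repeats(lambda: sum(range(10**5)))
print([f"{t:.4f}" for t in times])
```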

Element-wise apply

Benchmark script

```python
import modin.pandas as pd
import numpy as np
from modin.config import BenchmarkMode
import time

BenchmarkMode.put(True)

# Warm ray up
test_df = pd.DataFrame(np.random.randint(0, 100, size=(2**9, 2**9)))
test_df.applymap(lambda x: x)
del test_df

df = pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8)))
start = time.time()
df.applymap(lambda x: x)
end = time.time()
print(f'apply took: {end - start}')
```

pandas

Mac: 87.25 sec; Linux: 98.91 sec

Ray on Linux

11.98 sec

Mac

- Dask: (10.26, 10.22) sec
- Modin's current Ray object store: 11.98 sec
- default Ray object store (the benchmark script just added `import ray` and `ray.init()` before creating the dataframes): spilled to disk, and the applymap took 12.92 sec

column-wise apply

Benchmark script

```python
import modin.pandas as pd
import numpy as np
from modin.config import BenchmarkMode
import time

BenchmarkMode.put(True)

# Warm ray up
test_df = pd.DataFrame(np.random.randint(0, 100, size=(2**9, 2**9)))
test_df.applymap(lambda x: x)
del test_df

df = pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8)))
start = time.time()
df.apply(lambda col: col)
end = time.time()
print(f'apply took: {end - start}')
```

pandas

Mac: (5.957, 5.706) sec; Linux: 3.488 sec

Ray on Linux

1.655 sec

Mac

- Dask: (2.247, 2.174, 2.642) sec
- Modin's current Ray object store: 17.97 sec
- default Ray object store: spilled to disk; (3.354, 3.167, 3.484) sec

read_csv

Generate CSV script

```python
import modin.pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(2**21, 2**8)))
df.to_csv("2mebi_by_256.csv", index=False)
```

Benchmark script

```python
import modin.pandas as pd
import numpy as np
from modin.config import BenchmarkMode
import time

BenchmarkMode.put(True)

# Warm ray up
test_df = pd.DataFrame(np.random.randint(0, 100, size=(2**9, 2**9)))
test_df.applymap(lambda x: x)
del test_df

start = time.time()
df = pd.read_csv("2mebi_by_256.csv")
end = time.time()
print(f'read_csv took: {end - start}')
```
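For scale, this file comfortably stresses the 2 GiB object-store cap. A back-of-the-envelope estimate (uniform ints in [0, 100) are one or two characters each) puts the CSV around 1.45 GiB on disk and the parsed int64 frame at 4 GiB in memory:

```python
rows, cols = 2**21, 2**8
avg_value_chars = (10 * 1 + 90 * 2) / 100        # digits per uniform int in [0, 100)
csv_bytes = rows * cols * (avg_value_chars + 1)  # +1 per value for comma/newline
parsed_bytes = rows * cols * 8                   # int64 once parsed
print(f"CSV: ~{csv_bytes / 1024**3:.2f} GiB; parsed: {parsed_bytes / 1024**3:.0f} GiB")
```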

pandas

Mac: (31.37, 33.36) sec; Linux: 40.57 sec

Ray on Linux

9.371 sec (with spilling to disk)

Mac

- Dask: (14.90, 13.11) sec
- Modin's current Ray object store: 29.35 sec
- default Ray object store: (21.93, 20.64, 20.42) sec

Appendix: specs

Mac specs:

The Ubuntu EC2 instance: