Open · krfricke opened this issue 2 years ago
These benchmarks were run on a 2019 16-inch MacBook. This might be related to object spilling(?); we should run this on beefier instances to confirm (I can do that in a bit).
Is the same issue reproducible on Linux?
Also cc @scv119 @stephanie-wang (can you help find the owner for this work?)
Benchmarks on an m5.8xlarge instance (128 GB RAM, Linux):
Mean (SD) over 10 runs:

| Size | Time in seconds, mean (SD) |
| --- | --- |
| 0.25 GB | 1.19 (0.12) |
| 0.50 GB | 2.19 (0.01) |
| 1.00 GB | 4.46 (0.12) |
| 1.50 GB | 6.61 (0.13) |
| 2.00 GB | 8.97 (0.32) |
| 3.00 GB | 13.20 (0.24) |
| 4.00 GB | 14.49 (0.30) |
| 8.00 GB | 31.76 (2.74) |
| 16.00 GB | 114.93 (27.13) |
| 24.00 GB | 230.28 (33.94) |
Usage:

```
0.0/32.0 CPU
0.00/77.348 GiB memory
0.00/37.140 GiB object_store_memory
```
So this seems to be related to memory/object store size (might be spilling).
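One way to sanity-check the spilling hypothesis (a minimal sketch; `object_store_memory` is an existing `ray.init` argument, but the payload and store sizes here are arbitrary) is to pin the object store size, force spilling by holding more data than fits, and compare access times:

```python
import os
import time

import ray

# Pin the object store to 4 GiB, then put ~6 GiB of data while holding all
# references. Ray should spill older objects to disk (the driver log prints
# "Spilled ..." messages), and getting a spilled object should be noticeably
# slower than getting one that is still in shared memory.
ray.init(object_store_memory=4 * 1024**3)

refs = [ray.put(os.urandom(1024**3)) for _ in range(6)]  # 6 x 1 GiB

start = time.monotonic()
ray.get(refs[0])  # oldest object, most likely spilled and restored from disk
print(f"get of (probably) spilled object: {time.monotonic() - start:.2f} s")

start = time.monotonic()
ray.get(refs[-1])  # newest object, most likely still in shared memory
print(f"get of in-memory object: {time.monotonic() - start:.2f} s")
```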
Adjusted benchmark script for multiple runs:
```python
import os
import shutil
import time
from collections import defaultdict

import numpy as np

import ray
from ray.ml import Checkpoint


def create_large_checkpoint(size_in_gb: float) -> str:
    path = f"/tmp/checkpoint_large_{size_in_gb:.2f}_gb"
    if os.path.exists(path):
        # print(f"Return existing checkpoint in {path}")
        return path
    # print(f"Creating checkpoint in {path}")
    os.makedirs(path)
    with open(os.path.join(path, "checkpoint.bin"), "wb") as fp:
        fp.write(os.urandom(int(size_in_gb * 1024 * 1024 * 1024)))
    return path


ray.init()

repeats = 5
results = defaultdict(list)

for checkpoint_size in [32.0]:
    for i in range(repeats):
        checkpoint_path = create_large_checkpoint(size_in_gb=checkpoint_size)
        shutil.rmtree("/tmp/checkpoint_benchmark_output", ignore_errors=True)

        start_time = time.monotonic()
        Checkpoint.from_object_ref(
            Checkpoint.from_directory(checkpoint_path).to_object_ref()
        ).to_directory("/tmp/checkpoint_benchmark_output")
        time_taken = time.monotonic() - start_time

        results[checkpoint_size].append(time_taken)
        print(f"Checkpoint ser/de for {checkpoint_size:.2f} GB took {time_taken:.2f} seconds.")

    shutil.rmtree(f"/tmp/checkpoint_large_{checkpoint_size:.2f}_gb")
    print(f"{checkpoint_size:.2f} GB: {list(results[checkpoint_size])}")

print("Raw")
for size, res in results.items():
    print(f"{size:.2f} GB: {list(res)}")

print("Summary")
for size, res in results.items():
    arr = np.array(res)
    print(f"{size:.2f} GB: {np.mean(arr):.2f} ({np.std(arr):.2f})")
```
We should investigate where it slows down and decide what to do next.
This seems to be happening not only on macOS but also on Linux:
| Size | Time in seconds, mean (SD) |
| --- | --- |
| 0.25 GB | 1.19 (0.12) |
| 0.50 GB | 2.19 (0.01) |
| 1.00 GB | 4.46 (0.12) |
| 1.50 GB | 6.61 (0.13) |
| 2.00 GB | 8.97 (0.32) |
| 3.00 GB | 13.20 (0.24) |
| 4.00 GB | 14.49 (0.30) |
| 8.00 GB | 31.76 (2.74) |
| 16.00 GB | 114.93 (27.13) |
| 24.00 GB | 230.28 (33.94) |
Basically, it's not scaling up linearly.
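To narrow down whether this is the object store round trip itself rather than the `Checkpoint` directory packing/unpacking and disk writes, a rough sweep over plain byte payloads (a sketch; the sizes are arbitrary) could be compared against the numbers above:

```python
import os
import time

import ray

ray.init()

# Time ray.put + ray.get of plain byte strings of increasing size, with no
# Checkpoint packing/unpacking and no output directory writes, to check whether
# the object store round trip alone scales superlinearly.
for size_gb in [0.25, 0.5, 1, 2, 4, 8]:
    payload = os.urandom(int(size_gb * 1024**3))
    start = time.monotonic()
    ray.get(ray.put(payload))
    print(f"{size_gb:.2f} GB: {time.monotonic() - start:.2f} s")
    del payload
```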
I benchmarked the parallel copy and got the following metrics:

| Size | Time |
| --- | --- |
| 1 GB | 111 ms |
| 2 GB | 222 ms |
| 3 GB | 997 ms |
| 4 GB | 5432 ms |
| 5 GB | 5748 ms |
| 6 GB | 6989 ms |
I think it's highly related to parallel copy. Let me see how to fix this.
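For context, the plasma copy path lives in C++, but a rough Python stand-in for a multi-threaded memcpy (thread count, chunking, and sizes are all arbitrary here) can show whether raw parallel copies on the same machine hit a similar knee, e.g. once destination pages have to be faulted in:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

NUM_THREADS = 8  # arbitrary; roughly "one contiguous slice per thread"


def parallel_copy(src: np.ndarray, dst: np.ndarray) -> float:
    """Copy src into dst using NUM_THREADS contiguous slices; return seconds."""
    chunk = len(src) // NUM_THREADS

    def copy_slice(i: int) -> None:
        start = i * chunk
        end = len(src) if i == NUM_THREADS - 1 else (i + 1) * chunk
        dst[start:end] = src[start:end]  # NumPy releases the GIL for large copies

    t0 = time.monotonic()
    with ThreadPoolExecutor(NUM_THREADS) as pool:
        list(pool.map(copy_slice, range(NUM_THREADS)))
    return time.monotonic() - t0


for size_gb in [1, 2, 4]:
    n = size_gb * 1024**3
    src = np.ones(n, dtype=np.uint8)   # touch the source pages up front
    dst = np.empty(n, dtype=np.uint8)  # fresh allocation; pages typically fault in during the copy
    print(f"{size_gb} GB: {parallel_copy(src, dst) * 1000:.0f} ms")
```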
If it is related to page faults and swapping to disk, maybe this could be helpful: https://man7.org/linux/man-pages/man2/memfd_create.2.html. There is no macOS equivalent, though.
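For reference, that syscall is exposed in Python (3.8+, Linux only) as `os.memfd_create`. A minimal sketch of the primitive itself, just to show what it provides (this is not how Ray's plasma store currently allocates its memory):

```python
import mmap
import os

# memfd_create(2): an anonymous, memory-backed file descriptor that can be sized
# with ftruncate and mmap'ed like a regular file, without touching the filesystem.
fd = os.memfd_create("demo_buffer")
size = 64 * 1024 * 1024  # 64 MiB, arbitrary
os.ftruncate(fd, size)

with mmap.mmap(fd, size) as buf:  # shared read/write mapping by default
    buf[:5] = b"hello"
    print(bytes(buf[:5]))

os.close(fd)
```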
Per Triage Sync: @scv119 can you please clarify the current status on this?
Just put the code into a `ray.remote` task:
```python
from __future__ import annotations

import os
import shutil
import time
from collections import defaultdict

import numpy as np

import ray
from ray.train import Checkpoint

ray.init()


def create_large_checkpoint(size_in_gb: float) -> str:
    path = f"/tmp/checkpoint_large_{size_in_gb:.2f}_gb"
    if os.path.exists(path):
        return path
    os.makedirs(path)
    with open(os.path.join(path, "checkpoint.bin"), "wb") as fp:
        fp.write(os.urandom(int(size_in_gb * 1024 * 1024 * 1024)))
    return path


@ray.remote
def benchmark(checkpoint_size: float, repeats: int) -> list[float]:
    checkpoint_path = create_large_checkpoint(checkpoint_size)
    times_taken: list[float] = []
    for _ in range(repeats):
        shutil.rmtree("/tmp/checkpoint_benchmark_output", ignore_errors=True)

        start_time = time.monotonic()
        checkpoint = ray.get(ray.put(Checkpoint.from_directory(checkpoint_path)))
        checkpoint.to_directory("/tmp/checkpoint_benchmark_output")
        times_taken.append(time.monotonic() - start_time)

    shutil.rmtree(checkpoint_path)
    return times_taken


repeats = 5
results = defaultdict(list)

for checkpoint_size in [0.25, 0.5, 1, 1.5, 2, 3, 4, 8]:
    results[checkpoint_size] = ray.get(benchmark.remote(checkpoint_size, repeats))
    print(f"{checkpoint_size:.2f} GB: {results[checkpoint_size]}")

print("Raw")
for size, res in results.items():
    print(f"{size:.2f} GB: {list(res)}")

print("Summary")
for size, res in results.items():
    arr = np.array(res)
    print(f"{size:.2f} GB: {np.mean(arr):.2f} ({np.std(arr):.2f})")
```
### What happened + What you expected to happen
See https://github.com/ray-project/ray/pull/25177
Sending large binary strings over the Ray object store scales superlinearly. Benchmarks:
We would expect this to scale linearly (i.e., 2 GB should take 10 seconds and 4 GB should take 20 seconds).
These benchmarks were run on a 2019 16-inch MacBook. This might be related to object spilling(?); we should run this on beefier instances to confirm (I can do that in a bit).
In Ray Tune, we currently avoid this by chunking the data and streaming it via an actor: https://github.com/ray-project/ray/blob/master/python/ray/tune/utils/file_transfer.py#L280
We could do the same chunking in Ray AIR as well; however, I think it would be better to resolve this on the core side rather than at the library level (a rough sketch of the chunked streaming approach is included after the list below).
~Thus, a few ideas:~

~1. Ray core support for generators (see @stephanie-wang's proposal) and maybe a utility for chunking data~
~2. Streaming support out of the box (automatic chunking/resolving of generators)~

~For 2, we could automatically adjust to the available object store size, so that we could potentially transfer objects that are larger than the object store.~
(this doesn't really work as the data remains in the object store)
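For reference, the chunked streaming approach mentioned above looks roughly like this (a minimal sketch with made-up names and an arbitrary chunk size, not the actual `ray.tune.utils.file_transfer` implementation):

```python
import ray

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per chunk, arbitrary for this sketch


@ray.remote
class FileReceiver:
    """Appends received chunks to a file on the receiving side."""

    def __init__(self, target_path: str):
        self._fp = open(target_path, "wb")

    def write_chunk(self, chunk: bytes) -> None:
        self._fp.write(chunk)

    def close(self) -> None:
        self._fp.close()


def stream_file(source_path: str, target_path: str) -> None:
    """Copy a file through the object store one chunk at a time.

    Only one chunk needs to fit in the object store at once; a real
    implementation could keep several chunks in flight instead of
    calling ray.get after every write.
    """
    receiver = FileReceiver.remote(target_path)
    with open(source_path, "rb") as fp:
        while chunk := fp.read(CHUNK_SIZE):
            ray.get(receiver.write_chunk.remote(chunk))
    ray.get(receiver.close.remote())


if __name__ == "__main__":
    ray.init()
    stream_file("/tmp/source.bin", "/tmp/destination.bin")  # hypothetical paths
```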
### Versions / Dependencies
Latest master
### Reproduction script
### Issue Severity
Medium: It is a significant difficulty but I can work around it.