ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319

Open stephanie-wang opened 1 month ago

stephanie-wang commented 1 month ago

What happened + What you expected to happen

Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other microbenchmarks that @jackhumphries ran for channel performance, showing low tens of us per channel op.
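For context, here is a minimal sketch of the kind of single-actor microbenchmark described, using the experimental compiled-DAG API. The actor, payload, and iteration count are illustrative, not the actual benchmark code:

```python
import time
import ray
from ray.dag import InputNode

@ray.remote
class EchoActor:
    def echo(self, x):
        return x

ray.init()
actor = EchoActor.remote()

# Build a one-node DAG and compile it so calls go through
# shared-memory channels instead of the regular task path.
with InputNode() as inp:
    dag = actor.echo.bind(inp)
compiled = dag.experimental_compile()

n = 10_000
start = time.perf_counter()
for _ in range(n):
    ray.get(compiled.execute(b"x"))
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} calls/s, {elapsed / n * 1e6:.1f} us/call")
```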

However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. This translates to roughly 130us of DAG overhead per call, more than 4x the ~30us expected.
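For comparison, a sketch of what the NCCL-path benchmark looks like, assuming the `TorchTensorType` type hint for GPU-to-GPU channels that the experimental API exposed at the time (the actor layout and tensor shape are illustrative):

```python
import torch
import ray
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

@ray.remote(num_gpus=1)
class GPUActor:
    def send(self, _):
        # Produce a small tensor on the GPU; shape is illustrative.
        return torch.zeros(16, device="cuda")

    def recv(self, t):
        return t.shape

ray.init()
sender, receiver = GPUActor.remote(), GPUActor.remote()

# Hint that the intermediate tensor should travel over NCCL
# rather than through the shared-memory object store.
with InputNode() as inp:
    t = sender.send.bind(inp)
    t = t.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(t)
compiled = dag.experimental_compile()
ray.get(compiled.execute(None))
```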

Versions / Dependencies

3.0dev

Reproduction script

See linked microbenchmarks.

Issue Severity

None

anyscalesam commented 1 month ago

Discussed during standup today. The goal is to improve NCCL performance so the overhead is within 50% of expected, versus the current 4x. This may also help vLLM performance (the exact impact is uncertain), drawing it closer as well.

anyscalesam commented 3 weeks ago

Environment setup is done; about to dive into debugging the overhead.

anyscalesam commented 6 days ago

There are actually two things: one is the perf regression in our existing perf dashboard; the other is the overhead around our serializer (cloudpickle). The first one we should fix for Developer Preview.
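One generic way to check whether serializer frames account for the per-call time (a sketch, not the actual debugging setup used here) is to profile a batch of compiled-DAG calls with cProfile and filter the stats for pickle frames; `compiled` is assumed to come from a benchmark like the sketches above:

```python
import cProfile
import pstats
import ray

def bench(n=1_000):
    # Drive the compiled DAG in a tight loop so that serializer
    # frames dominate the profile if they are a real overhead.
    for _ in range(n):
        ray.get(compiled.execute(b"x"))

cProfile.run("bench()", "dag.prof")
stats = pstats.Stats("dag.prof").sort_stats("cumulative")
stats.print_stats("pickle")  # show only cloudpickle/pickle frames
```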

stephanie-wang commented 5 days ago

#45953 has been merged, which fixes the P0 regression. The remaining overhead has been tracked down to Python serialization of the tensor metadata, which we should be able to optimize later.
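The metadata-serialization cost is easy to measure in isolation. A minimal sketch, where the metadata fields are illustrative stand-ins rather than Ray's actual wire format:

```python
import timeit
import cloudpickle

# Stand-in for the per-tensor metadata a channel op might serialize.
meta = {"shape": (1024, 1024), "dtype": "float16", "device": "cuda:0"}

n = 100_000
t = timeit.timeit(lambda: cloudpickle.loads(cloudpickle.dumps(meta)), number=n)
print(f"{t / n * 1e6:.2f} us per dumps+loads round trip")
```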