ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core][serve] 1MB latency performance regression #46428

Closed · zcin closed this 2 months ago

zcin commented 4 months ago

What happened + What you expected to happen

Around 6/14, latency for sending a request with 1MB payload through a serve DeploymentHandle increased from ~3.4s to ~4.6s.

[Screenshot, 2024-07-03 3:58 PM: latency graph showing the regression]

From bisecting, https://github.com/ray-project/ray/commit/d729815c4b88232dcb20860ff5ee1e7f871111f4 seems to be the offending commit.

Versions / Dependencies

n/a

Reproduction script

Run `python release/serve_tests/workloads/microbenchmarks.py`.
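The release script itself isn't reproduced here, but the measure-many-requests pattern such a microbenchmark typically follows can be sketched with the standard library alone. The `benchmark` helper and the echo workload below are illustrative stand-ins, not the actual harness in `microbenchmarks.py`:

```python
import statistics
import time

def benchmark(fn, *, num_iters=10, num_warmup=2):
    """Time fn() over several iterations and report mean/stdev latency.

    Hypothetical harness; the real release script may warm up, iterate,
    and aggregate differently.
    """
    for _ in range(num_warmup):
        fn()  # warmup iterations are discarded
    samples = []
    for _ in range(num_iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Stand-in workload: copy a 1MB payload, mimicking the "handle 1mb" case.
payload = b"x" * 1024 * 1024
mean, stdev = benchmark(lambda: bytes(payload))
```

Comparing the reported mean before and after the suspect commit is what surfaced the ~3.4s → ~4.6s jump.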

Issue Severity

None

anyscalesam commented 3 months ago

Talked to @edoakes; no leads yet. @kevin85421 is going to dig into the Ray Serve source. This is important, and we don't want to revert the DAG PR for this either, so we have no choice but to investigate.

Suspicious PR is #45699

jjyao commented 3 months ago

Found that the slowness comes from `Router._resolve_deployment_responses`, which is basically a `pickle.dump` and `pickle.load` of the request arguments. It's unclear how https://github.com/ray-project/ray/pull/45699 affects it: if we run only the `handle 1mb` benchmark, latency is the same before and after that PR. It's only slower if we run the `handle noop` benchmark before it.
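To see why resolving responses costs so much on a 1MB request, the pattern can be sketched in plain `pickle`: walking the whole argument tree via a pickle round-trip means the payload pays full serialization cost even when it contains no responses. `FakeResponse`, `_resolved`, and `resolve_args` below are hypothetical stand-ins; the real `Router._resolve_deployment_responses` and Serve's `DeploymentResponse` handling may differ in detail:

```python
import io
import pickle

class FakeResponse:
    """Hypothetical stand-in for a nested response object to be resolved."""
    def __init__(self, result):
        self.result = result

def _resolved(value):
    # Module-level helper so pickle can reconstruct the substituted value.
    return value

class ResolvingPickler(pickle.Pickler):
    """Traverse the arguments via pickle, swapping response objects for
    their results. The traversal serializes *everything*, so a 1MB
    payload incurs full pickle cost even with zero responses present."""
    def reducer_override(self, obj):
        if isinstance(obj, FakeResponse):
            return (_resolved, (obj.result,))
        return NotImplemented  # fall back to normal pickling

def resolve_args(args):
    buf = io.BytesIO()
    ResolvingPickler(buf, protocol=pickle.HIGHEST_PROTOCOL).dump(args)
    return pickle.loads(buf.getvalue())

args = {"payload": b"x" * 1024 * 1024, "resp": FakeResponse(42)}
resolved = resolve_args(args)
```

This makes the fix below natural: skipping the round-trip entirely avoids serializing the payload twice per request.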

Instead of figuring out why `pickle.dump` is slower, we decided to remove the call to `Router._resolve_deployment_responses` altogether, since it turns out to contribute a large portion of the latency.

jjyao commented 3 months ago

Assigning back to @zcin for tracking the removal of `Router._resolve_deployment_responses`.