ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Backpressure of gRPC server never worked. #25289

Open lixin-wei opened 2 years ago

lixin-wei commented 2 years ago

We just found that the backpressure of the gRPC server never worked.

This is because even if we don't ask for a new request in the server, the server still keeps reading data from the network in cq_->Next(&tag, &ok) (when there are many requests pending), causing huge memory usage.

https://github.com/ray-project/ray/blob/d95009a3ac44a9ee2844964b31fa25f38d083388/src/ray/rpc/grpc_server.cc#L148

I think this is an issue in gRPC itself, so I have submitted it there.
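The behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not gRPC or Ray code: `SimulateServer`, `SimResult`, and all parameters are hypothetical names, and the model simply assumes that the transport buffers every incoming call whether or not the application has a free handler slot. It shows why capping the handler slots bounds the number of active handlers but not the number of buffered calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model only (hypothetical names, no gRPC involved).
struct SimResult {
  std::size_t peak_active_handlers;  // what the server-side limit controls
  std::size_t peak_buffered_calls;   // what actually drives memory usage
};

// Runs `ticks` time steps. In each step:
//   1. `arrivals_per_tick` calls arrive and are buffered unconditionally
//      (assumption: the transport reads from the network regardless of slots),
//   2. free handler slots pick up buffered calls,
//   3. up to `completions_per_tick` active handlers finish.
SimResult SimulateServer(std::size_t ticks, std::size_t arrivals_per_tick,
                         std::size_t max_active_handlers,
                         std::size_t completions_per_tick) {
  assert(max_active_handlers >= 1);
  std::deque<std::size_t> buffered;  // ids of calls read but not yet handled
  std::size_t active = 0;
  std::size_t next_id = 0;
  SimResult result{0, 0};

  for (std::size_t t = 0; t < ticks; ++t) {
    // Step 1: the transport keeps reading, so every arrival is buffered.
    for (std::size_t i = 0; i < arrivals_per_tick; ++i) {
      buffered.push_back(next_id++);
    }
    // Step 2: only calls with a free slot reach a handler.
    while (active < max_active_handlers && !buffered.empty()) {
      buffered.pop_front();
      ++active;
    }
    result.peak_active_handlers = std::max(result.peak_active_handlers, active);
    result.peak_buffered_calls =
        std::max(result.peak_buffered_calls, buffered.size());
    // Step 3: some handlers complete and free their slots.
    active -= std::min(active, completions_per_tick);
  }
  return result;
}
```

In this model, when calls arrive faster than handlers complete, the buffered count grows with every step while the active-handler count stays at its cap, which matches the memory growth reported in this issue.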

rkooo567 commented 2 years ago

cc @scv119 @iycheng (since we discussed backpressure as a part of scalability improvement)

rkooo567 commented 2 years ago

Hey @lixin-wei how has the investigation gone so far?

lixin-wei commented 2 years ago

Unfortunately, I have no ideas so far other than waiting for the gRPC community's response. Sad.

As a workaround for now, we optimized the number of calls that caused the OOM.
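The thread does not say how the number of calls was optimized. One possible caller-side approach, shown here purely as an assumption, is to cap the number of outstanding calls so the server never has more than a fixed number to buffer from this caller. `InflightLimiter` is a hypothetical helper and not part of Ray or gRPC.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical helper: caps how many calls one caller has outstanding.
class InflightLimiter {
 public:
  explicit InflightLimiter(std::size_t max_inflight)
      : max_inflight_(max_inflight) {}

  // Reserves a slot and returns true if fewer than max_inflight calls are
  // outstanding; otherwise returns false and the caller should retry later.
  bool TryAcquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (inflight_ >= max_inflight_) return false;
    ++inflight_;
    return true;
  }

  // Frees a slot; intended to be called once per completed call,
  // for example from the reply callback.
  void Release() {
    std::lock_guard<std::mutex> lock(mu_);
    if (inflight_ > 0) --inflight_;
  }

  // Number of calls currently holding a slot.
  std::size_t inflight() const {
    std::lock_guard<std::mutex> lock(mu_);
    return inflight_;
  }

 private:
  mutable std::mutex mu_;
  const std::size_t max_inflight_;
  std::size_t inflight_ = 0;
};
```

A caller would check `TryAcquire()` before issuing a call and invoke `Release()` when the reply (or failure) arrives. This limits the load at the source rather than relying on the server to stop reading.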

fishbone commented 2 years ago

Should we remove the client-side back-pressure, given that it's not working and adds too much complexity? @rkooo567 @scv119

lixin-wei commented 2 years ago

Someone replied to my question on Stack Overflow; I'll try it next Monday. https://stackoverflow.com/questions/72424145/how-to-do-server-side-backpressure-in-grpc/73255069#73255069

UPDATE: Sadly it doesn't work.

rkooo567 commented 2 years ago

I am down to remove the feature or turn it off by default. We can start discussing how to improve the mechanism after 2.0 (along with other stability improvements, such as gRPC config improvements). Please let us know if you have any proposals, @wumuzi520 @lixin-wei.

rkooo567 commented 1 year ago

Hey @iycheng, let's remove this? However, we definitely need backpressure for stability, and this can be handled as part of GCS scalability?

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.