ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34k stars 5.78k forks

[State api] Cannot list running tasks due to the 10000 api limit #33547

Open scottsun94 opened 1 year ago

scottsun94 commented 1 year ago

Context

Ray has a state API that exposes internal Ray state, e.g. tasks, workers, etc. See this for usage, and this for the architecture.

State API queries pull their data from components such as the raylet and the GCS (mainly the GCS now). See this for the various components of a Ray cluster.

Because we currently don't have a streaming interface over gRPC, we apply a hard limit on the gRPC payload (in number of entries) when querying state from the GCS. Right now the hard limit is set by RAY_MAX_LIMIT_FROM_DATA_SOURCE when a Ray cluster is started. The current 10k limit is too restrictive, and it cannot be changed once the user has started the cluster.
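To illustrate why the limit is effectively frozen after startup, here is a minimal sketch. It assumes the limit is read once from an environment variable named RAY_MAX_LIMIT_FROM_DATA_SOURCE when the process starts; `env_integer` and `clamp_limit` are hypothetical helpers written for this example, not Ray's actual code.

```python
import os


def env_integer(name: str, default: int) -> int:
    """Hypothetical helper: read an integer env var, else use the default."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default


# Evaluated once, when the process starts. Changing the environment
# variable afterwards has no effect on an already running process.
MAX_LIMIT_FROM_DATA_SOURCE = env_integer("RAY_MAX_LIMIT_FROM_DATA_SOURCE", 10_000)


def clamp_limit(requested: int, max_limit: int = MAX_LIMIT_FROM_DATA_SOURCE) -> int:
    """Cap the number of entries a caller may request from the data source."""
    return min(requested, max_limit)
```

Under this assumption, a request for 50,000 entries is silently capped at the startup value, which matches the truncation warning shown in the repro further down.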

Solutions

Possible follow-up

More follow-up

It would be nice to have a pagination / streaming API for the data transfer from the source to the API server.
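A rough sketch of what token-based pagination could look like, so that no single response has to carry more than one page of entries. Everything here is hypothetical: `fetch_page` stands in for a per-page RPC to the data source, the in-memory `store` list stands in for the GCS task table, and the offset-based token is just the simplest choice for illustration.

```python
from typing import Iterator, List, Optional, Tuple


def fetch_page(
    store: List[dict], page_token: Optional[int], page_size: int
) -> Tuple[List[dict], Optional[int]]:
    """Hypothetical per-page call: return one page and the next token.

    The token is an offset into the store; None means "no more pages".
    """
    start = page_token or 0
    end = start + page_size
    next_token = end if end < len(store) else None
    return store[start:end], next_token


def iter_entries(store: List[dict], page_size: int = 1000) -> Iterator[dict]:
    """Yield every entry while holding only one page in memory at a time."""
    token: Optional[int] = None
    while True:
        page, token = fetch_page(store, token, page_size)
        yield from page
        if token is None:
            break
```

With this shape, the per-response payload stays bounded by `page_size`, while the caller can still walk the full result set instead of being cut off at a fixed total.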

Testing

We could change the release test to cover a higher upper bound: https://github.com/ray-project/ray/blob/3b09a5491a868c93571f391801aba4fca68e0321/release/nightly_tests/stress_tests/test_state_api_scale.py#L312-L319

What happened + What you expected to happen

Versions / Dependencies

nightly

Reproduction script

Some repro from Huaiwei: I had already run several jobs on the cluster. In total, there are probably more than 50k tasks. I started a new job and wanted to see the running tasks.

We should probably filter first, before applying the 10000 limit. Otherwise, I cannot list running tasks at all (this probably applies to tasks in other states too), which makes this command not very useful.

(base) ray@ip-10-0-25-201:~/default$ ray list tasks -f state=running
/home/ray/anaconda3/lib/python3.10/site-packages/ray/experimental/state/api.py:374: UserWarning: The returned data may contain incomplete result. 10000 (53572 total from the cluster) tasks are retrieved from the data source. 43572 entries have been truncated. Max of 10000 entries are retrieved from data source to prevent over-sized payloads.
  warnings.warn(
No resource in the cluster
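To make the ordering problem concrete, here is a small self-contained sketch. It assumes, as the output above suggests, that the first 10000 entries are taken from the data source and the `state=running` filter is applied afterwards. The function names and the synthetic task list (53572 entries, with the 100 newest marked RUNNING) are made up for illustration.

```python
from itertools import islice
from typing import Callable, List


def truncate_then_filter(
    entries: List[dict], predicate: Callable[[dict], bool], limit: int
) -> List[dict]:
    """Assumed current behaviour: cap at `limit` entries, then filter."""
    return [e for e in entries[:limit] if predicate(e)]


def filter_then_truncate(
    entries: List[dict], predicate: Callable[[dict], bool], limit: int
) -> List[dict]:
    """Proposed behaviour: filter at the source, then cap the matches."""
    return list(islice((e for e in entries if predicate(e)), limit))


# Synthetic data: older tasks are FINISHED, the 100 newest are RUNNING.
tasks = [{"task_id": i, "state": "FINISHED"} for i in range(53472)]
tasks += [{"task_id": i, "state": "RUNNING"} for i in range(53472, 53572)]


def is_running(task: dict) -> bool:
    return task["state"] == "RUNNING"


print(len(truncate_then_filter(tasks, is_running, 10_000)))  # 0
print(len(filter_then_truncate(tasks, is_running, 10_000)))  # 100
```

When the running tasks all fall outside the first 10000 entries, truncating first returns nothing, which would explain the "No resource in the cluster" output; filtering first returns all of them and still bounds the payload.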

n/a

Issue Severity

None

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

scottsun94 commented 1 year ago

Unstale