Open scottsun94 opened 1 year ago
Hi, I'm a bot from the Ray team :)
To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Unstale
Context
Ray has a state API that's used to expose internal Ray states, e.g. tasks, workers, etc. See this for usage, and this for the architecture.
State API queries pull their data from sources such as the raylet and GCS (currently mainly GCS). See this for the various components of a Ray cluster.
Since we currently don't have a streaming interface for gRPC, we apply a hard limit on the gRPC data payload (in terms of number of entries) when querying states from GCS. Right now the hard limit is set by RAY_MAX_LIMIT_FROM_DATA_SOURCE when a Ray cluster is started. The current 10k limit is too restrictive, and it is not configurable once the user has started the cluster.
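To make the constraint concrete, here is a minimal sketch of a cap that is fixed at process start. It is not Ray's actual code: `query_source` and `MAX_LIMIT_FROM_DATA_SOURCE` are hypothetical names, and the assumption is only that the value is read once from the RAY_MAX_LIMIT_FROM_DATA_SOURCE environment variable with a default of 10000.

```python
import os

# Assumption: the cap is read once from the environment at startup, so changing
# the variable later has no effect on a running process. 10000 mirrors the
# "10k" default described in this issue.
MAX_LIMIT_FROM_DATA_SOURCE = int(
    os.environ.get("RAY_MAX_LIMIT_FROM_DATA_SOURCE", "10000")
)


def query_source(entries, limit=MAX_LIMIT_FROM_DATA_SOURCE):
    """Hypothetical stand-in for a GCS query with a hard entry cap.

    Returns the entries that fit under the cap and the number that were dropped.
    """
    returned = list(entries[:limit])
    num_truncated = len(entries) - len(returned)
    return returned, num_truncated


# Usage: anything past the cap never reaches the caller.
rows, dropped = query_source(list(range(50)), limit=10)
```

Because the cap is applied at the data source, the caller can only learn how many entries were dropped, not what they were.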
Solutions
Possible follow-up
More follow-up: it would be nice to have a pagination / streaming API for the data transfer from the source to the API server.
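A rough sketch of what offset-based pagination between the source and the API server could look like. Everything here is hypothetical: `fetch_page` stands in for a paged RPC that does not exist today, and the in-memory list stands in for the data source.

```python
from typing import Iterator, List, Sequence, Tuple


def fetch_page(
    source: Sequence[dict], offset: int, page_size: int
) -> Tuple[List[dict], bool]:
    """Hypothetical paged RPC: return one page and whether more entries remain."""
    page = list(source[offset : offset + page_size])
    has_more = offset + len(page) < len(source)
    return page, has_more


def iter_entries(source: Sequence[dict], page_size: int = 1000) -> Iterator[dict]:
    """Yield every entry by requesting bounded pages instead of one capped payload."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page, has_more = fetch_page(source, offset, page_size)
        yield from page
        offset += len(page)
        if not has_more:
            break


# Usage: 2500 entries arrive in three pages of at most 1000 each.
tasks = [{"task_id": i} for i in range(2500)]
all_ids = [t["task_id"] for t in iter_entries(tasks, page_size=1000)]
```

Each individual payload stays bounded by `page_size`, so the per-message limit no longer caps the total number of entries a query can return.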
Testing
We could change the release test to test a higher upper bound: https://github.com/ray-project/ray/blob/3b09a5491a868c93571f391801aba4fca68e0321/release/nightly_tests/stress_tests/test_state_api_scale.py#L312-L319
What happened + What you expected to happen
Versions / Dependencies
nightly
Reproduction script
Some repro from Huaiwei: I had already run several jobs on the cluster. In total, there were probably more than 50k tasks. I started a new job and wanted to see the running tasks.
We should probably filter first before applying the 10000 limit. Otherwise, I cannot list running tasks at all (this probably applies to tasks in other states too), which makes the command not very useful.
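The ordering problem can be illustrated with a small self-contained sketch. This is not the state API implementation: the two functions and the task dictionaries are hypothetical, and the numbers are chosen to match the scenario above (more than 50k old tasks, a handful of running ones, a 10000 cap).

```python
import itertools


def limit_then_filter(tasks, state, limit):
    """Cap first, then filter: matches beyond the first `limit` entries are lost."""
    return [t for t in tasks[:limit] if t["state"] == state]


def filter_then_limit(tasks, state, limit):
    """Filter first, then cap: the cap only applies to matching entries."""
    matching = (t for t in tasks if t["state"] == state)
    return list(itertools.islice(matching, limit))


# 50k finished tasks from earlier jobs, followed by 5 running tasks from a new job.
tasks = [{"task_id": i, "state": "FINISHED"} for i in range(50_000)]
tasks += [{"task_id": 50_000 + i, "state": "RUNNING"} for i in range(5)]

capped_first = limit_then_filter(tasks, "RUNNING", 10_000)    # finds nothing
filtered_first = filter_then_limit(tasks, "RUNNING", 10_000)  # finds all 5
```

With the cap applied first, the running tasks fall outside the first 10000 entries and the filter has nothing to match; filtering at the source keeps the payload bounded while still returning them.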
n/a
Issue Severity
None