ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Dashboard] Ray Dashboard not showing tasks view #35132

Open achordia20 opened 1 year ago

achordia20 commented 1 year ago

What happened + What you expected to happen

After adding GCS fault tolerance (FT), I lost the ability to view the task list, count, and progress from the Jobs view. I was still able to view the tasks in the Tasks Table, though. The head node was restarted before the issue appeared; I'm not sure whether this is relevant.

[Three screenshots attached, taken 2023-05-08]

Versions / Dependencies

Ray 2.4.0

Reproduction script

import ray
import time

ray.init()

database = [
    "Learning", "Ray", "Flexible", "Distributed", "Python", "for", "Machine", "Learning"
]

# Store the database in ray's memory object store
db_object_ref = ray.put(database)

# reserve 2.5GiB of available memory to place this actor
@ray.remote(num_cpus=2, memory=2500 * 1024 * 1024)
class DataTracker:
    def __init__(self):
        self._counts = 0

    def increment(self):
        self._counts += 1

    def counts(self):
        return self._counts

@ray.remote(num_cpus=0.1)
def retrieve_tracker_task(item, tracker, db):
    time.sleep(item / 10.)
    tracker.increment.remote()
    return item, db[item]

# Create the actor
tracker = DataTracker.remote()

# Retrieve object reference for data stored in object store.
object_references = [
    retrieve_tracker_task.remote(item, tracker, db_object_ref) for item in range(len(database))
]

# Pull data from object store
data = ray.get(object_references)

print(f"Total items in database: {ray.get(tracker.counts.remote())}")

Issue Severity

None

scottsun94 commented 1 year ago

cc: @rkooo567 @rickyyx to investigate and see whether this is an easy fix for @iycheng

rkooo567 commented 1 year ago

This is expected now. We should probably update the FT doc to mention the limitation cc @iycheng.

Since we currently don't persist task information (doing so could add a lot of overhead to GCS), task data cannot be retrieved after the head node is restarted.
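To illustrate the limitation, here is a toy model, not Ray's actual GCS code: one event store keeps task events only in memory, and a hypothetical variant also appends them to a file. The class names, the JSON-lines file, and `simulate_restart` are all made up for this sketch.

```python
import json
import tempfile
from pathlib import Path


class InMemoryTaskEvents:
    """Toy stand-in for a task-event store that lives only in process memory."""

    def __init__(self):
        self._events = []

    def record(self, task_name, state):
        self._events.append({"name": task_name, "state": state})

    def list_events(self):
        return list(self._events)


class PersistedTaskEvents(InMemoryTaskEvents):
    """Same store, but every event is also appended to a file (hypothetical)."""

    def __init__(self, path):
        super().__init__()
        self._path = Path(path)
        # Reload whatever an earlier instance wrote before the "restart".
        if self._path.exists():
            with self._path.open() as f:
                self._events = [json.loads(line) for line in f if line.strip()]

    def record(self, task_name, state):
        super().record(task_name, state)
        # This extra write per event is the kind of overhead persistence adds.
        with self._path.open("a") as f:
            f.write(json.dumps({"name": task_name, "state": state}) + "\n")


def simulate_restart(make_store):
    """Record one event, drop the store (the "restart"), then build a new one."""
    store = make_store()
    store.record("retrieve_tracker_task", "FINISHED")
    del store
    return make_store().list_events()


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "task_events.jsonl"
    lost = simulate_restart(InMemoryTaskEvents)
    kept = simulate_restart(lambda: PersistedTaskEvents(log_path))

print(lost)  # events from before the restart are gone
print(kept)  # events survive because they were reloaded from the file
```

In this model, events recorded after the restart would still show up in either store, which is why missing data for newly submitted jobs would not be explained by the lack of persistence alone.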

rkooo567 commented 1 year ago

The task should still run as expected IIRC. This feature is just broken because we don't persist task events to persistent storage.

achordia20 commented 1 year ago

I was seeing this on newly submitted jobs. I can understand losing state when the head node restarts, so that older jobs’ data isn’t visible, but not for jobs submitted after the node is back up.

rkooo567 commented 1 year ago

Did you restart the GCS in the middle of your script, or did you run a script -> restart -> run another script (and the data was lost)?

achordia20 commented 1 year ago

I don’t touch GCS at all. To upgrade the cluster in any way, I have to delete the entire Ray cluster and recreate it. After doing so, I run the script and don’t see any task information in 2 of the 3 places.

rkooo567 commented 1 year ago

Hmm yeah this seems to be a bug from the task backend then. cc @rickyyx to reproduce it.
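When reproducing, it may help to check whether the task data is missing from the backend or only from the Jobs view by aggregating task records outside the dashboard. Below is a minimal sketch that counts records per task name and state. It assumes each record is a mapping with `name` and `state` keys; the sample records are hand-written. Real records could come from Ray's state API (`list_tasks`, which I believe lives in `ray.util.state` in recent releases and in `ray.experimental.state.api` in 2.4), but the exact field names should be checked against the installed version.

```python
from collections import Counter, defaultdict


def summarize_tasks(records):
    """Count task records per task name and state.

    `records` is any iterable of mappings with "name" and "state" keys
    (an assumed shape, chosen for illustration).
    """
    summary = defaultdict(Counter)
    for record in records:
        summary[record["name"]][record["state"]] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {name: dict(states) for name, states in summary.items()}


# Hand-written sample records; real ones would come from the cluster.
sample = [
    {"name": "retrieve_tracker_task", "state": "FINISHED"},
    {"name": "retrieve_tracker_task", "state": "FINISHED"},
    {"name": "retrieve_tracker_task", "state": "RUNNING"},
    {"name": "DataTracker.increment", "state": "FINISHED"},
]
result = summarize_tasks(sample)
print(result)
```

If a summary like this shows the expected tasks while the Jobs view stays empty, the problem is more likely in how the Jobs view queries or groups tasks than in the task data itself.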