ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[dashboard] not counting any workers #10102

Closed meyerzinn closed 3 years ago

meyerzinn commented 4 years ago

What is the problem?

All nodes report 0 workers, despite the Ray Tune trials running on the cluster. I've verified that the processes are running and that they start with "ray::", so they should be included in the count.

[Screenshots: Ray dashboard showing 0 workers for every node]

Ray version and other system information (Python version, TensorFlow version, OS): Latest nightly version, Python 3.7, running on Red Hat 7.

snmhaines commented 4 years ago

I am suddenly getting a similar-looking problem (using Ubuntu 18.04): the cluster launches with no errors, but the EC2 console only shows the head node, even if I only requested one worker. I have opened a support case with AWS, in case it is their fault.

meyerzinn commented 4 years ago

@snmhaines Unless I'm misunderstanding you, that sounds like it might be a different problem. I've never used AWS for Ray so I can't speak to their console, but the problem I'm seeing is that Ray workers (which exist and are running properly) are not being counted in the Ray dashboard.

richardliaw commented 4 years ago

Hey both, thanks a bunch for reaching out.

@20zinnm can you provide some Tune output (i.e. the resource string)? Just want to check that the cluster is connected properly.

@snmhaines Can you open a separate issue? It sounds like you're using the Ray cluster launcher, right? Can you post some of the output from ray monitor [clusteryaml] in that separate issue?
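
For reference, a quick way to confirm what the driver actually sees is something like the following, run from the head node (a minimal sketch using Ray's public API; the totals should line up with the resource string Tune prints):

```python
import ray

# Attach to the already-running cluster instead of starting a local one.
ray.init(address="auto")

# One entry per node that has joined the cluster.
print(len(ray.nodes()), "nodes connected")

# Aggregate resources across the cluster, e.g. {'CPU': 840.0, ...};
# this should match the resources shown in the Tune status output.
print(ray.cluster_resources())
```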

meyerzinn commented 4 years ago

@richardliaw Sorry, I'm not quite sure what you mean by Tune output. The cluster is all connected properly; here I am using 15x 28-core machines (56 threads):

[Screenshot: Tune status output showing the cluster resources in use]

And I've SSH'd into various nodes to make sure there are ray:: processes listed in htop.
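
Roughly the same check as htop, runnable on a node without a terminal UI (a sketch that assumes psutil is installed; it relies on Ray workers renaming their processes to "ray::<task or actor>"):

```python
import psutil

workers = []
for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    name = proc.info["name"] or ""
    cmdline = " ".join(proc.info["cmdline"] or [])
    # Ray task/actor workers set their process title to "ray::<name>".
    if name.startswith("ray::") or cmdline.startswith("ray::"):
        workers.append((proc.info["pid"], cmdline or name))

print(f"{len(workers)} ray:: worker processes on this node")
for pid, title in workers:
    print(pid, title)
```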

richardliaw commented 4 years ago

Yeah, that's what I wanted to see, thanks!

Is it possible that there are two Ray clusters running at once?

meyerzinn commented 4 years ago

@richardliaw I don't believe so, no. At least I don't see any signs of another ray cluster, and I'm supposed to have exclusive access to these nodes (allocated via SLURM).

richardliaw commented 4 years ago

Thanks - I'll follow up within the team!

@mfitton looks like the dashboard isn't rendering some actors properly?

BTW, @20zinnm would love to hear about what you're working on and what we can do to make Tune/Ray better. Would you be willing to hop on a short 30-minute call sometime next week to chat? Just let me know!

meyerzinn commented 4 years ago

@richardliaw Thank you for your help! I'd be happy to chat next week. Feel free to email me (meyerzinn at gmail) and we can find a time.

snmhaines commented 4 years ago

@richardliaw: I will leave this problem until tomorrow. It will either have gone away, or I will hear from AWS.
However, I have opened a new issue about my next problem, "Too many open files error" (#10104), which you might take a look at.

mfitton commented 4 years ago

Thanks @richardliaw I'll look into it.

meyerzinn commented 4 years ago

By the way, @mfitton, I verified using the Firefox network inspector that the JSON response from node_info correctly lists all of the workers for each node. The bug is likely on the React side of things.
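
For anyone who wants to pull the same payload outside the browser, something along these lines prints the per-node worker counts the dashboard backend reports (a sketch: the route and the payload keys are placeholders copied from what the network inspector shows rather than a documented API; 8265 is the default dashboard port):

```python
import json
import urllib.request

DASHBOARD = "http://127.0.0.1:8265"
# Placeholder path: use the exact node_info URL shown in the browser's network tab.
NODE_INFO_PATH = "/api/node_info"

with urllib.request.urlopen(DASHBOARD + NODE_INFO_PATH) as resp:
    payload = json.load(resp)

# Key names are approximate; adjust them to match the actual response.
for node in payload.get("result", {}).get("clients", []):
    print(node.get("hostname"), "workers:", len(node.get("workers", [])))
```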

snmhaines commented 4 years ago

Thanks for picking this up @mfitton,

logs.zip

I still have the problem with the lack of workers on EC2 today. AWS support doesn't quite understand the problem yet, so I have sent them more screenshots. In case you can find any clues, I am attaching the logs from a launch that I just did requesting 2 on-demand workers (to make things easier). Again, only the head node was visible on the EC2 Management Console Instances screen. Usually, the workers have at least started to initialize by the time the launch process completes in Ray.

mfitton commented 4 years ago

@snmhaines @20zinnm I'm able to intermittently recreate this issue on a cluster I spun up. It seems that when I submit a script, new workers are spun up and I can see their progress, but when they finish, sometimes (though not always) those workers' rows disappear from the machine view. I'm trying to figure out what's causing it, but being able to reproduce it is a good sign.

Does my description gel with what you two experienced?

mfitton commented 4 years ago

@20zinnm would you be able to send me the payload that you're seeing from the node_info response in Firefox? If you could send me the raylet_info response too that would be really helpful.

I'm having trouble recreating it in my local dev environment where I can set breakpoints and stuff, so I'm thinking it would be helpful to see the payloads you're receiving to see if I can reason about where the error is popping up.

I'm fine with you posting it here, or if you don't want to post it publicly, you can send it to me in the Ray slack.

meyerzinn commented 4 years ago

@mfitton My experience with it is workers never show up on the dashboard. I will send you the payload via Slack when I get a chance.

snmhaines commented 4 years ago

@mfitton, I think that @20zinnm is right; we have two different problems with a similar symptom. In my case, I never see any workers starting. I still suspect that it is caused at the AWS end, because at the end of Wednesday, after a lot of starting and stopping clusters (to debug the other problem), the number of workers started to decrease. So I switched to on-demand - which worked once - and then further spin-ups produced no workers at all. This sort of stochastic behaviour could be caused by variable demand on AWS capacity and adaptive prioritization of different customers.

snmhaines commented 4 years ago

@mfitton, I think that I have found a clue in the monitor.err log for a cluster that I tried today:-

Error in sys.excepthook:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 887, in custom_excepthook
    worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'

Original exception was:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 334, in <module>
    redis_password=args.redis_password)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 53, in __init__
    self.load_metrics)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 60, in __init__
    self.reload_config(errors_fatal=True)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 286, in reload_config
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 273, in reload_config
    new_config["cluster_synced_files"],
KeyError: 'cluster_synced_files'

I do not understand why the worker_id had been lost, but I am attaching all the relevant information. Note that the AWS CloudTrail event for the launch shows minCount and maxCount as 1, despite the fact that the .yaml config file asks for 20.

logs.zip

snmhaines commented 4 years ago

@mfitton, I think that the real error is "KeyError: 'cluster_synced_files'", thrown when the autoscaler initializes the cluster. I do not use any cluster-synced files, but when I tried adding that section with no files (as in example-full.yaml), I just got a syntax error. I did find a typo in the file_mounts: section of my config, but correcting it didn't make any difference. I have tried a number of different variations on that section, including no file mounts at all, but the result is the same. I am running out of ideas now.

richardliaw commented 4 years ago

Can you put a cluster_synced_files: null into your yaml?

snmhaines commented 4 years ago

That just produced the same error as cluster_synced_files: []

jsonschema.exceptions.ValidationError: Additional properties are not allowed ('cluster_synced_files' was unexpected)

This is the section:-


# Files or directories to copy to the head and worker nodes. 
file_mounts: {
    "~/run_files": "./run_files",
}

# Only use this if you know what you're doing!
cluster_synced_files: null

# List of commands run before `setup_commands`.
initialization_commands: []
snmhaines commented 4 years ago

Could this be a result of this issue: [autoscaler] Throw error with missing cluster_synced_files. #9965 ? If so, can you suggest a quick temporary fix to autoscaler.py?
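
Based only on the traceback above, the crash comes from reload_config indexing new_config["cluster_synced_files"] directly, so an untested local workaround is to default the key before it is read, along these lines (the helper name here is just for illustration):

```python
# Untested sketch: treat a missing cluster_synced_files section the same as
# an empty one, so configs written for the older yaml format don't raise
# KeyError inside reload_config.
def fill_missing_sections(new_config: dict) -> dict:
    new_config.setdefault("cluster_synced_files", [])
    return new_config

# Example: a config dict as loaded from a yaml that predates the new key.
old_style_config = {"file_mounts": {"~/run_files": "./run_files"}}
print(fill_missing_sections(old_style_config)["cluster_synced_files"])  # []
```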

snmhaines commented 4 years ago

I find that my version of ray-schema.json (with Ray 0.8.6) has no description of cluster_synced_files, so I put one in (the same format as file_mounts) and put "cluster_synced_files: {}" in the config file. This got rid of the error and the cluster spun up, but there are still no workers and, this time, no errors in the monitor.err log!
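
A toy reproduction of why editing the schema makes the ValidationError go away: the schema rejects any top-level key it does not describe (additionalProperties is false), so the same config validates once the key is added (a minimal sketch with jsonschema, not Ray's actual schema):

```python
import jsonschema

# Minimal stand-in for ray-schema.json: unknown top-level keys are rejected.
schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "file_mounts": {"type": "object"},
    },
}

config = {
    "file_mounts": {"~/run_files": "./run_files"},
    "cluster_synced_files": {},
}

try:
    jsonschema.validate(config, schema)
except jsonschema.exceptions.ValidationError as e:
    print(e.message)  # Additional properties are not allowed (...)

# Describing the key (here, with the same shape as file_mounts) makes the
# same config validate.
schema["properties"]["cluster_synced_files"] = {"type": "object"}
jsonschema.validate(config, schema)
```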

logs.zip

Also, minCount and maxCount in the AWS CloudTrail event log are still only set to 1:-

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "Root",
        "principalId": "601784597600",
        "arn": "arn:aws:iam::601784597600:root",
        "accountId": "601784597600",
        "accessKeyId": "AKIAJIKJUPOT6KV3A4WA"
    },
    "eventTime": "2020-08-19T22:00:23Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RunInstances",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "73.145.166.129",
    "userAgent": "Boto3/1.12.12 Python/3.6.9 Linux/5.4.0-42-generic Botocore/1.15.12 Resource",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "imageId": "ami-0ac80df6eff0e70b5",
                    "minCount": 1,
                    "maxCount": 1,
                    "keyName": "ray-key2_us-east-1"
                }
            ]
        },
snmhaines commented 4 years ago

Well, I had to do this the hard way, starting from an old version of my config and then adding and subtracting different combinations of node set-up commands (and versions thereof) until I found a combination that worked (attached).
The cluster_synced_files problem seems to be fixed in Ray 0.8.7 and 0.9.0, so the block doesn't even need to be included. However, v0.9.0 doesn't start up the workers, at least not as configured in the present state of https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml. For my purposes, the Anaconda3 package installation is necessary, but a newer version that I tried (2018.12) seemed unable to find, or even install, pip. Using the old version 5.0.1 requires an upgrade of pip and setuptools, and the installation of a different version of scipy. Similarly, sticking to boto3 v1.4.8 seems necessary.

Now back to the "Too many open files" error (#10104).

conf_east_c5.24xlarge.zip

mfitton commented 3 years ago

@20zinnm this issue is fixed in the nightly build and will be in the next Ray release. It was present in the previous version of the dashboard API, but I haven't seen it, or any similar reports, with the new dashboard infrastructure. Please reopen if you encounter this issue going forward. Thanks!