import-antigravity opened this issue 3 years ago (Open)
What's the behavior you're seeing when it stalls/times out? Is this an actual TimeoutError, or is it just hanging with no progress being made? It would be great if you could send what your stdout looks like when this happens.
That's the strange thing: it just hangs without starting any tuning trials.
2021-09-24 12:32:06,187 INFO services.py:1263 -- View the Ray dashboard at http://...:8265
[2021-09-24 12:32:06,805 I 508 508] global_state_accessor.cc:332: This node has an IP address of ..., while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-09-24 12:32:06,877 I 29597 29597] global_state_accessor.cc:332: This node has an IP address of ..., while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
Starting WORKER 4 at classt05
[2021-09-24 12:32:07,821 I 27445 27445] global_state_accessor.cc:332: This node has an IP address of ..., while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-09-24 12:32:07,855 I 31635 31635] global_state_accessor.cc:332: This node has an IP address of ..., while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
DEBUG:root:NCCL: True
DEBUG:ray.worker:Automatically increasing RLIMIT_NOFILE to max value of 131072
2021-09-24 12:32:10,875 INFO worker.py:825 -- Connecting to existing Ray cluster at address: ...:6379
[2021-09-24 12:32:10,882 I 22186 22186] logging.cc:186: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
2021-09-24 12:32:11,055 WARNING tune.py:506 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
slurmstepd: error: _is_a_lwp: open() /proc/1373/status failed: No such file or directory
slurmstepd: error: *** STEP 16419161.4 ON classt04 CANCELLED AT 2021-09-26T00:32:17 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 16419161.2 ON classt02 CANCELLED AT 2021-09-26T00:32:17 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 16419161.5 ON classt05 CANCELLED AT 2021-09-26T00:32:17 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 16419161.1 ON classt01 CANCELLED AT 2021-09-26T00:32:17 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 16419161.3 ON classt03 CANCELLED AT 2021-09-26T00:32:17 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 16419161 ON classt01 CANCELLED AT 2021-09-26T00:32:17 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
I've noticed that when running on my SLURM cluster, if num_workers is set too high (as far as I can tell the threshold is arbitrary), the job starts and the Ray dashboard shows the GPUs as fully utilized, but the job stalls and eventually times out.
Python script:
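A minimal sketch of this kind of driver, assuming a Ray cluster already started by the SLURM job and a dummy trainable (the function, search space, and address handling below are illustrative assumptions, not the actual script):

```python
import os

import ray
from ray import tune


def trainable(config):
    # Placeholder objective; stands in for the real training function.
    tune.report(loss=config["lr"])


if __name__ == "__main__":
    # Attach to the existing cluster started by the SLURM job
    # (the log above shows the head node listening on port 6379).
    ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))

    tune.run(
        trainable,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=8,
        # Request a GPU per trial; without this, Tune emits the
        # "Tune detects GPUs, but no trials are using GPUs" warning above.
        resources_per_trial={"cpu": 2, "gpu": 1},
    )
```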
SLURM script: