Hi @dangalea, is your ray version the same (2.3.0) on all nodes?
Hi @justinvyu, yes they are all using the same environment.
Could you try printing out the list of nodes in the cluster, on each node?
Something like:
# Run on your head node
import ray
from ray.air.util.node import _force_on_node

ray.init()

@ray.remote
def log():
    print("Me:", ray.get_runtime_context().get_node_id())
    print("Me + everyone else:", [node["NodeID"] for node in ray.nodes()])

# Does your head node see everyone?
assert len(ray.nodes()) == 1  # insert your expected value

for node in ray.nodes():
    # Do your worker nodes see everyone?
    ray.get(log.options(**_force_on_node(node["NodeID"])).remote())
Also, could you try this on your head node?
import ray
from ray.air.util.node import _get_node_id_from_node_ip
ray.init()
print(ray.get_runtime_context().get_node_id())
print(_get_node_id_from_node_ip(ray.util.get_node_ip_address()))
Hi @justinvyu,
I get the following when executing the first snippet on my nodes:
Me: 361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c
Me + everyone else: ['361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c', '8ffeb23df2477e6c5ca3c2da37c02463257f5b3d2e59bb8d1fd79d1d']
Me: 8ffeb23df2477e6c5ca3c2da37c02463257f5b3d2e59bb8d1fd79d1d
Me + everyone else: ['361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c', '8ffeb23df2477e6c5ca3c2da37c02463257f5b3d2e59bb8d1fd79d1d']
I think this shows that all nodes (2 in my case) can see each other. However, I should have 4 GPUs listed (2 nodes of 2 GPUs each). Does this affect things?
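As a quick cross-check on the GPU count, here is a minimal sketch (assuming it is run from a node attached to the cluster) that prints the aggregate resources Ray has registered, using the public ray.cluster_resources() call; the node list above only shows IDs, not per-node resources:
import ray
# Attach to the running cluster rather than starting a local one.
ray.init(address="auto")
# With 2 nodes x 2 GPUs this should report "GPU": 4.0 once both raylets
# have registered their GPUs.
print(ray.cluster_resources())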
Also, when I run your second snippet, I get:
361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c
None
I also noticed that I have this in my error output, which may be relevant:
[2023-03-01 09:38:30,168 I 3213126 3213126] global_state_accessor.cc:356: This node has an IP address of 192.168.128.34, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
I've also tried running this on one node, i.e. the head and worker node are the same, but the error still persists.
Hey @dangalea , I think the error output you shared may be relevant. Can you try running the following on your head node?
import ray
ray.init()
print(f"Current IP: {ray.util.get_node_ip_address()}")
print(f"Current Node ID: {ray.get_runtime_context().get_node_id()}")
print(f"Nodes: {ray.nodes()}")
cc @jjyao
Hey @matthewdeng, this is what I get:
Current IP: 192.168.128.34
Current Node ID: 420ee8af5b8c66e85474afbdbc13c4bb9eb1bc06138f737d30871bcc
Nodes: [{'NodeID': '420ee8af5b8c66e85474afbdbc13c4bb9eb1bc06138f737d30871bcc', 'Alive': True, 'NodeManagerAddress': 'pascal35', 'NodeManagerHostname': 'pascal35', 'NodeManagerPort': 44637, 'ObjectManagerPort': 46101, 'ObjectStoreSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/plasma_store', 'RayletSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/raylet', 'MetricsExportPort': 47246, 'NodeName': 'pascal35', 'alive': True, 'Resources': {'object_store_memory': 80584793702.0, 'GPU': 2.0, 'CPU': 72.0, 'node:pascal35': 1.0, 'memory': 178031185306.0, 'accelerator_type:P100': 1.0}}, {'NodeID': 'b74500a5af9952196fc6a294cfa984e6212a4ed51d469e287a7a7dfd', 'Alive': True, 'NodeManagerAddress': '192.168.128.35', 'NodeManagerHostname': 'pascal36', 'NodeManagerPort': 41949, 'ObjectManagerPort': 43949, 'ObjectStoreSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/plasma_store', 'RayletSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/raylet', 'MetricsExportPort': 46173, 'NodeName': '192.168.128.35', 'alive': True, 'Resources': {'accelerator_type:P100': 1.0, 'memory': 188305381376.0, 'object_store_memory': 80702306304.0, 'GPU': 2.0, 'CPU': 72.0, 'node:192.168.128.35': 1.0}}]
Hmm yeah, seems like it's because the NodeManagerAddress is pascal35 (which seems to be the host name?) here rather than the IP address.
Head Node: 'NodeManagerAddress': 'pascal35', 'NodeManagerHostname': 'pascal35'
Worker Node: 'NodeManagerAddress': '192.168.128.35', 'NodeManagerHostname': 'pascal36'
@jjyao can you take a look at this and see if NodeManagerAddress should be the IP address instead? Or if the current output is expected, should the logic to map IP to NodeID be changed? https://github.com/ray-project/ray/blob/a892241ca7574af47f278a667e6493a4b03686d7/python/ray/air/util/node.py#L5-L11
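For context, the linked helper essentially matches the given IP against each node's NodeManagerAddress. The following is only a sketch of that logic (not the actual Ray source), illustrating why a hostname in that field makes the lookup return None:
import ray

def _node_id_from_ip_sketch(node_ip: str):
    """Rough sketch of the IP -> NodeID lookup; assumes ray.init() was called."""
    for node in ray.nodes():
        # NodeManagerAddress is expected to hold an IP; on the head node above
        # it holds the hostname "pascal35", so nothing matches the driver's IP.
        if node["NodeManagerAddress"] == node_ip:
            return node["NodeID"]
    # The None that falls out here is what later surfaces as
    # "'NoneType' object has no attribute 'hex'".
    return None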
NodeManagerAddress should be IP. On the head node, @dangalea, could you search for
RAY_LOG(INFO) << "Raylet of id, " << self_node_id_
<< " started. Raylet consists of node_manager and object_manager."
<< " node_manager address: " << self_node_info_.node_manager_address()
<< ":" << self_node_info_.node_manager_port()
<< " object_manager address: " << self_node_info_.node_manager_address()
<< ":" << self_node_info_.object_manager_port()
<< " hostname: " << self_node_info_.node_manager_hostname();
in /tmp/ray/session_latest/logs/raylet.out
Also, could you show the full command of the raylet process on the head node via ps aux | grep raylet?
@jjyao, I could not find the file at /tmp/ray/session_latest/logs/raylet.out, but this is what I get for ps aux | grep raylet:
galea1 3228257 1.5 0.0 82858816 29048 ? Sl 11:33 0:00 /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --store_socket_name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=pascal35 --maximum_startup_concurrency=72 --static_resource_list=node:pascal35,1.0,accelerator_type:P100,1,CPU,72,GPU,2,memory,178010839655,object_store_memory,80576074137 --python_worker_command=/usr/workspace/galea1/conda_envs/envs/tracking/bin/python /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/workers/setup_worker.py /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/workers/default_worker.py --node-ip-address=pascal35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --raylet-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --redis-address=None --temp-dir=/var/tmp/galea1/ray --metrics-agent-port=61074 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=pascal35:6379 --session-name=session_2023-03-01_11-32-59_368536_3227792 --temp-dir=/var/tmp/galea1/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=cf14098f-f540-43fd-8436-2333f52d04a8 --java_worker_command=/usr/workspace/galea1/conda_envs/envs/tracking/bin/python /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/workers/setup_worker.py -Dray.address=pascal35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store -Dray.raylet.socket-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet -Dray.redis.password=cf14098f-f540-43fd-8436-2333f52d04a8 -Dray.node-ip=pascal35 -Dray.home=/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/../.. 
-Dray.logging.dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs -Dray.session-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker --cpp_worker_command= --native_library_path=/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/cpp/lib --temp_dir=/var/tmp/galea1/ray --session_dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 --log_dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs --resource_dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/runtime_resources --metrics-agent-port=61074 --metrics_export_port=62332 --object_store_memory=80576074137 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=pascal35:6379 --session-name=session_2023-03-01_11-32-59_368536_3227792 --agent_command=/usr/workspace/galea1/conda_envs/envs/tracking/bin/python -u /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-address=pascal35 --metrics-export-port=62332 --dashboard-agent-port=61074 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --raylet-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --temp-dir=/var/tmp/galea1/ray --session-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 --runtime-env-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/runtime_resources --log-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-03-01_11-32-59_368536_3227792 --gcs-address=pascal35:6379 --minimal --node-name=pascal35
galea1 3228490 4.1 0.0 3110140 96480 ? Sl 11:33 0:00 /usr/workspace/galea1/conda_envs/envs/tracking/bin/python -u /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-address=pascal35 --metrics-export-port=62332 --dashboard-agent-port=61074 --listen-port=52365 --node-manager-port=36579 --object-store-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --raylet-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --temp-dir=/var/tmp/galea1/ray --session-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 --runtime-env-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/runtime_resources --log-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-03-01_11-32-59_368536_3227792 --gcs-address=pascal35:6379 --minimal --agent-id 1059961393
Ah, could you set the head_node_ip as described in the SLURM documentation?
Thanks @dangalea! The --node_ip_address=pascal35 here is already wrong.
You mentioned that you used the following command to start the head node:
# Launch the head node
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2
What's the value of $1? Is it pascal35 or an IP address?
@jjyao, $1 is pascal35.
I have taken @matthewdeng's advice and reformulated my submission script. This is now:
#!/bin/bash
#SBATCH --job-name=ray
#SBATCH --output=ray.out
#SBATCH --error=ray.err
#SBATCH --time=24:00:00
#SBATCH --partition=pbatch
#SBATCH -A cbronze
### This script works for any number of nodes, Ray will find and manage all resources
#SBATCH --ntasks=4
### Give all resources to a single Ray task, ray can manage the resources internally
#SBATCH --ntasks-per-node=2
##SBATCH --gpus-per-task=2
###SBATCH --cpus-per-task=36
. /usr/workspace/galea1/anaconda3/etc/profile.d/conda.sh
conda activate tracking
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port \
    --block &
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  echo "Starting WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w "$node_i" \
      ray start --address "$ip_head" \
      --block &
  sleep 5
done
python mnist.py --cuda
This solves my initial problem, but now any node which is not the head node is not being used by ray. Would you know what might be the problem?
but now any node which is not the head node is not being used by ray.
You mean the ray cluster only contains the head node but no worker nodes? How did you realize that? What's the output of ray status?
Not quite. Ray is available on both the head node and the worker node. ray status on the head node returns this error:
(base) [galea1@pascal35:bin]$ ./ray status
Traceback (most recent call last):
File "/usr/WS2/galea1/conda_envs/envs/tracking/bin/./ray", line 8, in <module>
sys.exit(main())
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2422, in main
return cli()
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1907, in status
address = services.canonicalize_bootstrap_address_or_die(address)
File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/services.py", line 541, in canonicalize_bootstrap_address_or_die
raise ConnectionError(
ConnectionError: Found multiple active Ray instances: {'192.168.128.34:6379', '192.168.128.34:62451'}. Please specify the one to connect to by setting the `--address` flag or `RAY_ADDRESS` environment variable.
ray status on the worker node returns:
(base) [galea1@pascal36:bin]$ ./ray status
======== Autoscaler status: 2023-03-01 13:42:49.214032 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_e3379008830a99ac39b8c8efe715e72ae8e1a21231b0fa969aac275e
1 node_3656fe502fa56d0f5b12988ee02debc2bb3baf6834f154bcc08e08fe
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/144.0 CPU
0.0/4.0 GPU
0.0/2.0 accelerator_type:P100
0.00/341.077 GiB memory
0.00/150.167 GiB object_store_memory
Demands:
(no resource demands)
but the worker node is not being used. Could this be a mismatch in ports?
ConnectionError: Found multiple active Ray instances: {'192.168.128.34:6379', '192.168.128.34:62451'}. Please specify the one to connect to by setting the --address flag or RAY_ADDRESS environment variable.
You started multiple ray instances on the same head machine? Is it because you didn't clean up the old ones? Could you stop everything and restart the ray cluster?
I have checked that I do not have any stale instances. Could my script be starting two instances at the same time?
I think I might know what the problem is: in your Ray application, could you change ray.init() to ray.init(address="auto")? Currently there is a bug where calling ray.init() creates a new single-node cluster instead of connecting to the existing cluster.
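For reference, a minimal sketch of what that change looks like in the driver script (only the address argument matters here):
import ray
# Attach to the cluster started by `ray start` instead of creating a new
# single-node Ray instance for this driver process.
ray.init(address="auto")
# ... set up and run the Tune experiment as before ...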
That seems to work now. I did not have ray.init() before. When I add it and run ray status on either the head node or the worker node, I get the following:
(base) [galea1@pascal35:bin]$ ./ray status
======== Autoscaler status: 2023-03-01 16:47:37.843071 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_0bf2868539ecac836c61a352462452b953710b904e9f6b0b9d4b25c1
1 node_9141273b0ca7727b27c33f7d427040275356cfc7a2ea43685c400099
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
8.0/144.0 CPU (8.0 used of 8.0 reserved in placement groups)
4.0/4.0 GPU (4.0 used of 4.0 reserved in placement groups)
0.0/2.0 accelerator_type:P100
0.00/341.044 GiB memory
0.00/150.153 GiB object_store_memory
Demands:
(no resource demands)
However, I am still concerned that the accelerators are not being used effectively. The GPU utilisation rate is at 2% across all 4 GPUs.
Glad to hear that it's working.
@matthewdeng @justinvyu could you take over from here for the GPU utilization issue?
Awesome!
For the GPU issue, can you confirm what you're running?
From the original script, it looks like you are setting gpus_per_trial=0, but from the ray status output it looks like GPUs are reserved?
Yes, I'd changed that to gpus_per_trial=1. From nvidia-smi, the GPUs' memory is being used but the usage percentage seems too low:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:04:00.0 Off | 0 |
| N/A 32C P0 31W / 250W | 880MiB / 16384MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:07:00.0 Off | 0 |
| N/A 29C P0 32W / 250W | 880MiB / 16384MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 331262 C ray::ImplicitFunc.train 878MiB |
| 1 N/A N/A 331448 C ray::ImplicitFunc.train 878MiB |
+-----------------------------------------------------------------------------+
Okay, that's good to know. In that case I think the most likely reason for this is that the script is a bit of a "toy example" (if you ran the same PyTorch training code without Ray you would see similar GPU utilization).
Some potential ways to see higher GPU utilization are to increase the complexity of the model, or to increase batch size.
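Purely as an illustration (the dataset, shapes and sizes below are made up, not taken from the script in this thread), the usual first knobs inside the training function look like this:
import torch
from torch.utils.data import DataLoader, TensorDataset
# Illustrative only: a larger batch size and more loader workers keep the GPU
# busier per step, and a bigger model increases the compute done per batch.
dataset = TensorDataset(torch.randn(10_000, 3, 32, 32),
                        torch.randint(0, 10, (10_000,)))
loader = DataLoader(dataset, batch_size=512, num_workers=4, pin_memory=True)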
Ok thanks for that, I'll try changing some parameters around. Thanks for all your guys' help.
Hello everyone,
Sorry to be reopening an issue, but I'm running into the same error despite implementing the above-mentioned solution...
When running an example very similar to the distributed ResNet50 PyTorch example on an HPC cluster with a Slingshot interconnect and the rayproject/ray-ml:2.3.1-py39-cu116 container, the Ray cluster exits with a 'NoneType' object has no attribute 'hex' error:
Failure # 1 (occurred at 2023-04-27_06-22-29)
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1544, in stop_trial
self._callbacks.on_trial_complete(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/callback.py", line 360, in on_trial_complete
callback.on_trial_complete(**info)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 731, in on_trial_complete
self._sync_trial_dir(trial, force=True, wait=True)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 685, in _sync_trial_dir
sync_process.wait()
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 237, in wait
raise exception
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 200, in entrypoint
result = self._fn(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
return _sync_dir_between_different_nodes(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 175, in _sync_dir_between_different_nodes
num_cpus=0, **_force_on_node(target_node_id)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/util/node.py", line 35, in _force_on_node
scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/scheduling_strategies.py", line 61, in __init__
node_id = node_id.hex()
AttributeError: 'NoneType' object has no attribute 'hex'
This happens regardless of whether all Ray processes are on the same node or not (at least the worker ones).
In my case, I set the IP address to the Slingshot IP on the head node with the following lines:
head_ip_address=$( ip -f inet addr show hsn0 | egrep -m 1 -o 'inet [0-9.]{1,}' | sed 's/inet //' )
ray start --head --node-ip-address=$head_ip_address ...
The cluster is verified by ssh'ing to the assigned nodes and running ray status, which returns on all nodes (for example):
======== Autoscaler status: 2023-04-27 05:08:04.778682 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_446af567b5feda490ed5d1a2a44ae59ac51308e9620c1f7283314593
1 node_88ecddf9c244790b85d086375bad4e8871549d0fdbe65e7ad4343233
1 node_ca33a52732fc431c596842d423f90357814a0f07186ecde9a556af9f
1 node_94c571d20ed87dbd7e64191d2846be42368225da0a1ccf296bd768bd
1 node_bbd25329f4a4023e37ec20e15d8a3c7dde89421c4bd55f6180eee808
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/35.0 CPU
0.0/4.0 GPU
0.0/5.0 accelerator_type:A100
0.00/1729.404 GiB memory
0.00/745.165 GiB object_store_memory
Demands:
(no resource demands)
Running the example via python3 my_script.py, the script seems to be training on multiple GPUs when issuing nvidia-smi:
Thu Apr 27 15:22:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:03:00.0 Off | 0 |
| N/A 45C P0 260W / 400W | 5409MiB / 40960MiB | 46% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:41:00.0 Off | 0 |
| N/A 43C P0 96W / 400W | 5385MiB / 40960MiB | 38% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:81:00.0 Off | 0 |
| N/A 46C P0 93W / 400W | 5409MiB / 40960MiB | 54% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:C1:00.0 Off | 0 |
| N/A 47C P0 245W / 400W | 5385MiB / 40960MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3756461 C ...._RayTrainWorker__execute 5406MiB |
| 1 N/A N/A 3756462 C ...._RayTrainWorker__execute 5382MiB |
| 2 N/A N/A 3756460 C ...._RayTrainWorker__execute 5406MiB |
| 3 N/A N/A 3756459 C ...._RayTrainWorker__execute 5382MiB |
+-----------------------------------------------------------------------------+
But only until the first epoch is finished, when it exits with:
...
== Status ==
Current time: 2023-04-27 06:22:21 (running for 00:01:12.14)
Memory usage on this node: 28.0/502.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 33.0/35 CPUs, 4.0/4 GPUs, 0.0/1715.33 GiB heap, 0.0/739.13 GiB objects (0.0/5.0 accelerator_type:A100)
Result logdir: ${RAY_TMPDIR}/ray_results/TorchTrainer_2023-04-27_06-21-09
Number of trials: 1/1 (1 RUNNING)
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
| Trial name | status | loc | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
|--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------|
| TorchTrainer_5f9b6_00000 | RUNNING | 10.150.0.31:3756296 | 9 | 68.0712 | 2.33203 | 1682601741 | 4.8599 |
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
...
== Status ==
Current time: 2023-04-27 06:22:29 (running for 00:01:19.93)
Memory usage on this node: 27.1/502.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/35 CPUs, 0/4 GPUs, 0.0/1715.33 GiB heap, 0.0/739.13 GiB objects (0.0/5.0 accelerator_type:A100)
Result logdir: ${RAY_TMPDIR}/ray_results/TorchTrainer_2023-04-27_06-21-09
Number of trials: 1/1 (1 ERROR)
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
| Trial name | status | loc | iter | total time (s) | loss | _timestamp | _time_this_iter_s |
|--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------|
| TorchTrainer_5f9b6_00000 | ERROR | 10.150.0.31:3756296 | 10 | 73.0525 | 2.33749 | 1682601746 | 4.91388 |
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
Number of errored trials: 1
+--------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|--------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------|
| TorchTrainer_5f9b6_00000 | 1 | ${RAY_TMPDIR}/ray_results/TorchTrainer_2023-04-27_06-21-09/TorchTrainer_5f9b6_00000_0_2023-04-27_06-21-10/error.txt |
+--------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------+
The Ray cluster is brought up using the PBS Pro mpiexec command, which (depending on the MPI rank) starts the head, worker and submit jobs; the process tree looks like:
PID TTY STAT TIME COMMAND
3757327 ? S 0:00 sshd: my_username@notty
3757328 ? Rs 0:00 \_ ps f -u my_username
3752957 ? Ss 0:00 -bash
3753025 ? S 0:00 \_ /bin/bash /var/spool/pbs/mom_priv/jobs/7963.login-node.SC
3753050 ? S 0:00 \_ /bin/bash ./ray-launcher.sh pytorch.py -n 4 --use-gpu True
3753066 ? Sl 0:00 \_ mpiexec -np 6 ./ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753069 ? Ss 0:00 /usr/sbin/palsd
3753072 ? S 0:00 \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753077 ? S 0:00 | \_ /bin/bash ./ray-head.sh
3753108 ? Sl 0:00 | \_ Apptainer runtime parent
3753120 ? Sl 0:00 | \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --head --block --include-dashboard False --port=34438 --node-ip-address=10.150.0.31 --node-manager-port=43925 --object-manager-port=33464 --ray-client-server-port=45596 --redis-shard-ports= --min-worker-port=50014 --max-worker-port=50114 --log-style=record --num-gpus=0 --num-cpus 3
3753144 ? Sl 0:01 | | \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --config_list=eyJvYmplY3Rfc3BpbGxpbmdfY29uZmlnIjogIntcInR5cGVcIjogXCJmaWxlc3lzdGVtXCIsIFwicGFyYW1zXCI6IHtcImRpcmVjdG9yeV9wYXRoXCI6IFwiL2x1c3RyZS9ob21lL21rdmFraWMvcmF5L3Nlc3Npb25fMjAyMy0wNC0yN18wNi0yMC0zN184MjMzMzlfMzc1MzEyMFwifX0iLCAiaXNfZXh0ZXJuYWxfc3RvcmFnZV90eXBlX2ZzIjogdHJ1ZX0= --gcs_server_port=34438 --metrics-agent-port=46221 --node-ip-address=10.150.0.31 --session-name=session_2023-04-27_06-20-37_823339_3753120
3753292 ? Sl 0:00 | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --monitor-ip=10.150.0.31
3753365 ? Sl 0:00 | | \_ /home/ray/anaconda3/bin/python -m ray.util.client.server --address=10.150.0.31:34438 --host=0.0.0.0 --port=45596 --mode=proxy --metrics-agent-port=46221
3753438 ? Sl 0:00 | | \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=localhost --port=8265 --port-retries=0 --temp-dir=${RAY_TMPDIR}/ray --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --modules-to-load=UsageStatsHead --disable-frontend
3753656 ? Sl 0:00 | | \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --object_manager_port=33464 --min_worker_port=50014 --max_worker_port=50114 --node_manager_port=43925 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=3 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,3,memory,360757881447,object_store_memory,158896234905 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=46221 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=46221 --metrics_export_port=39641 --object_store_memory=158896234905 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=39641 --dashboard-agent-port=46221 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs 
--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --node-name=10.150.0.31
3753908 ? Sl 0:01 | | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=39641 --dashboard-agent-port=46221 --listen-port=52365 --node-manager-port=43925 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756296 ? SNl 0:02 | | | \_ ray::_Inner.train
3753729 ? Sl 0:00 | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3753131 ? Sl 0:02 | \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753073 ? S 0:00 \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753078 ? S 0:00 | \_ /bin/bash ./ray-submit.sh pytorch.py -n 4 --use-gpu True
3756060 ? Sl 0:00 | \_ Apptainer runtime parent
3756075 ? Sl 0:02 | \_ /home/ray/anaconda3/bin/python3 pytorch.py -n 4 --use-gpu True
3756086 ? Sl 0:01 | \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753074 ? S 0:00 \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753080 ? S 0:00 | \_ /bin/bash ./ray-worker.sh
3754116 ? Sl 0:00 | \_ Apptainer runtime parent
3754130 ? Sl 0:00 | \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3754214 ? Sl 0:00 | | \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,370563281716,object_store_memory,158812835020 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=64793 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=64793 --metrics_export_port=61063 --object_store_memory=158812835020 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=61063 --dashboard-agent-port=64793 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3754427 ? Sl 0:02 | | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=61063 --dashboard-agent-port=64793 --listen-port=52365 --node-manager-port=43831 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756462 ? SNl 0:34 | | | \_ ray::RayTrainWorker._RayTrainWorker__execute
3754287 ? Sl 0:00 | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3754141 ? Sl 0:08 | \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753075 ? S 0:00 \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753081 ? S 0:00 | \_ /bin/bash ./ray-worker.sh
3754578 ? Sl 0:00 | \_ Apptainer runtime parent
3754593 ? Sl 0:00 | \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3754659 ? Sl 0:00 | | \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,370365912269,object_store_memory,158728248115 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=59363 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=59363 --metrics_export_port=48326 --object_store_memory=158728248115 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=48326 --dashboard-agent-port=59363 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3754872 ? Sl 0:02 | | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=48326 --dashboard-agent-port=59363 --listen-port=52365 --node-manager-port=36567 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756459 ? SNl 0:34 | | | \_ ray::RayTrainWorker._RayTrainWorker__execute
3754732 ? Sl 0:00 | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3754604 ? Sl 0:08 | \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753076 ? S 0:00 \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753087 ? S 0:00 | \_ /bin/bash ./ray-worker.sh
3755024 ? Sl 0:00 | \_ Apptainer runtime parent
3755039 ? Sl 0:00 | \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3755105 ? Sl 0:00 | | \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,370169099060,object_store_memory,158643899596 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=59054 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=59054 --metrics_export_port=62448 --object_store_memory=158643899596 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=62448 --dashboard-agent-port=59054 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3755318 ? Sl 0:02 | | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=62448 --dashboard-agent-port=59054 --listen-port=52365 --node-manager-port=43811 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756461 ? SNl 0:34 | | | \_ ray::RayTrainWorker._RayTrainWorker__execute
3755178 ? Sl 0:00 | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3755050 ? Sl 0:08 | \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753079 ? S 0:00 \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753088 ? S 0:00 \_ /bin/bash ./ray-worker.sh
3755483 ? Sl 0:00 \_ Apptainer runtime parent
3755498 ? Sl 0:00 \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3755564 ? Sl 0:00 | \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,369969616487,object_store_memory,158558407065 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=64035 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=64035 --metrics_export_port=41579 --object_store_memory=158558407065 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=41579 --dashboard-agent-port=64035 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3755906 ? Sl 0:02 | | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=41579 --dashboard-agent-port=64035 --listen-port=52365 --node-manager-port=33413 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756201 ? SNl 0:00 | | \_ ray::IDLE
3756460 ? SNl 0:34 | | \_ ray::RayTrainWorker._RayTrainWorker__execute
3755637 ? Sl 0:00 | \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3755509 ? Sl 0:08 \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3755735 ? Ss 0:00 /usr/lib/systemd/systemd --user
3755737 ? S 0:00 \_ (sd-pam)
UPDATE - The problem does not seem to occur if the first network interface is used. However, this is not an option, as that interface is used for system services and has far worse bandwidth and latency.
UPDATE^2 - For anyone who stumbles upon the same issue: the problem was fixed by initializing Ray with:
...
ray.init(address='auto', _node_ip_address=os.environ['NODE_IP_ADDRESS'])
...
where NODE_IP_ADDRESS corresponds to the IP address of the network interface actually in use (in my case hsn0).
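As a minimal sketch of how such an environment variable could be populated before calling ray.init (this assumes psutil is available and that hsn0 is the high-speed interface; the exact mechanism may differ):
import os
import socket

import psutil
import ray

def interface_ip(ifname: str) -> str:
    # psutil.net_if_addrs() maps interface names to their address entries;
    # return the first IPv4 (AF_INET) address on the requested interface.
    for addr in psutil.net_if_addrs()[ifname]:
        if addr.family == socket.AF_INET:
            return addr.address
    raise RuntimeError(f"No IPv4 address found on interface {ifname}")

# "hsn0" is an assumption here; substitute the interface Ray should bind to.
os.environ["NODE_IP_ADDRESS"] = interface_ip("hsn0")
ray.init(address="auto", _node_ip_address=os.environ["NODE_IP_ADDRESS"])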
What happened + What you expected to happen
I am trying to run the initial example from the Ray Tune docs, substituting the CIFAR dataset for the MNIST dataset. I am running this on an HPC cluster using SLURM. I expect to get the final results of the hyperparameter optimisation, but instead I get the following error:
Given that the final line of the stack trace comes from NodeAffinitySchedulingStrategy, I have tried both the ASHA and HyperBand schedulers, but the same error still occurs. Would you know what the issue might be?
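For reference, a minimal sketch (not the actual reproduction script) of how the two trial schedulers are swapped in Ray Tune 2.3; the trainable below is a hypothetical stand-in for the CIFAR training loop, and the scheduler choice only controls trial early stopping, not the internal NodeAffinitySchedulingStrategy placement that appears in the stack trace:
import ray
from ray import tune
from ray.air import session
from ray.tune.schedulers import ASHAScheduler, HyperBandScheduler

def trainable(config):
    # Hypothetical stand-in for the CIFAR training loop from the Tune docs example.
    for step in range(10):
        session.report({"loss": config["lr"] * step})

ray.init(address="auto")
tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        # Swap ASHAScheduler() for HyperBandScheduler() to change the trial scheduler.
        scheduler=ASHAScheduler(),
        metric="loss",
        mode="min",
        num_samples=4,
    ),
)
results = tuner.fit()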
Versions / Dependencies
Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_kmp_llvm conda-forge
aiosignal 1.3.1 pypi_0 pypi
attrs 22.2.0 pypi_0 pypi
blas 1.0 mkl
bottleneck 1.3.5 py310ha9d4c09_0 anaconda
brotli 1.0.9 h5eee18b_7
brotli-bin 1.0.9 h5eee18b_7
brotlipy 0.7.0 py310h7f8727e_1002
bzip2 1.0.8 h7b6447c_0
c-ares 1.18.1 h7f8727e_0
ca-certificates 2022.12.7 ha878542_0 conda-forge
cached-property 1.5.2 hd8ed1ab_1 conda-forge
cached_property 1.5.2 pyha770c72_1 conda-forge
cartopy 0.18.0 py310h95ad73f_2
cdsapi 0.5.1 pypi_0 pypi
certifi 2022.12.7 pyhd8ed1ab_0 conda-forge
cf-plot 3.1.28 pyhd8ed1ab_0 conda-forge
cf-python 3.13.1 py310h5764c6d_0 conda-forge
cfdm 1.9.0.4 py310hff52083_1 conda-forge
cffi 1.15.1 py310h74dc2b5_0
cftime 1.6.2 py310hde88566_1 conda-forge
cfunits 3.3.5 pyhd8ed1ab_0 conda-forge
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.1.3 pypi_0 pypi
cloudpickle 2.2.0 pypi_0 pypi
cryptography 38.0.1 py310h9ce1e76_0
cuda 11.7.1 0 nvidia
cuda-cccl 11.7.91 0 nvidia
cuda-command-line-tools 11.7.1 0 nvidia
cuda-compiler 11.7.1 0 nvidia
cuda-cudart 11.7.99 0 nvidia
cuda-cudart-dev 11.7.99 0 nvidia
cuda-cuobjdump 11.7.91 0 nvidia
cuda-cupti 11.7.101 0 nvidia
cuda-cuxxfilt 11.7.91 0 nvidia
cuda-demo-suite 11.8.86 0 nvidia
cuda-documentation 11.8.86 0 nvidia
cuda-driver-dev 11.7.99 0 nvidia
cuda-gdb 11.8.86 0 nvidia
cuda-libraries 11.7.1 0 nvidia
cuda-libraries-dev 11.7.1 0 nvidia
cuda-memcheck 11.8.86 0 nvidia
cuda-nsight 11.8.86 0 nvidia
cuda-nsight-compute 11.8.0 0 nvidia
cuda-nvcc 11.7.99 0 nvidia
cuda-nvdisasm 11.8.86 0 nvidia
cuda-nvml-dev 11.7.91 0 nvidia
cuda-nvprof 11.8.87 0 nvidia
cuda-nvprune 11.7.91 0 nvidia
cuda-nvrtc 11.7.99 0 nvidia
cuda-nvrtc-dev 11.7.99 0 nvidia
cuda-nvtx 11.7.91 0 nvidia
cuda-nvvp 11.8.87 0 nvidia
cuda-runtime 11.7.1 0 nvidia
cuda-sanitizer-api 11.8.86 0 nvidia
cuda-toolkit 11.7.1 0 nvidia
cuda-tools 11.7.1 0 nvidia
cuda-visual-tools 11.7.1 0 nvidia
curl 7.85.0 h5eee18b_0
cycler 0.11.0 pyhd3eb1b0_0
dbus 1.13.18 hb2f20db_0
distlib 0.3.6 pypi_0 pypi
esmf 8.4.0 mpi_mpich_h5a1934d_101 conda-forge
esmpy 8.4.0 mpi_mpich_py310h515c5ea_101 conda-forge
expat 2.4.9 h6a678d5_0
ffmpeg 4.3 hf484d3e_0 pytorch
fftw 3.3.9 h27cfd23_1
filelock 3.9.0 pypi_0 pypi
fontconfig 2.13.1 h6c09931_0
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.12.1 h4a9f257_0
frozenlist 1.3.3 pypi_0 pypi
gds-tools 1.4.0.31 0 nvidia
geos 3.8.0 he6710b0_0
giflib 5.2.1 h7b6447c_0
glib 2.69.1 h4ff587b_1
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
grpcio 1.51.3 pypi_0 pypi
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
h5py 3.7.0 nompi_py310h416281c_102 conda-forge
hdf4 4.2.15 h9772cbc_5 conda-forge
hdf5 1.12.2 mpi_mpich_h08b82f9_0 conda-forge
icu 58.2 he6710b0_3
idna 3.4 py310h06a4308_0
intel-openmp 2021.4.0 h06a4308_3561
joblib 1.1.0 pyhd3eb1b0_0 anaconda
jpeg 9e h7f8727e_0
jsonschema 4.17.3 pypi_0 pypi
kiwisolver 1.4.2 py310h295c915_0
krb5 1.19.2 hac12032_0
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libbrotlicommon 1.0.9 h5eee18b_7
libbrotlidec 1.0.9 h5eee18b_7
libbrotlienc 1.0.9 h5eee18b_7
libclang 10.0.1 default_hb85057a_2
libcublas 11.11.3.6 0 nvidia
libcublas-dev 11.11.3.6 0 nvidia
libcufft 10.9.0.58 0 nvidia
libcufft-dev 10.9.0.58 0 nvidia
libcufile 1.4.0.31 0 nvidia
libcufile-dev 1.4.0.31 0 nvidia
libcurand 10.3.0.86 0 nvidia
libcurand-dev 10.3.0.86 0 nvidia
libcurl 7.85.0 h91b91d3_0
libcusolver 11.4.1.48 0 nvidia
libcusolver-dev 11.4.1.48 0 nvidia
libcusparse 11.7.5.86 0 nvidia
libcusparse-dev 11.7.5.86 0 nvidia
libdeflate 1.8 h7f8727e_5
libedit 3.1.20210910 h7f8727e_0
libev 4.33 h7f8727e_1
libevent 2.1.12 h8f2d780_0
libffi 3.3 he6710b0_2
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libllvm10 10.0.1 hbcb73fb_5
libnetcdf 4.8.1 mpi_mpich_h06c54e2_4 conda-forge
libnghttp2 1.46.0 hce63b2e_0
libnpp 11.8.0.86 0 nvidia
libnpp-dev 11.8.0.86 0 nvidia
libnvjpeg 11.9.0.86 0 nvidia
libnvjpeg-dev 11.9.0.86 0 nvidia
libpng 1.6.37 hbc83047_0
libpq 12.9 h16c4e8d_3
libssh2 1.10.0 h8f2d780_0
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libtasn1 4.16.0 h27cfd23_0
libtiff 4.4.0 hecacb30_0
libunistring 0.9.10 h27cfd23_0
libuuid 1.0.3 h7f8727e_2
libwebp 1.2.4 h11a3e52_0
libwebp-base 1.2.4 h5eee18b_0
libxcb 1.15 h7f8727e_0
libxkbcommon 1.0.1 hfa300c1_0
libxml2 2.9.14 h74e7548_0
libxslt 1.1.35 h4e12654_0
libzip 1.9.2 hc869a4a_1 conda-forge
libzlib 1.2.13 h166bdaf_4 conda-forge
llvm-openmp 14.0.6 h9e868ea_0
lz4-c 1.9.3 h295c915_1
matplotlib 3.5.2 py310h06a4308_0
matplotlib-base 3.5.2 py310hf590b9c_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py310h7f8727e_0
mkl_fft 1.3.1 py310hd6ae3a3_0
mkl_random 1.2.2 py310h00e6091_0
mpi 1.0 mpich conda-forge
mpi4py 3.1.4 py310h37cc914_0 conda-forge
mpich 4.0.3 h846660c_100 conda-forge
msgpack 1.0.4 pypi_0 pypi
munkres 1.1.4 py_0
ncurses 6.3 h5eee18b_3
netcdf-flattener 1.2.0 pyh9f0ad1d_0 conda-forge
netcdf-fortran 4.6.0 mpi_mpich_hd09bd1e_1 conda-forge
netcdf4 1.6.2 nompi_py310h55e1e36_100 conda-forge
nettle 3.7.3 hbbd107a_1
nsight-compute 2022.3.0.22 0 nvidia
nspr 4.33 h295c915_0
nss 3.74 h0370c37_0
numexpr 2.8.3 py310hcea2de6_0 anaconda
numpy 1.23.3 py310hd5efca6_0
numpy-base 1.23.3 py310h8e6c178_0
opencv-python-headless 4.6.0.66 pypi_0 pypi
openh264 2.1.1 h4ff587b_0
openssl 1.1.1s h0b41bf4_1 conda-forge
packaging 21.3 pyhd3eb1b0_0
pandas 1.4.3 py310h6a678d5_0 anaconda
parallelio 2.5.9 mpi_mpich_h50e6f33_101 conda-forge
pcre 8.45 h295c915_0
pillow 9.2.0 py310hace64e9_1
pip 22.2.2 py310h06a4308_0
platformdirs 3.0.0 pypi_0 pypi
ply 3.11 py310h06a4308_0
proj 7.2.0 h277dcde_2 conda-forge
protobuf 3.20.1 pypi_0 pypi
psutil 5.9.4 py310h5764c6d_0 conda-forge
pycparser 2.21 pyhd3eb1b0_0
pyopenssl 22.0.0 pyhd3eb1b0_0
pyparsing 3.0.9 py310h06a4308_0
pyqt 5.15.7 py310h6a678d5_1
pyqt5-sip 12.11.0 pypi_0 pypi
pyrsistent 0.19.3 pypi_0 pypi
pyshp 2.3.1 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 py310h06a4308_0
python 3.10.0 h12debd9_5
python-dateutil 2.8.2 pyhd3eb1b0_0
python_abi 3.10 2_cp310 conda-forge
pytorch 1.13.0 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h67b0de4_0 pytorch
pytorch-model-summary 0.1.1 py_0 conda-forge
pytorch-mutex 1.0 cuda pytorch
pytz 2022.1 py310h06a4308_0 anaconda
pyyaml 6.0 pypi_0 pypi
qt-main 5.15.2 h327a75a_7
qt-webengine 5.15.9 hd2b0992_4
qtwebkit 5.212 h4eab89a_4
ray 2.3.0 pypi_0 pypi
readline 8.2 h5eee18b_0
requests 2.28.1 py310h06a4308_0
scikit-learn 1.1.1 py310h6a678d5_0 anaconda
scipy 1.9.1 py310hd5efca6_0
setuptools 65.5.0 py310h06a4308_0
shapely 1.8.4 py310h81ba7c5_0
sip 6.6.2 py310h6a678d5_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.39.3 h5082296_0
tabulate 0.9.0 pypi_0 pypi
tempest-extremes 2.2.1 mpi_mpich_h9b66f1e_0 conda-forge
tensorboardx 2.5.1 pypi_0 pypi
threadpoolctl 2.2.0 pyh0d69192_0 anaconda
tk 8.6.12 h1ccaba5_0
toml 0.10.2 pyhd3eb1b0_0
torch-metrics 1.1.7 pypi_0 pypi
torch-summary 1.4.5 pypi_0 pypi
torchaudio 0.13.0 py310_cu117 pytorch
torchmetrics 0.11.0 pypi_0 pypi
torchvision 0.14.0 py310_cu117 pytorch
tornado 6.2 py310h5eee18b_0
tqdm 4.64.1 py310h06a4308_0
typing_extensions 4.3.0 py310h06a4308_0
tzdata 2022e h04d1e81_0
udunits2 2.2.28 hc3e0081_0 conda-forge
urllib3 1.26.12 py310h06a4308_0
virtualenv 20.19.0 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.6 h5eee18b_0
yacs 0.1.8 pypi_0 pypi
yaml 0.2.5 h7b6447c_0 anaconda
zlib 1.2.13 h166bdaf_4 conda-forge
zstd 1.5.2 ha4553b6_0
Reproduction script
My python script is as follows:
I am running this python script using the following SLURM script:
I am starting my head node using:
and my worker nodes using:
Issue Severity
High: It blocks me from completing my task.