[Tune] 'NoneType' object has no attribute 'hex' #32921

Closed · dangalea closed this 1 year ago

dangalea commented 1 year ago

What happened + What you expected to happen

I am trying to run the initial example from the Ray Tune docs, substituting the MNIST dataset for the CIFAR dataset, on an HPC cluster using SLURM. I expect the final results of the hyperparameter optimisation to be returned, but instead I get the following error:

Traceback (most recent call last):
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 1544, in stop_trial
    self._callbacks.on_trial_complete(
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/callback.py", line 360, in on_trial_complete
    callback.on_trial_complete(**info)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/syncer.py", line 731, in on_trial_complete
    self._sync_trial_dir(trial, force=True, wait=True)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/syncer.py", line 685, in _sync_trial_dir
    sync_process.wait()
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/syncer.py", line 237, in wait
    raise exception
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/syncer.py", line 200, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/tune/utils/file_transfer.py", line 175, in _sync_dir_between_different_nodes
    num_cpus=0, **_force_on_node(target_node_id)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/air/util/node.py", line 35, in _force_on_node
    scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/util/scheduling_strategies.py", line 61, in __init__
    node_id = node_id.hex()
AttributeError: 'NoneType' object has no attribute 'hex'

Given that the final line of the stack trace comes from NodeAffinitySchedulingStrategy, I have tried both the ASHA and HyperBand schedulers, but the same error still occurs. Would you know what the issue might be?
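
For reference, the error itself can be reproduced in isolation once the node ID lookup comes back empty; here is a minimal sketch, assuming Ray 2.3, where _force_on_node passes the looked-up node ID straight into the scheduling strategy (as the traceback shows):

from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

# A failed IP -> node ID lookup yields None, and passing None here triggers
# the same error as in the traceback above.
NodeAffinitySchedulingStrategy(node_id=None, soft=False)
# AttributeError: 'NoneType' object has no attribute 'hex'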

Versions / Dependencies

Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_kmp_llvm conda-forge
aiosignal 1.3.1 pypi_0 pypi
attrs 22.2.0 pypi_0 pypi
blas 1.0 mkl
bottleneck 1.3.5 py310ha9d4c09_0 anaconda
brotli 1.0.9 h5eee18b_7
brotli-bin 1.0.9 h5eee18b_7
brotlipy 0.7.0 py310h7f8727e_1002
bzip2 1.0.8 h7b6447c_0
c-ares 1.18.1 h7f8727e_0
ca-certificates 2022.12.7 ha878542_0 conda-forge
cached-property 1.5.2 hd8ed1ab_1 conda-forge
cached_property 1.5.2 pyha770c72_1 conda-forge
cartopy 0.18.0 py310h95ad73f_2
cdsapi 0.5.1 pypi_0 pypi
certifi 2022.12.7 pyhd8ed1ab_0 conda-forge
cf-plot 3.1.28 pyhd8ed1ab_0 conda-forge
cf-python 3.13.1 py310h5764c6d_0 conda-forge
cfdm 1.9.0.4 py310hff52083_1 conda-forge
cffi 1.15.1 py310h74dc2b5_0
cftime 1.6.2 py310hde88566_1 conda-forge
cfunits 3.3.5 pyhd8ed1ab_0 conda-forge
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.1.3 pypi_0 pypi
cloudpickle 2.2.0 pypi_0 pypi
cryptography 38.0.1 py310h9ce1e76_0
cuda 11.7.1 0 nvidia
cuda-cccl 11.7.91 0 nvidia
cuda-command-line-tools 11.7.1 0 nvidia
cuda-compiler 11.7.1 0 nvidia
cuda-cudart 11.7.99 0 nvidia
cuda-cudart-dev 11.7.99 0 nvidia
cuda-cuobjdump 11.7.91 0 nvidia
cuda-cupti 11.7.101 0 nvidia
cuda-cuxxfilt 11.7.91 0 nvidia
cuda-demo-suite 11.8.86 0 nvidia
cuda-documentation 11.8.86 0 nvidia
cuda-driver-dev 11.7.99 0 nvidia
cuda-gdb 11.8.86 0 nvidia
cuda-libraries 11.7.1 0 nvidia
cuda-libraries-dev 11.7.1 0 nvidia
cuda-memcheck 11.8.86 0 nvidia
cuda-nsight 11.8.86 0 nvidia
cuda-nsight-compute 11.8.0 0 nvidia
cuda-nvcc 11.7.99 0 nvidia
cuda-nvdisasm 11.8.86 0 nvidia
cuda-nvml-dev 11.7.91 0 nvidia
cuda-nvprof 11.8.87 0 nvidia
cuda-nvprune 11.7.91 0 nvidia
cuda-nvrtc 11.7.99 0 nvidia
cuda-nvrtc-dev 11.7.99 0 nvidia
cuda-nvtx 11.7.91 0 nvidia
cuda-nvvp 11.8.87 0 nvidia
cuda-runtime 11.7.1 0 nvidia
cuda-sanitizer-api 11.8.86 0 nvidia
cuda-toolkit 11.7.1 0 nvidia
cuda-tools 11.7.1 0 nvidia
cuda-visual-tools 11.7.1 0 nvidia
curl 7.85.0 h5eee18b_0
cycler 0.11.0 pyhd3eb1b0_0
dbus 1.13.18 hb2f20db_0
distlib 0.3.6 pypi_0 pypi
esmf 8.4.0 mpi_mpich_h5a1934d_101 conda-forge
esmpy 8.4.0 mpi_mpich_py310h515c5ea_101 conda-forge
expat 2.4.9 h6a678d5_0
ffmpeg 4.3 hf484d3e_0 pytorch
fftw 3.3.9 h27cfd23_1
filelock 3.9.0 pypi_0 pypi
fontconfig 2.13.1 h6c09931_0
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.12.1 h4a9f257_0
frozenlist 1.3.3 pypi_0 pypi
gds-tools 1.4.0.31 0 nvidia
geos 3.8.0 he6710b0_0
giflib 5.2.1 h7b6447c_0
glib 2.69.1 h4ff587b_1
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
grpcio 1.51.3 pypi_0 pypi
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
h5py 3.7.0 nompi_py310h416281c_102 conda-forge
hdf4 4.2.15 h9772cbc_5 conda-forge
hdf5 1.12.2 mpi_mpich_h08b82f9_0 conda-forge
icu 58.2 he6710b0_3
idna 3.4 py310h06a4308_0
intel-openmp 2021.4.0 h06a4308_3561
joblib 1.1.0 pyhd3eb1b0_0 anaconda
jpeg 9e h7f8727e_0
jsonschema 4.17.3 pypi_0 pypi
kiwisolver 1.4.2 py310h295c915_0
krb5 1.19.2 hac12032_0
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libbrotlicommon 1.0.9 h5eee18b_7
libbrotlidec 1.0.9 h5eee18b_7
libbrotlienc 1.0.9 h5eee18b_7
libclang 10.0.1 default_hb85057a_2
libcublas 11.11.3.6 0 nvidia
libcublas-dev 11.11.3.6 0 nvidia
libcufft 10.9.0.58 0 nvidia
libcufft-dev 10.9.0.58 0 nvidia
libcufile 1.4.0.31 0 nvidia
libcufile-dev 1.4.0.31 0 nvidia
libcurand 10.3.0.86 0 nvidia
libcurand-dev 10.3.0.86 0 nvidia
libcurl 7.85.0 h91b91d3_0
libcusolver 11.4.1.48 0 nvidia
libcusolver-dev 11.4.1.48 0 nvidia
libcusparse 11.7.5.86 0 nvidia
libcusparse-dev 11.7.5.86 0 nvidia
libdeflate 1.8 h7f8727e_5
libedit 3.1.20210910 h7f8727e_0
libev 4.33 h7f8727e_1
libevent 2.1.12 h8f2d780_0
libffi 3.3 he6710b0_2
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgfortran-ng 11.2.0 h00389a5_1
libgfortran5 11.2.0 h1234567_1
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libllvm10 10.0.1 hbcb73fb_5
libnetcdf 4.8.1 mpi_mpich_h06c54e2_4 conda-forge
libnghttp2 1.46.0 hce63b2e_0
libnpp 11.8.0.86 0 nvidia
libnpp-dev 11.8.0.86 0 nvidia
libnvjpeg 11.9.0.86 0 nvidia
libnvjpeg-dev 11.9.0.86 0 nvidia
libpng 1.6.37 hbc83047_0
libpq 12.9 h16c4e8d_3
libssh2 1.10.0 h8f2d780_0
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libtasn1 4.16.0 h27cfd23_0
libtiff 4.4.0 hecacb30_0
libunistring 0.9.10 h27cfd23_0
libuuid 1.0.3 h7f8727e_2
libwebp 1.2.4 h11a3e52_0
libwebp-base 1.2.4 h5eee18b_0
libxcb 1.15 h7f8727e_0
libxkbcommon 1.0.1 hfa300c1_0
libxml2 2.9.14 h74e7548_0
libxslt 1.1.35 h4e12654_0
libzip 1.9.2 hc869a4a_1 conda-forge
libzlib 1.2.13 h166bdaf_4 conda-forge
llvm-openmp 14.0.6 h9e868ea_0
lz4-c 1.9.3 h295c915_1
matplotlib 3.5.2 py310h06a4308_0
matplotlib-base 3.5.2 py310hf590b9c_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py310h7f8727e_0
mkl_fft 1.3.1 py310hd6ae3a3_0
mkl_random 1.2.2 py310h00e6091_0
mpi 1.0 mpich conda-forge
mpi4py 3.1.4 py310h37cc914_0 conda-forge
mpich 4.0.3 h846660c_100 conda-forge
msgpack 1.0.4 pypi_0 pypi
munkres 1.1.4 py_0
ncurses 6.3 h5eee18b_3
netcdf-flattener 1.2.0 pyh9f0ad1d_0 conda-forge
netcdf-fortran 4.6.0 mpi_mpich_hd09bd1e_1 conda-forge
netcdf4 1.6.2 nompi_py310h55e1e36_100 conda-forge
nettle 3.7.3 hbbd107a_1
nsight-compute 2022.3.0.22 0 nvidia
nspr 4.33 h295c915_0
nss 3.74 h0370c37_0
numexpr 2.8.3 py310hcea2de6_0 anaconda
numpy 1.23.3 py310hd5efca6_0
numpy-base 1.23.3 py310h8e6c178_0
opencv-python-headless 4.6.0.66 pypi_0 pypi
openh264 2.1.1 h4ff587b_0
openssl 1.1.1s h0b41bf4_1 conda-forge
packaging 21.3 pyhd3eb1b0_0
pandas 1.4.3 py310h6a678d5_0 anaconda
parallelio 2.5.9 mpi_mpich_h50e6f33_101 conda-forge
pcre 8.45 h295c915_0
pillow 9.2.0 py310hace64e9_1
pip 22.2.2 py310h06a4308_0
platformdirs 3.0.0 pypi_0 pypi
ply 3.11 py310h06a4308_0
proj 7.2.0 h277dcde_2 conda-forge
protobuf 3.20.1 pypi_0 pypi
psutil 5.9.4 py310h5764c6d_0 conda-forge
pycparser 2.21 pyhd3eb1b0_0
pyopenssl 22.0.0 pyhd3eb1b0_0
pyparsing 3.0.9 py310h06a4308_0
pyqt 5.15.7 py310h6a678d5_1
pyqt5-sip 12.11.0 pypi_0 pypi
pyrsistent 0.19.3 pypi_0 pypi
pyshp 2.3.1 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 py310h06a4308_0
python 3.10.0 h12debd9_5
python-dateutil 2.8.2 pyhd3eb1b0_0
python_abi 3.10 2_cp310 conda-forge
pytorch 1.13.0 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h67b0de4_0 pytorch
pytorch-model-summary 0.1.1 py_0 conda-forge
pytorch-mutex 1.0 cuda pytorch
pytz 2022.1 py310h06a4308_0 anaconda
pyyaml 6.0 pypi_0 pypi
qt-main 5.15.2 h327a75a_7
qt-webengine 5.15.9 hd2b0992_4
qtwebkit 5.212 h4eab89a_4
ray 2.3.0 pypi_0 pypi
readline 8.2 h5eee18b_0
requests 2.28.1 py310h06a4308_0
scikit-learn 1.1.1 py310h6a678d5_0 anaconda
scipy 1.9.1 py310hd5efca6_0
setuptools 65.5.0 py310h06a4308_0
shapely 1.8.4 py310h81ba7c5_0
sip 6.6.2 py310h6a678d5_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.39.3 h5082296_0
tabulate 0.9.0 pypi_0 pypi
tempest-extremes 2.2.1 mpi_mpich_h9b66f1e_0 conda-forge
tensorboardx 2.5.1 pypi_0 pypi
threadpoolctl 2.2.0 pyh0d69192_0 anaconda
tk 8.6.12 h1ccaba5_0
toml 0.10.2 pyhd3eb1b0_0
torch-metrics 1.1.7 pypi_0 pypi
torch-summary 1.4.5 pypi_0 pypi
torchaudio 0.13.0 py310_cu117 pytorch
torchmetrics 0.11.0 pypi_0 pypi
torchvision 0.14.0 py310_cu117 pytorch
tornado 6.2 py310h5eee18b_0
tqdm 4.64.1 py310h06a4308_0
typing_extensions 4.3.0 py310h06a4308_0
tzdata 2022e h04d1e81_0
udunits2 2.2.28 hc3e0081_0 conda-forge
urllib3 1.26.12 py310h06a4308_0
virtualenv 20.19.0 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.6 h5eee18b_0
yacs 0.1.8 pypi_0 pypi
yaml 0.2.5 h7b6447c_0 anaconda
zlib 1.2.13 h166bdaf_4 conda-forge
zstd 1.5.2 ha4553b6_0

Reproduction script

My python script is as follows:

import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from filelock import FileLock
from torch.utils.data import random_split
from torchvision import datasets
import torchvision.transforms as transforms
import ray
from ray import tune
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.tune.schedulers import ASHAScheduler, HyperBandScheduler

def get_data_loaders():
    mnist_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307, ), (0.3081, ))])

    # We add FileLock here because multiple workers will want to
    # download data, and this may cause overwrites since
    # DataLoader is not threadsafe.
    with FileLock(os.path.expanduser("~/data.lock")):
        train_loader = torch.utils.data.DataLoader(
            datasets.MNIST("~/data", train=True, download=True, transform=mnist_transforms), batch_size=64,  shuffle=True)
        test_loader = torch.utils.data.DataLoader(
            datasets.MNIST("~/data", train=False, download=True, transform=mnist_transforms), batch_size=64, shuffle=True)
    return train_loader, test_loader

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)

def train_cifar(config):
    net = Net()

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=config["momentum"])

    # To restore a checkpoint, use `session.get_checkpoint()`.
    loaded_checkpoint = session.get_checkpoint()
    if loaded_checkpoint:
        with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
           model_state, optimizer_state = torch.load(os.path.join(loaded_checkpoint_dir, "checkpoint.pt"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    data_dir = os.path.abspath("./data")

    trainloader, valloader = get_data_loaders()

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1, running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        # Here we save a checkpoint. It is automatically registered with
        # Ray Tune and can be accessed through `session.get_checkpoint()`
        # API in future iterations.
        os.makedirs("my_model", exist_ok=True)
        torch.save((net.state_dict(), optimizer.state_dict()), "my_model/checkpoint.pt")
        checkpoint = Checkpoint.from_directory("my_model")
        session.report({"loss": (val_loss / val_steps), "accuracy": correct / total})#, checkpoint=checkpoint)
    print("Finished Training")

def test_best_model(best_result):
    best_trained_model = Net()
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    best_trained_model.to(device)

    checkpoint_path = os.path.join(best_result.checkpoint.to_directory(), "checkpoint.pt")

    model_state, optimizer_state = torch.load(checkpoint_path)
    best_trained_model.load_state_dict(model_state)

    trainloader, testloader = get_data_loaders()

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = best_trained_model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print("Best trial test set accuracy: {}".format(correct / total))

def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    config = {"lr": tune.loguniform(1e-4, 1e-1),  "momentum": tune.uniform(0.1, 0.9)}
    scheduler = HyperBandScheduler()

    tuner = tune.Tuner(tune.with_resources(tune.with_parameters(train_cifar), resources={"cpu": 2, "gpu": gpus_per_trial}),
        tune_config=tune.TuneConfig(metric="loss", mode="min", scheduler=scheduler, num_samples=num_samples),
        param_space=config,)
    results = tuner.fit()

    best_result = results.get_best_result("loss", "min")

    print("Best trial config: {}".format(best_result.config))
    print("Best trial final validation loss: {}".format(
        best_result.metrics["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_result.metrics["accuracy"]))

    test_best_model(best_result)

if __name__ == "__main__":
    main(num_samples=2, max_num_epochs=2, gpus_per_trial=0)

I am running this python script using the following SLURM script:

#!/bin/bash

#SBATCH --job-name=ray
#SBATCH --output=ray.out
#SBATCH --error=ray.err
#SBATCH --time=24:00:00
#SBATCH --partition=pbatch
#SBATCH -A cbronze

### This script works for any number of nodes, Ray will find and manage all resources
#SBATCH --ntasks=4

### Give all resources to a single Ray task, ray can manage the resources internally
#SBATCH --ntasks-per-node=2
###SBATCH --gpus-per-task=1
###SBATCH --cpus-per-task=36

. /usr/workspace/galea1/anaconda3/etc/profile.d/conda.sh
conda activate tracking

redis_password=$(uuidgen)
export redis_password

nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )

node_1=${nodes_array[0]} 
ip=$node_1
port=6379
ip_head=$ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "STARTING HEAD at $node_1"
srun --nodes=1 --ntasks=1 -w $node_1 start-head.sh $ip $redis_password &
sleep 30

worker_num=$(($SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node
for ((  i=1; i<=$worker_num; i++ ))
do
  node_i=${nodes_array[$i]}
  echo "STARTING WORKER $i at $node_i"
  srun --nodes=1 --ntasks=1 -w $node_i start-worker.sh $ip_head $redis_password &
  sleep 5
done
##############################################################################################

#### call your code below
python mnist.py --cuda
exit

I am starting my head node using:

#!/bin/bash

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

echo "starting ray head node"
# Launch the head node
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2
sleep infinity

and my worker nodes using:

#!/bin/bash

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

echo "starting ray worker node"
ray start --address $1 --redis-password=$2
sleep infinity

Issue Severity

High: It blocks me from completing my task.

justinvyu commented 1 year ago

Hi @dangalea, is your ray version the same (2.3.0) on all nodes?

dangalea commented 1 year ago

Hi @justinvyu, yes they are all using the same environment.

justinvyu commented 1 year ago

Could you try printing out the list of nodes in the cluster, on each node?

Something like:

# Run on your head node
import ray
from ray.air.util.node import _force_on_node

ray.init()

@ray.remote
def log():
    print("Me:", ray.get_runtime_context().get_node_id())
    print("Me + everyone else:", [node["NodeID"] for node in ray.nodes()])

# Does your head node see everyone?
assert len(ray.nodes()) == 1  # insert your expected value

for node in ray.nodes():
    # Do your worker nodes see everyone?
    ray.get(log.options(**_force_on_node(node["NodeID"])).remote())

Also, could you try this on your head node?

import ray
from ray.air.util.node import _get_node_id_from_node_ip

ray.init()

print(ray.get_runtime_context().get_node_id())
print(_get_node_id_from_node_ip(ray.util.get_node_ip_address()))

dangalea commented 1 year ago

Hi @justinvyu,

I get the following when executing the first snippet on my nodes:

Me: 361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c
Me + everyone else: ['361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c', '8ffeb23df2477e6c5ca3c2da37c02463257f5b3d2e59bb8d1fd79d1d']
Me: 8ffeb23df2477e6c5ca3c2da37c02463257f5b3d2e59bb8d1fd79d1d
Me + everyone else: ['361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c', '8ffeb23df2477e6c5ca3c2da37c02463257f5b3d2e59bb8d1fd79d1d']

I think this shows that all nodes (2 in my case) can see each other. However, I should have 4 GPUs listed (2 nodes of 2 GPUs each). Does this affect things?

Also, when I run your second snippet, I get:

361e04dc0722ba357d9fa31db01682f56cb1e81e0cf19955dae0e94c
None

I also noticed that I have this in my error output, which may be relevant:

[2023-03-01 09:38:30,168 I 3213126 3213126] global_state_accessor.cc:356: This node has an IP address of 192.168.128.34, while we can not find the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
dangalea commented 1 year ago

I've also tried running this on one node, i.e. with the head and worker node being the same, but the error still persists.

matthewdeng commented 1 year ago

Hey @dangalea , I think the error output you shared may be relevant. Can you try running the following on your head node?

import ray
ray.init()
print(f"Current IP: {ray.util.get_node_ip_address()}")
print(f"Current Node ID: {ray.get_runtime_context().get_node_id()}")
print(f"Nodes: {ray.nodes()}")

justinvyu commented 1 year ago

cc @jjyao

dangalea commented 1 year ago

Hey @matthewdeng, this is what I get:

Current IP: 192.168.128.34
Current Node ID: 420ee8af5b8c66e85474afbdbc13c4bb9eb1bc06138f737d30871bcc
Nodes: [{'NodeID': '420ee8af5b8c66e85474afbdbc13c4bb9eb1bc06138f737d30871bcc', 'Alive': True, 'NodeManagerAddress': 'pascal35', 'NodeManagerHostname': 'pascal35', 'NodeManagerPort': 44637, 'ObjectManagerPort': 46101, 'ObjectStoreSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/plasma_store', 'RayletSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/raylet', 'MetricsExportPort': 47246, 'NodeName': 'pascal35', 'alive': True, 'Resources': {'object_store_memory': 80584793702.0, 'GPU': 2.0, 'CPU': 72.0, 'node:pascal35': 1.0, 'memory': 178031185306.0, 'accelerator_type:P100': 1.0}}, {'NodeID': 'b74500a5af9952196fc6a294cfa984e6212a4ed51d469e287a7a7dfd', 'Alive': True, 'NodeManagerAddress': '192.168.128.35', 'NodeManagerHostname': 'pascal36', 'NodeManagerPort': 41949, 'ObjectManagerPort': 43949, 'ObjectStoreSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/plasma_store', 'RayletSocketName': '/var/tmp/galea1/ray/session_2023-03-01_10-46-31_212388_3224263/sockets/raylet', 'MetricsExportPort': 46173, 'NodeName': '192.168.128.35', 'alive': True, 'Resources': {'accelerator_type:P100': 1.0, 'memory': 188305381376.0, 'object_store_memory': 80702306304.0, 'GPU': 2.0, 'CPU': 72.0, 'node:192.168.128.35': 1.0}}]

matthewdeng commented 1 year ago

Hmm yeah seems like it's because the NodeManagerAddress is pascal35 (which seems to be the host name?) here rather than the IP address.

Head Node: 'NodeManagerAddress': 'pascal35', 'NodeManagerHostname': 'pascal35'
Worker Node: 'NodeManagerAddress': '192.168.128.35', 'NodeManagerHostname': 'pascal36'

@jjyao can you take a look at this and see if NodeManagerAddress should be the IP address instead? Or if the current output is expected, should the logic to map IP to nodeId be changed? https://github.com/ray-project/ray/blob/a892241ca7574af47f278a667e6493a4b03686d7/python/ray/air/util/node.py#L5-L11
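
For context, the lookup under discussion does roughly the following (a simplified sketch of the linked node.py lines, not the exact Ray source):

import ray

def _get_node_id_from_node_ip_sketch(node_ip: str):
    # Compare the given IP against each node's NodeManagerAddress. If that field
    # holds a hostname such as "pascal35" rather than an IP, nothing matches and
    # None is returned, which later fails with .hex() on NoneType.
    for node in ray.nodes():
        if node["NodeManagerAddress"] == node_ip:
            return node["NodeID"]
    return None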

jjyao commented 1 year ago

NodeManagerAddress should be IP. On the head node, @dangalea could you search for

RAY_LOG(INFO) << "Raylet of id, " << self_node_id_
                  << " started. Raylet consists of node_manager and object_manager."
                  << " node_manager address: " << self_node_info_.node_manager_address()
                  << ":" << self_node_info_.node_manager_port()
                  << " object_manager address: " << self_node_info_.node_manager_address()
                  << ":" << self_node_info_.object_manager_port()
                  << " hostname: " << self_node_info_.node_manager_hostname();

in /tmp/ray/session_latest/logs/raylet.out

jjyao commented 1 year ago

Also, could you show the full command of the raylet process on the head node via ps aux | grep raylet?

dangalea commented 1 year ago

@jjyao, I could not find the file at /tmp/ray/session_latest/logs/raylet.out; however, this is what I get for ps aux | grep raylet:

galea1   3228257  1.5  0.0 82858816 29048 ?      Sl   11:33   0:00 /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --store_socket_name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=pascal35 --maximum_startup_concurrency=72 --static_resource_list=node:pascal35,1.0,accelerator_type:P100,1,CPU,72,GPU,2,memory,178010839655,object_store_memory,80576074137 --python_worker_command=/usr/workspace/galea1/conda_envs/envs/tracking/bin/python /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/workers/setup_worker.py /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/workers/default_worker.py --node-ip-address=pascal35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --raylet-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --redis-address=None --temp-dir=/var/tmp/galea1/ray --metrics-agent-port=61074 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=pascal35:6379 --session-name=session_2023-03-01_11-32-59_368536_3227792 --temp-dir=/var/tmp/galea1/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=cf14098f-f540-43fd-8436-2333f52d04a8 --java_worker_command=/usr/workspace/galea1/conda_envs/envs/tracking/bin/python /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/workers/setup_worker.py -Dray.address=pascal35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store -Dray.raylet.socket-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet -Dray.redis.password=cf14098f-f540-43fd-8436-2333f52d04a8 -Dray.node-ip=pascal35 -Dray.home=/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/../.. 
-Dray.logging.dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs -Dray.session-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker --cpp_worker_command= --native_library_path=/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/cpp/lib --temp_dir=/var/tmp/galea1/ray --session_dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 --log_dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs --resource_dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/runtime_resources --metrics-agent-port=61074 --metrics_export_port=62332 --object_store_memory=80576074137 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=pascal35:6379 --session-name=session_2023-03-01_11-32-59_368536_3227792 --agent_command=/usr/workspace/galea1/conda_envs/envs/tracking/bin/python -u /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-address=pascal35 --metrics-export-port=62332 --dashboard-agent-port=61074 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --raylet-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --temp-dir=/var/tmp/galea1/ray --session-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 --runtime-env-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/runtime_resources --log-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-03-01_11-32-59_368536_3227792 --gcs-address=pascal35:6379 --minimal --node-name=pascal35
galea1   3228490  4.1  0.0 3110140 96480 ?       Sl   11:33   0:00 /usr/workspace/galea1/conda_envs/envs/tracking/bin/python -u /usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/dashboard/agent.py --node-ip-address=pascal35 --metrics-export-port=62332 --dashboard-agent-port=61074 --listen-port=52365 --node-manager-port=36579 --object-store-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/plasma_store --raylet-name=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/sockets/raylet --temp-dir=/var/tmp/galea1/ray --session-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792 --runtime-env-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/runtime_resources --log-dir=/var/tmp/galea1/ray/session_2023-03-01_11-32-59_368536_3227792/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-03-01_11-32-59_368536_3227792 --gcs-address=pascal35:6379 --minimal --agent-id 1059961393

matthewdeng commented 1 year ago

Ah, could you set the head_node_ip as described in the SLURM documentation?

jjyao commented 1 year ago

Thanks @dangalea !

The --node_ip_address=pascal35 here is already wrong.

You mentioned that you used the following command to start the head node:

# Launch the head node
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2

What's the value of $1? Is it pascal35 or an ip?

dangalea commented 1 year ago

@jjyao, $1 is pascal35.

I have taken @matthewdeng's advice and reformulated my submission script. This is now:

#!/bin/bash

#SBATCH --job-name=ray
#SBATCH --output=ray.out
#SBATCH --error=ray.err
#SBATCH --time=24:00:00
#SBATCH --partition=pbatch
#SBATCH -A cbronze

### This script works for any number of nodes, Ray will find and manage all resources
#SBATCH --ntasks=4

### Give all resources to a single Ray task, ray can manage the resources internally
#SBATCH --ntasks-per-node=2
##SBATCH --gpus-per-task=2
###SBATCH --cpus-per-task=36

. /usr/workspace/galea1/anaconda3/etc/profile.d/conda.sh
conda activate tracking

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
  head_node_ip=${ADDR[1]}
else
  head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi

port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port \
    --block &

worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --block &
    sleep 5
done

python mnist.py --cuda

This solves my initial problem, but now any node that is not the head node is not being used by Ray. Would you know what the problem might be?

jjyao commented 1 year ago

but now any node which is not the head node is not being used by ray.

You mean the ray cluster only contains the head node but no worker nodes? How did you realize that? What's the output of ray status?

dangalea commented 1 year ago

Not quite. Ray is running on both the head node and the worker node. ray status on the head node returns this error:

(base) [galea1@pascal35:bin]$ ./ray status
Traceback (most recent call last):
  File "/usr/WS2/galea1/conda_envs/envs/tracking/bin/./ray", line 8, in <module>
    sys.exit(main())
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2422, in main
    return cli()
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1907, in status
    address = services.canonicalize_bootstrap_address_or_die(address)
  File "/usr/workspace/galea1/conda_envs/envs/tracking/lib/python3.10/site-packages/ray/_private/services.py", line 541, in canonicalize_bootstrap_address_or_die
    raise ConnectionError(
ConnectionError: Found multiple active Ray instances: {'192.168.128.34:6379', '192.168.128.34:62451'}. Please specify the one to connect to by setting the `--address` flag or `RAY_ADDRESS` environment variable.

ray status on the worker node returns:

(base) [galea1@pascal36:bin]$ ./ray status
======== Autoscaler status: 2023-03-01 13:42:49.214032 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_e3379008830a99ac39b8c8efe715e72ae8e1a21231b0fa969aac275e
 1 node_3656fe502fa56d0f5b12988ee02debc2bb3baf6834f154bcc08e08fe
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/144.0 CPU
 0.0/4.0 GPU
 0.0/2.0 accelerator_type:P100
 0.00/341.077 GiB memory
 0.00/150.167 GiB object_store_memory

Demands:
 (no resource demands)

but the worker node is not being used. Could this be a mismatch in ports?

jjyao commented 1 year ago

ConnectionError: Found multiple active Ray instances: {'192.168.128.34:6379', '192.168.128.34:62451'}. Please specify the one to connect to by setting the `--address` flag or `RAY_ADDRESS` environment variable.

You started multiple ray instances on the same head machine? Is it because you didn't clean up the old ones? Could you stop everything and restart the ray cluster?

dangalea commented 1 year ago

I have checked, and I do not have any stale instances. Could my script be starting two instances at the same time?

jjyao commented 1 year ago

I think I might know what the problem is:

In your Ray application, could you change ray.init() to ray.init(address="auto")? Currently there is a bug where calling ray.init() creates a new single-node cluster instead of connecting to the existing cluster.
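
In the reproduction script, that would be a one-line addition near the top of main(); a minimal sketch:

import ray

# Connect to the cluster started by `ray start` instead of creating a fresh
# single-node Ray instance for this driver ("auto" discovers the local cluster).
ray.init(address="auto")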

dangalea commented 1 year ago

That seems to work now. I did not have ray.init() before; when I add it and run ray status on either the head node or the worker node, I get the following:

(base) [galea1@pascal35:bin]$ ./ray status
======== Autoscaler status: 2023-03-01 16:47:37.843071 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_0bf2868539ecac836c61a352462452b953710b904e9f6b0b9d4b25c1
 1 node_9141273b0ca7727b27c33f7d427040275356cfc7a2ea43685c400099
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 8.0/144.0 CPU (8.0 used of 8.0 reserved in placement groups)
 4.0/4.0 GPU (4.0 used of 4.0 reserved in placement groups)
 0.0/2.0 accelerator_type:P100
 0.00/341.044 GiB memory
 0.00/150.153 GiB object_store_memory

Demands:
 (no resource demands)

However, I am still concerned that the accelerators are not really being used: the GPU utilisation rate is at 2% across all 4 GPUs.

jjyao commented 1 year ago

Glad to hear that it's working.

@matthewdeng @justinvyu could you take over from here for the GPU utilization issue?

matthewdeng commented 1 year ago

Awesome!

For the GPU issue, can you confirm what you're running?

From the original script, it looks like you are setting gpus_per_trial=0, but from the ray status output it looks like GPUs are reserved?

dangalea commented 1 year ago

Yes, I'd changed that to gpus_per_trial=1. From nvidia-smi, the GPUs' memory is being used, but the usage percentage seems too low:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   32C    P0    31W / 250W |    880MiB / 16384MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   29C    P0    32W / 250W |    880MiB / 16384MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    331262      C   ray::ImplicitFunc.train           878MiB |
|    1   N/A  N/A    331448      C   ray::ImplicitFunc.train           878MiB |
+-----------------------------------------------------------------------------+

matthewdeng commented 1 year ago

Okay, that's good to know. In that case I think the most likely reason for this is that the script is a bit of a "toy example" (if you ran the same PyTorch training code without Ray you would see similar GPU utilization).

Some potential ways to see higher GPU utilization are to increase the complexity of the model, or to increase batch size.
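
For example, one hypothetical tweak to the reproduction script is to make the batch size larger and tunable, and pass it through to the data loaders (a sketch, not a drop-in change; get_data_loaders() would need to accept it):

from ray import tune

# Hypothetical search space with a larger, tunable batch size; the hard-coded
# batch_size=64 in get_data_loaders() would read config["batch_size"] instead.
config = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "momentum": tune.uniform(0.1, 0.9),
    "batch_size": tune.choice([256, 512, 1024]),
}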

dangalea commented 1 year ago

Ok thanks for that, I'll try changing some parameters around. Thanks for all your guys' help.

mkvakic-srce commented 1 year ago

Hello everyone,

Sorry to be reopening an issue, but I'm running into the same error despite implementing the solution mentioned above...

When running an example very similar to the distributed ResNet50 PyTorch example on an HPC cluster with a Slingshot interconnect and the rayproject/ray-ml:2.3.1-py39-cu116 container, the Ray cluster exits with a 'NoneType' object has no attribute 'hex' error:

Failure # 1 (occurred at 2023-04-27_06-22-29)
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1544, in stop_trial
    self._callbacks.on_trial_complete(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/callback.py", line 360, in on_trial_complete
    callback.on_trial_complete(**info)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 731, in on_trial_complete
    self._sync_trial_dir(trial, force=True, wait=True)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 685, in _sync_trial_dir
    sync_process.wait()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 237, in wait
    raise exception
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/syncer.py", line 200, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 175, in _sync_dir_between_different_nodes
    num_cpus=0, **_force_on_node(target_node_id)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/util/node.py", line 35, in _force_on_node
    scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/scheduling_strategies.py", line 61, in __init__
    node_id = node_id.hex()
AttributeError: 'NoneType' object has no attribute 'hex'

This happens regardless of whether all Ray processes (at least the worker ones) are on the same node or not.

In my case, I set the IP address to the Slingshot IP on the head node with the following lines:

head_ip_address=$( ip -f inet addr show hsn0 | egrep -m 1 -o 'inet [0-9.]{1,}' | sed 's/inet //' )
ray start --head --node-ip-address=$head_ip_addres ...

The cluster is verified by ssh'ing to the assigned nodes and running ray status, which returns the following on all nodes (for example):

======== Autoscaler status: 2023-04-27 05:08:04.778682 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_446af567b5feda490ed5d1a2a44ae59ac51308e9620c1f7283314593
 1 node_88ecddf9c244790b85d086375bad4e8871549d0fdbe65e7ad4343233
 1 node_ca33a52732fc431c596842d423f90357814a0f07186ecde9a556af9f
 1 node_94c571d20ed87dbd7e64191d2846be42368225da0a1ccf296bd768bd
 1 node_bbd25329f4a4023e37ec20e15d8a3c7dde89421c4bd55f6180eee808
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/35.0 CPU
 0.0/4.0 GPU
 0.0/5.0 accelerator_type:A100
 0.00/1729.404 GiB memory
 0.00/745.165 GiB object_store_memory

Demands:
 (no resource demands)

Running the example via python3 my_script.py, the script seems to be training on multiple GPUs, according to nvidia-smi:

Thu Apr 27 15:22:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   45C    P0   260W / 400W |   5409MiB / 40960MiB |     46%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   43C    P0    96W / 400W |   5385MiB / 40960MiB |     38%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   46C    P0    93W / 400W |   5409MiB / 40960MiB |     54%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   47C    P0   245W / 400W |   5385MiB / 40960MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3756461      C   ...._RayTrainWorker__execute     5406MiB |
|    1   N/A  N/A   3756462      C   ...._RayTrainWorker__execute     5382MiB |
|    2   N/A  N/A   3756460      C   ...._RayTrainWorker__execute     5406MiB |
|    3   N/A  N/A   3756459      C   ...._RayTrainWorker__execute     5382MiB |
+-----------------------------------------------------------------------------+

But only until the first epoch is finished, when it exits with:

...
== Status ==
Current time: 2023-04-27 06:22:21 (running for 00:01:12.14)
Memory usage on this node: 28.0/502.7 GiB 
Using FIFO scheduling algorithm.
Resources requested: 33.0/35 CPUs, 4.0/4 GPUs, 0.0/1715.33 GiB heap, 0.0/739.13 GiB objects (0.0/5.0 accelerator_type:A100)
Result logdir: ${RAY_TMPDIR}/ray_results/TorchTrainer_2023-04-27_06-21-09
Number of trials: 1/1 (1 RUNNING)
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
| Trial name               | status   | loc                 |   iter |   total time (s) |    loss |   _timestamp |   _time_this_iter_s |
|--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------|
| TorchTrainer_5f9b6_00000 | RUNNING  | 10.150.0.31:3756296 |      9 |          68.0712 | 2.33203 |   1682601741 |              4.8599 |
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
...
== Status ==
Current time: 2023-04-27 06:22:29 (running for 00:01:19.93)
Memory usage on this node: 27.1/502.7 GiB 
Using FIFO scheduling algorithm.
Resources requested: 0/35 CPUs, 0/4 GPUs, 0.0/1715.33 GiB heap, 0.0/739.13 GiB objects (0.0/5.0 accelerator_type:A100)
Result logdir: ${RAY_TMPDIR}/ray_results/TorchTrainer_2023-04-27_06-21-09
Number of trials: 1/1 (1 ERROR)
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
| Trial name               | status   | loc                 |   iter |   total time (s) |    loss |   _timestamp |   _time_this_iter_s |
|--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------|
| TorchTrainer_5f9b6_00000 | ERROR    | 10.150.0.31:3756296 |     10 |          73.0525 | 2.33749 |   1682601746 |             4.91388 |
+--------------------------+----------+---------------------+--------+------------------+---------+--------------+---------------------+
Number of errored trials: 1
+--------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------+
| Trial name               |   # failures | error file                                                                                                                 |
|--------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------|
| TorchTrainer_5f9b6_00000 |            1 | ${RAY_TMPDIR}/ray_results/TorchTrainer_2023-04-27_06-21-09/TorchTrainer_5f9b6_00000_0_2023-04-27_06-21-10/error.txt |
+--------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------+

The Ray cluster is brought up using the PBS Pro mpiexec command, which (depending on the MPI rank) starts the head, worker and submit jobs; the process tree looks like:

    PID TTY      STAT   TIME COMMAND
3757327 ?        S      0:00 sshd: my_username@notty
3757328 ?        Rs     0:00  \_ ps f -u my_username
3752957 ?        Ss     0:00 -bash
3753025 ?        S      0:00  \_ /bin/bash /var/spool/pbs/mom_priv/jobs/7963.login-node.SC
3753050 ?        S      0:00      \_ /bin/bash ./ray-launcher.sh pytorch.py -n 4 --use-gpu True
3753066 ?        Sl     0:00          \_ mpiexec -np 6 ./ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753069 ?        Ss     0:00 /usr/sbin/palsd
3753072 ?        S      0:00  \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753077 ?        S      0:00  |   \_ /bin/bash ./ray-head.sh
3753108 ?        Sl     0:00  |       \_ Apptainer runtime parent
3753120 ?        Sl     0:00  |           \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --head --block --include-dashboard False --port=34438 --node-ip-address=10.150.0.31 --node-manager-port=43925 --object-manager-port=33464 --ray-client-server-port=45596 --redis-shard-ports= --min-worker-port=50014 --max-worker-port=50114 --log-style=record --num-gpus=0 --num-cpus 3
3753144 ?        Sl     0:01  |           |   \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --config_list=eyJvYmplY3Rfc3BpbGxpbmdfY29uZmlnIjogIntcInR5cGVcIjogXCJmaWxlc3lzdGVtXCIsIFwicGFyYW1zXCI6IHtcImRpcmVjdG9yeV9wYXRoXCI6IFwiL2x1c3RyZS9ob21lL21rdmFraWMvcmF5L3Nlc3Npb25fMjAyMy0wNC0yN18wNi0yMC0zN184MjMzMzlfMzc1MzEyMFwifX0iLCAiaXNfZXh0ZXJuYWxfc3RvcmFnZV90eXBlX2ZzIjogdHJ1ZX0= --gcs_server_port=34438 --metrics-agent-port=46221 --node-ip-address=10.150.0.31 --session-name=session_2023-04-27_06-20-37_823339_3753120
3753292 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --monitor-ip=10.150.0.31
3753365 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/bin/python -m ray.util.client.server --address=10.150.0.31:34438 --host=0.0.0.0 --port=45596 --mode=proxy --metrics-agent-port=46221
3753438 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/dashboard.py --host=localhost --port=8265 --port-retries=0 --temp-dir=${RAY_TMPDIR}/ray --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --modules-to-load=UsageStatsHead --disable-frontend
3753656 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --object_manager_port=33464 --min_worker_port=50014 --max_worker_port=50114 --node_manager_port=43925 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=3 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,3,memory,360757881447,object_store_memory,158896234905 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=46221 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=46221 --metrics_export_port=39641 --object_store_memory=158896234905 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=39641 --dashboard-agent-port=46221 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --node-name=10.150.0.31
3753908 ?        Sl     0:01  |           |   |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=39641 --dashboard-agent-port=46221 --listen-port=52365 --node-manager-port=43925 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756296 ?        SNl    0:02  |           |   |   \_ ray::_Inner.train
3753729 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3753131 ?        Sl     0:02  |           \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753073 ?        S      0:00  \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753078 ?        S      0:00  |   \_ /bin/bash ./ray-submit.sh pytorch.py -n 4 --use-gpu True
3756060 ?        Sl     0:00  |       \_ Apptainer runtime parent
3756075 ?        Sl     0:02  |           \_ /home/ray/anaconda3/bin/python3 pytorch.py -n 4 --use-gpu True
3756086 ?        Sl     0:01  |           \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753074 ?        S      0:00  \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753080 ?        S      0:00  |   \_ /bin/bash ./ray-worker.sh
3754116 ?        Sl     0:00  |       \_ Apptainer runtime parent
3754130 ?        Sl     0:00  |           \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3754214 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,370563281716,object_store_memory,158812835020 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=64793 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=64793 --metrics_export_port=61063 --object_store_memory=158812835020 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=61063 --dashboard-agent-port=64793 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3754427 ?        Sl     0:02  |           |   |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=61063 --dashboard-agent-port=64793 --listen-port=52365 --node-manager-port=43831 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.1 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.1 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756462 ?        SNl    0:34  |           |   |   \_ ray::RayTrainWorker._RayTrainWorker__execute
3754287 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3754141 ?        Sl     0:08  |           \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753075 ?        S      0:00  \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753081 ?        S      0:00  |   \_ /bin/bash ./ray-worker.sh
3754578 ?        Sl     0:00  |       \_ Apptainer runtime parent
3754593 ?        Sl     0:00  |           \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3754659 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,370365912269,object_store_memory,158728248115 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=59363 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=59363 --metrics_export_port=48326 --object_store_memory=158728248115 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=48326 --dashboard-agent-port=59363 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3754872 ?        Sl     0:02  |           |   |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=48326 --dashboard-agent-port=59363 --listen-port=52365 --node-manager-port=36567 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.2 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.2 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756459 ?        SNl    0:34  |           |   |   \_ ray::RayTrainWorker._RayTrainWorker__execute
3754732 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3754604 ?        Sl     0:08  |           \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753076 ?        S      0:00  \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753087 ?        S      0:00  |   \_ /bin/bash ./ray-worker.sh
3755024 ?        Sl     0:00  |       \_ Apptainer runtime parent
3755039 ?        Sl     0:00  |           \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3755105 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,370169099060,object_store_memory,158643899596 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=59054 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=59054 --metrics_export_port=62448 --object_store_memory=158643899596 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=62448 --dashboard-agent-port=59054 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3755318 ?        Sl     0:02  |           |   |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=62448 --dashboard-agent-port=59054 --listen-port=52365 --node-manager-port=43811 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.3 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.3 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756461 ?        SNl    0:34  |           |   |   \_ ray::RayTrainWorker._RayTrainWorker__execute
3755178 ?        Sl     0:00  |           |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3755050 ?        Sl     0:08  |           \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3753079 ?        S      0:00  \_ /bin/bash /var/run/palsd/36ee5525-48fe-4c7e-b97c-042aabc4f466/files/ray-cluster.sh pytorch.py -n 4 --use-gpu True
3753088 ?        S      0:00      \_ /bin/bash ./ray-worker.sh
3755483 ?        Sl     0:00          \_ Apptainer runtime parent
3755498 ?        Sl     0:00              \_ /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --block --num-cpus 8 --log-style=record --address 10.150.0.31:34438
3755564 ?        Sl     0:00              |   \_ /home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.150.0.31 --maximum_startup_concurrency=8 --static_resource_list=node:10.150.0.31,1.0,accelerator_type:A100,1,CPU,8,GPU,1,memory,369969616487,object_store_memory,158558407065 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.150.0.31 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --redis-address=None --temp-dir=${RAY_TMPDIR}/ray --metrics-agent-port=64035 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --temp-dir=${RAY_TMPDIR}/ray --webui= --storage=None RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --ray_raylet_socket_name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.150.0.31:34438 --ray_redis_password= --ray_session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --ray_logs_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --ray_node_ip_address=10.150.0.31 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --temp_dir=${RAY_TMPDIR}/ray --session_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --log_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --resource_dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --metrics-agent-port=64035 --metrics_export_port=41579 --object_store_memory=158558407065 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.150.0.31:34438 --session-name=session_2023-04-27_06-20-37_823339_3753120 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=41579 --dashboard-agent-port=64035 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources 
--log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438
3755906 ?        Sl     0:02              |   |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.150.0.31 --metrics-export-port=41579 --dashboard-agent-port=64035 --listen-port=52365 --node-manager-port=33413 --object-store-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/plasma_store.4 --raylet-name=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/sockets/raylet.4 --temp-dir=${RAY_TMPDIR}/ray --session-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120 --runtime-env-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/runtime_resources --log-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2023-04-27_06-20-37_823339_3753120 --gcs-address=10.150.0.31:34438 --agent-id 424238335
3756201 ?        SNl    0:00              |   |   \_ ray::IDLE
3756460 ?        SNl    0:34              |   |   \_ ray::RayTrainWorker._RayTrainWorker__execute
3755637 ?        Sl     0:00              |   \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/log_monitor.py --logs-dir=${RAY_TMPDIR}/ray/session_2023-04-27_06-20-37_823339_3753120/logs --gcs-address=10.150.0.31:34438 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5
3755509 ?        Sl     0:08              \_ /usr/libexec/apptainer/bin/squashfuse_ll -f -o uid=2010,gid=2001,offset=45056 /proc/self/fd/3 /var/apptainer/mnt/session/rootfs
3755735 ?        Ss     0:00 /usr/lib/systemd/systemd --user
3755737 ?        S      0:00  \_ (sd-pam)

UPDATE - The problem seems not to occur if the first network interface is used. But that is not an option, as it is reserved for system services and has far lower bandwidth and higher latency.

UPDATE^2 - For anyone who stumbles upon the same issue: the problem was fixed by initializing Ray with:

...
# Tell Ray explicitly which node IP to use, instead of letting it pick an interface.
ray.init(address='auto', _node_ip_address=os.environ['NODE_IP_ADDRESS'])
...

Where NODE_IP_ADDRESS corresponds to the IP address of the network interface being used (in my case hsn0).
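
For reference, here is a minimal sketch of how that environment variable could be populated programmatically on Linux, assuming the high-speed interface is named hsn0 as above. The helper get_interface_ip is hypothetical (not part of Ray); it uses the standard SIOCGIFADDR ioctl recipe to look up the IPv4 address bound to an interface, then passes it to ray.init via the same _node_ip_address argument mentioned in the update.

```python
import fcntl
import os
import socket
import struct

import ray


def get_interface_ip(ifname: str) -> str:
    """Return the IPv4 address bound to a Linux network interface (e.g. hsn0).

    Hypothetical helper, not a Ray API: uses the SIOCGIFADDR ioctl (0x8915),
    which only works on Linux and for interfaces that have an IPv4 address.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    packed = fcntl.ioctl(
        s.fileno(),
        0x8915,  # SIOCGIFADDR
        struct.pack("256s", ifname[:15].encode()),  # interface name, max 15 bytes
    )
    return socket.inet_ntoa(packed[20:24])


# Export the high-speed interface's IP and hand it to Ray so the driver and
# workers agree on the node IP (assumes the interface name is hsn0).
os.environ["NODE_IP_ADDRESS"] = get_interface_ip("hsn0")
ray.init(address="auto", _node_ip_address=os.environ["NODE_IP_ADDRESS"])
```

Equivalently, NODE_IP_ADDRESS can be exported in the SLURM batch script (e.g. derived from `ip -4 addr show hsn0`) before the Python process starts; the key point is that the same IP is used both when starting the Ray nodes and when calling ray.init.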