minyang-chen / multi-nodes-slurm-cluster-docker

fully dockerized distributed multi-nodes slurm cluster - ubuntu 20.04
Apache License 2.0

unable to resolve host slurmmaster: Name or service not known #1

Open Gaopeng-Bai opened 1 month ago

Gaopeng-Bai commented 1 month ago

I am testing on WSL with modified slurm.conf and gres.conf configuration files, using only one node with a GPU. On the WSL system, I edited /etc/hosts following the format of the hosts file in the repository. Then I ran steps 1 to 4. Finally, when running ./register_cluster.sh, I encountered the error:

"no configuration file provided: not found."

I checked the slurmmaster logs and found errors there as well. They show:

`sudo: unable to resolve host slurmmaster: Name or service not known`
`sudo: unable to resolve host slurmmaster: Temporary failure in name resolution`

Can you help me with how to successfully run this test?

minyang-chen commented 1 month ago

Hi, can you try the following:

  1. Connect to the accounting node (the slurmdbd Docker instance, using docker-compose or docker exec), then try `ping slurmmaster`. Hopefully this will tell us whether the master is reachable, or how name resolution is being done.

  2. Once you're at the slurmdbd shell, you can run the cluster registration command directly: `sacctmgr --immediate add cluster name=clusterlab`

If it still doesn't work, please share more details about your distributed test environment. Thanks.
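
For reference, the two steps above could be scripted roughly as follows. This is only a sketch under some assumptions: the container name `slurmdbd` is a guess (check `docker ps` for the real one), the helper names `in_dbd`, `check_master` and `register_cluster` are made up for illustration, and `ping` and `getent` are assumed to be installed inside the container.

```shell
#!/usr/bin/env bash
# Assumption: the accounting container is named "slurmdbd".
# Override with DBD_CONTAINER=<name> if `docker ps` shows something else.
DBD_CONTAINER="${DBD_CONTAINER:-slurmdbd}"

# Run a command inside the accounting container.
in_dbd() {
  docker exec "$DBD_CONTAINER" "$@"
}

# Step 1: check that slurmmaster resolves and answers from the accounting node.
check_master() {
  in_dbd getent hosts slurmmaster || {
    echo "slurmmaster does not resolve inside $DBD_CONTAINER" >&2
    return 1
  }
  in_dbd ping -c 3 slurmmaster || {
    echo "slurmmaster resolves but is not reachable from $DBD_CONTAINER" >&2
    return 1
  }
}

# Step 2: register the cluster directly, using the command from step 2 above.
register_cluster() {
  in_dbd sacctmgr --immediate add cluster name=clusterlab
}
```

After sourcing the file, `check_master && register_cluster` runs both steps and stops early if the master cannot be resolved or reached, which separates a name-resolution problem from a registration problem.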