Troubleshooting Slurm Workload Manager Deployment Steps

BiologyGeek commented 3 months ago

Hello team,

After resolving the Azure CycleCloud connectivity issue with subscription resources (https://github.com/microsoft/AzureTRE/issues/3933), I am seeing a new problem when attempting to run the Scheduler VM. Here is the error message:

Hint: During the Slurm setup, this Subnet ID was selected: mytre: vnet-mytre-SharedSubnet [10.1.1.0/24]

I am wondering what could be the root cause of this issue?

Danny-Cooke-CK commented 3 months ago

Hi @BiologyGeek, this still looks like a network connection issue. "urllib2.URLError urlopen error errno 104 socket closed" typically corresponds to "connection reset by peer".

The first thing to check is the firewall: the request is trying to reach a URL, so is that URL allowed in the firewall to begin with? After that I would look at NSGs and routing, but I think this will be firewall related.
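
One way to narrow this down from a node is to request the failing URL with curl and look at the exit code. The sketch below is only a rough aid: the exit codes are curl's documented ones, but the hints attached to them are assumptions about a typical firewalled network, and `probe_url` is a hypothetical helper name, not part of AzureTRE or CycleCloud.

```shell
#!/bin/sh
# Map a curl exit code to a rough hint about where the request failed.
# The codes are curl's documented exit codes; the hints are assumptions.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok: request completed" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "connect: could not connect (check NSGs and routing)" ;;
        28) echo "timeout: request timed out (traffic may be dropped silently)" ;;
        35) echo "tls: TLS handshake failed (connection may have been reset in transit)" ;;
        56) echo "reset: failure receiving data (for example, connection reset by peer)" ;;
        *)  echo "other: curl exit code $1" ;;
    esac
}

# Request a URL (hypothetical helper) and print the hint for curl's exit code.
probe_url() {
    curl --silent --show-error --output /dev/null --max-time 10 "$1"
    explain_curl_exit "$?"
}
```

Running `probe_url` against the URL from the error message, from the scheduler node or another VM in the same subnet, should show whether the failure is in name resolution, in connecting, or in a reset after the connection was made.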

BiologyGeek commented 3 months ago

Thank you @Danny-Cooke-CK! I will go through checking these items, but my quick question is: Did I select the correct subnet ID during the selection process? It was vnet-mytre-SharedSubnet [10.1.1.0/24].

marrobi commented 3 months ago

Ah, see https://microsoft.github.io/AzureTRE/latest/tre-templates/shared-services/cyclecloud/#create-a-cluster

Select your required settings. In the Subnet ID box, choose the ServicesSubnet in the resource group and virtual network containing the 4 digit workspace ID. Click Next.

tim-allen-ck commented 2 months ago

Like Marcus said, try using the {tre}-ws-{id}-ServicesSubnet. Did you manage to get it working?
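
As a quick sanity check on the selected subnet, a name can be matched against the pattern quoted above. This is a minimal sketch that assumes the name shown in the Subnet ID box contains `-ws-` followed by the workspace ID and `-ServicesSubnet`; the function name and the example workspace ID `1a2b` are hypothetical.

```shell
#!/bin/sh
# Succeed only if the subnet name looks like a workspace ServicesSubnet,
# i.e. it matches the {tre}-ws-{id}-ServicesSubnet pattern (assumed format).
is_workspace_services_subnet() {
    case "$1" in
        *-ws-*-ServicesSubnet*) return 0 ;;
        *) return 1 ;;
    esac
}

for name in "mytre-ws-1a2b-ServicesSubnet" "vnet-mytre-SharedSubnet [10.1.1.0/24]"; do
    if is_workspace_services_subnet "$name"; then
        echo "$name: workspace ServicesSubnet"
    else
        echo "$name: not a workspace ServicesSubnet"
    fi
done
```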

BiologyGeek commented 2 months ago

Thank you @marrobi for highlighting this point! I made another attempt, but the exact same error occurred.

Additional description: Here is what I did after deleting the previous Azure CycleCloud shared service and creating a new one from the TRE UI. I followed these steps carefully:

1. Provided user details, including the SSH key, when creating the CycleCloud server instance.
2. Selected the same region as my TRE deployment, left the resource group as the default "", and selected the storage account beginning with "stgcc".
3. In the Subnet ID box, chose the {tre}-ws-{id}-ServicesSubnet.
4. Under advanced settings, unchecked "Return Proxy" and "Public Head node".
5. In the cloud-init section, pasted the provided script with modified variables:
   ```
   TRE_ID="mytre"
   REGION="eastus"
   ```
   Note 1: I pasted the script into all tabs of the cloud-init section (scheduler, dynamic, hpc, htc, login, scheduler-ha). Is that correct?
   Note 2: I did not remove the quotes ("") after TRE_ID= and REGION=. Should I write it as TRE_ID=mytre instead?
6. Added a second user to the cluster with 'node access permission' and the same SSH public key as used when creating the CycleCloud server instance.
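
On the quoting question in Note 2: in a POSIX shell the two forms assign the same value, so the quotes can stay. The short sketch below illustrates this with the values from the steps above; quotes only start to matter when a value contains spaces or characters that are special to the shell.

```shell
#!/bin/sh
# Assign the same value with and without quotes and compare the results.
TRE_ID="mytre"
quoted=$TRE_ID

TRE_ID=mytre
unquoted=$TRE_ID

if [ "$quoted" = "$unquoted" ]; then
    echo "same value: $quoted"
fi
```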

Question: What methods can help diagnose this issue?

Thank you @tim-allen-ck! I tried, but the same error messages were observed. Additionally, I still can't use Azure Bastion and see the same issue (https://github.com/microsoft/AzureTRE/issues/3933#issuecomment-2132414705) when attempting to use Bastion. I have been using a virtual machine within TRE and a browser with a private IP address to open the Azure CycleCloud page. Could this be a sign that something went wrong?

Could you please check whether you can set up Slurm and verify that a node can spin up, or am I the only one experiencing this issue?

PoojanumN commented 2 months ago

@tim-allen-ck can you please look into this?

tim-allen-ck commented 2 months ago

Hi @BiologyGeek, I'll have a look and give you an update by the end of the week.

BiologyGeek commented 2 months ago

Thank you so much @PoojanumN and @tim-allen-ck! In the meantime, do you have any advice or tests that might help diagnose the issue?

tim-allen-ck commented 2 months ago

Hi @BiologyGeek, I've managed to get a node deployed through much trial and error. Here is what I did:

* Changed the VM types to ones available in the quotas for my subscription.
* Enabled `Allow Blob anonymous access` on the storage account used by the cluster.
* Created the `cyclecloud` container with `Container` access level in the storage account used by the cluster.
* For testing, opened up the firewall to allow everything in and out so that Slurm could deploy.

Please give those a try and see how they work for you.
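
The two storage changes above can also be made from the Azure CLI instead of the portal. The following is a sketch under assumptions: `stgccexample` and `rg-example` are placeholder names, `open_cluster_storage` and `run` are hypothetical helpers, and the script only prints the commands unless `DRY_RUN=0` is set, so it can be reviewed before anything is changed.

```shell
#!/bin/sh
# Print the command when DRY_RUN is unset or 1; execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: storage account name, $2: resource group (both placeholders here).
open_cluster_storage() {
    # Allow anonymous (public) blob access on the storage account.
    run az storage account update --name "$1" --resource-group "$2" \
        --allow-blob-public-access true
    # Create the cyclecloud container with Container-level public access.
    run az storage container create --name cyclecloud --account-name "$1" \
        --public-access container
}

open_cluster_storage stgccexample rg-example
```

Both settings widen access to the storage account, so they are best treated as test steps in the same spirit as the temporary firewall change.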

BiologyGeek commented 2 months ago

Thank you @tim-allen-ck!

For me, the stgcc{tre-id}{4 random characters} storage account has 'Blob anonymous access' set to Enabled by default.
Also, I've changed the cyclecloud container access level to Container access level.
Then added some new rules with these values:

Now, when clicking on 'Start', the nodes status changes to 'acquiring', then 'preparing':

However, the final status of the scheduler node gets stuck on 'Error configuring software':

Here is the error message:

Question: Did I set up the firewall correctly, or is another method recommended?

tim-allen-ck commented 2 months ago

Hi @BiologyGeek, it looks like you have set up the firewall correctly. I created a PR with the updated Terraform code for the firewall: #4040. What size VMs are you deploying Slurm with?

BiologyGeek commented 2 months ago

Hi @tim-allen-ck, thank you!

This is the list of VM sizes I'm using for deploying Slurm:

* Azure CycleCloud VM (deployed by TRE itself within Azure): Standard_DS3_v2
* Scheduler VM Type: Standard_D4ads_v5
* Login Node VM Type: Standard_D8as_v4
* HPC VM Type: Standard_F2s_v2
* HTC VM Type: Standard_F2s_v2
* Dyn VM Type: Standard_F2s_v2

marrobi commented 2 months ago

I've been looking into this for another scenario, and I am seeing outbound traffic blocked to locations such as azcopyvnext.azureedge.net. It looks like something has changed so that the cluster now has external dependencies. Opening this (azcopy) up in the firewall is obviously a data exfiltration risk.

Will have a dig and see if I can work out a way around it.
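
To see which external hosts a cluster is trying to reach, the URLs from failed requests can be reduced to a unique list of hostnames. This is a minimal sketch with hypothetical helper names; it assumes one URL per line and does not handle URLs that embed credentials (`user@host`).

```shell
#!/bin/sh
# Strip the scheme, then everything from the first '/', ':', '?' or '#'.
host_of_url() {
    printf '%s\n' "$1" |
        sed -E -e 's,^[A-Za-z][A-Za-z0-9+.-]*://,,' -e 's,[/:?#].*$,,'
}

# Read URLs on stdin (one per line) and print each distinct host once.
unique_hosts() {
    while IFS= read -r url; do
        if [ -n "$url" ]; then
            host_of_url "$url"
        fi
    done | sort -u
}
```

For example, piping the blocked URLs from the firewall or node logs into `unique_hosts` gives a short list of FQDNs to review, which is easier to reason about than allowing all outbound traffic.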

marrobi commented 2 months ago

Reproduced the error on the cluster itself:

BiologyGeek commented 2 months ago

Thank you @marrobi! Even after allowing all connections through the firewall, I am still experiencing some issues. Could you please take a look and confirm whether the method used here https://github.com/microsoft/AzureTRE/issues/4021#issuecomment-2238980844 is the right way to allow all connections and rule out the firewall? Or is there another method you would suggest?

marrobi commented 2 months ago

Just sharing my work in progress. I'm getting further using this as the init script, but I'm still working through issues:

```
#!/bin/sh
TRE_ID="mrtredemo2"
REGION="westeurope"

# Point the AlmaLinux repos at the TRE Nexus proxy instead of the public mirrors
ls /etc/yum.repos.d/*.repo | xargs sed -i 's/mirrorlist/# mirrorlist/g'
ls /etc/yum.repos.d/*.repo | xargs sed -i "s,# baseurl=https://repo.almalinux.org/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/,g"

# Install EPEL, then point its repos at the Nexus proxy as well
yum -y install epel-release
ls /etc/yum.repos.d/*.repo | xargs sed -i 's/metalink/# metalink/g'
ls /etc/yum.repos.d/*.repo | xargs sed -i "s,#baseurl=https://download.example/pub/epel/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/fedoraproject/pub/epel/,g"

yum -y install python3 python3-pip

# Configure pip to use the Nexus PyPI proxy
sudo tee /etc/pip.conf <<EOF
[global]
index = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/pypi/pypi
index-url = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/pypi/simple
trusted-host = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com
EOF

# Add the CycleCloud yum repository via Nexus (tee, so the write itself runs under sudo)
sudo tee /etc/yum.repos.d/cyclecloud.repo <<EOF
[cyclecloud]
name=cyclecloud
baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/microsoft-yumrepos/cyclecloud
gpgcheck=1
gpgkey=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/microsoft-keys/microsoft.asc
EOF

rpm --import https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/almalinux/RPM-GPG-KEY-AlmaLinux
```
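
To see what the repo rewrite does before running it on a node, the same substitutions can be tried on a temporary sample file. The sketch below is an illustration only: the sample content is an assumed excerpt in the style of an AlmaLinux repo file, `rewrite_almalinux_repo` is a hypothetical helper, and the patterns are anchored at the start of the line and written to stdout rather than applied in place with `sed -i`.

```shell
#!/bin/sh
# $1: repo file, $2: TRE ID, $3: region. Prints the rewritten file to stdout.
rewrite_almalinux_repo() {
    nexus_base="https://nexus-$2.$3.cloudapp.azure.com/repository/almalinux/"
    sed -e 's/^mirrorlist/# mirrorlist/' \
        -e "s,^# baseurl=https://repo.almalinux.org/,baseurl=$nexus_base," \
        "$1"
}

# Assumed sample content; a real repo file has more sections and keys.
sample_repo=$(mktemp)
cat > "$sample_repo" <<'EOF'
[baseos]
name=AlmaLinux $releasever - BaseOS
mirrorlist=https://mirrors.almalinux.org/mirrorlist/$releasever/baseos
# baseurl=https://repo.almalinux.org/almalinux/$releasever/BaseOS/$basearch/os/
EOF

rewrite_almalinux_repo "$sample_repo" mytre eastus
rm -f "$sample_repo"
```

After the rewrite, the `mirrorlist` line is commented out and the `baseurl` line points at the Nexus host built from `TRE_ID` and `REGION`, so a typo in either variable shows up immediately in the printed URL.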

marrobi commented 2 months ago

Can confirm that works:

Will try to get a PR with this in tomorrow.

marrobi commented 2 months ago

@BiologyGeek I've done this from scratch today and it seemed to work fine. Can you try the new bundle in the PR and follow the docs steps, including the updated init script?

If all works, then @tim-allen-ck, I suggest we get this merged, as it closes off a number of issues.

Thanks.

microsoft / AzureTRE

Troubleshooting Slurm Workload Manager Deployment Steps #4021