microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Troubleshooting Slurm Workload Manager Deployment Steps #4021

Open BiologyGeek opened 3 months ago

BiologyGeek commented 3 months ago

Hello team,

After resolving the Azure CycleCloud connectivity issue with subscription resources (https://github.com/microsoft/AzureTRE/issues/3933), I am seeing a new problem when attempting to start the scheduler VM. Here is the error message:

image

image

Hint: During the Slurm setup, this Subnet ID was selected: mytre: vnet-mytre-SharedSubnet [10.1.1.0/24]

image

I am wondering what could be the root cause of this issue?

Danny-Cooke-CK commented 3 months ago

Hi @BiologyGeek, this still looks like a network connection issue. The error "urllib2.URLError urlopen error errno 104 socket closed" typically means "connection reset by peer".

The first thing to check is the firewall: the request is trying to reach a URL, so is that URL allowed in the firewall to start with? After that I would look at NSGs and routing, but I think this will be firewall related.
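One way to run that check from the affected node is sketched below. It is a minimal sketch, not part of AzureTRE or CycleCloud: `explain_curl_exit` and `probe_url` are hypothetical helper names, the hints are rough rules of thumb rather than a diagnosis, and it assumes `curl` is installed on the node. The exit codes themselves (6, 7, 28, 35, 56) are standard curl exit codes.

```shell
#!/bin/sh
# Hypothetical helper: map a curl exit code to a hint about where to look.
# The hints are rules of thumb, not a definitive diagnosis.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: request completed" ;;
    6)  echo "dns: could not resolve host; check DNS settings" ;;
    7)  echo "connect: could not connect; check NSGs and routing" ;;
    28) echo "timeout: no response; check firewall rules and routing" ;;
    35) echo "tls: handshake failed; a firewall may be dropping the connection" ;;
    56) echo "reset: connection reset while receiving; check firewall rules" ;;
    *)  echo "other: curl exit code $1; see the curl man page" ;;
  esac
}

# Hypothetical helper: probe one URL and print the matching hint.
# Assumes curl is available on the node.
probe_url() {
  curl --silent --show-error --output /dev/null --max-time 10 "$1"
  explain_curl_exit "$?"
}
```

For example, calling `probe_url` with the URL from the error message prints one hint line. A `reset` or `timeout` result would be consistent with the firewall theory, while `dns` or `connect` points more towards DNS, NSGs or routing.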

BiologyGeek commented 3 months ago

Thank you @Danny-Cooke-CK! I will go through checking these items, but my quick question is: Did I select the correct subnet ID during the selection process? It was vnet-mytre-SharedSubnet [10.1.1.0/24].

marrobi commented 3 months ago

Ah, see https://microsoft.github.io/AzureTRE/latest/tre-templates/shared-services/cyclecloud/#create-a-cluster

Select your required settings. In the Subnet ID box, choose the ServicesSubnet in the resource group and virtual network containing the 4 digit workspace ID. Click Next.

tim-allen-ck commented 2 months ago

Like Marcus said, try using the {tre}-ws-{id}-ServicesSubnet. Did you manage to get it working?

BiologyGeek commented 2 months ago

Thank you @marrobi for highlighting this point! I made another attempt, but the exact same error occurred.

Additional description: Here is what I did after deleting the previous Azure CycleCloud shared service and creating a new one from the TRE UI. I followed these steps carefully:

  1. Provided user details, including the SSH key, when creating the CycleCloud server instance.

  2. Selected the same region as my TRE deployment, left the resource group as the default "", and selected the storage account beginning with "stgcc".

  3. In the Subnet ID box, chose the {tre}-ws-{id}-ServicesSubnet.

  4. Under advanced settings, unchecked "Return Proxy" and "Public Head node".

  5. In the cloud-init section, I pasted the provided script with modified variables:

    TRE_ID="mytre"
    REGION="eastus"

    Note 1: I pasted the script in all tabs of the cloud-init section (scheduler, dynamic, hpc, htc, login, scheduler-ha). Is that the correct action?

    Note 2: I did not remove the quotes ("") after TRE_ID= and REGION=. Should I write it like this instead: TRE_ID=mytre?

  6. Added a second user to the cluster with 'node access permission' and the same SSH public key as used when creating the CycleCloud server instance.

Question: What methods can help diagnose this issue?
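On Note 2 in step 5: in a POSIX shell the quotes are only syntax, so `TRE_ID="mytre"` and `TRE_ID=mytre` store the same value, and either form should be fine for values without spaces or special characters. Here is a small sketch to confirm this. `UNQUOTED_TRE_ID` is a made-up name used only for the comparison, and the hostname pattern follows the init script shared later in this thread.

```shell
#!/bin/sh
# Both forms assign the same five-character string; the quotes are shell
# syntax and are not stored in the variable.
TRE_ID="mytre"
UNQUOTED_TRE_ID=mytre   # hypothetical second variable, for comparison only
REGION="eastus"

[ "$TRE_ID" = "$UNQUOTED_TRE_ID" ] && echo "same value: $TRE_ID"

# Either way, the value is interpolated into the Nexus URL identically.
echo "https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/pypi/simple"
```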


Thank you @tim-allen-ck! I tried, but the same error messages were observed. Additionally, I still can't use Azure Bastion and see the same issue (https://github.com/microsoft/AzureTRE/issues/3933#issuecomment-2132414705) when attempting to use Bastion. I have been using a virtual machine within TRE and a browser with a private IP address to open the Azure CycleCloud page. Could this be a sign that something went wrong?


Guys, could you please check whether you can set up Slurm and verify that the nodes spin up, or am I the only one experiencing this issue?

PoojanumN commented 2 months ago

@tim-allen-ck can you please look into this?

tim-allen-ck commented 2 months ago

Hi @BiologyGeek, I'll have a look and give you an update by the end of the week.

BiologyGeek commented 2 months ago

Thank you so much @PoojanumN and @tim-allen-ck! In the meantime, do you have any advice or tests that might help diagnose the issue?

tim-allen-ck commented 2 months ago

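Hi @BiologyGeek, I've managed to get a node deployed through much trial and error. There were a few things I did:

* Changed the VM types to ones available in the quotas for my subscription.

* Enabled `Allow Blob anonymous access` on the storage account used by the cluster.

* Created the `cyclecloud` container with `Container` access level in the storage account used by the cluster.

* For testing, opened up the firewall to allow everything in and out so that Slurm could deploy.

Please give those a try and see how they work for you.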

BiologyGeek commented 2 months ago

Hi @BiologyGeek I've managed to get a node deployed through much trial and error. There were a few things I did.

* Changed the VM types to ones available in the quotas for my subscription.

* Enabled `Allow Blob anonymous access` on the storage account used by the cluster.

* Create the `cyclecloud` container with `Container` access level, in the storage account used by the cluster.

* For testing, I opened up the firewall to allow everything in and out so that slurm could deploy.

Please give those a try and see how they work for you.
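The two storage-account steps in the list above could also be done from the Azure CLI. This is a rough sketch under assumptions: the storage account and resource group names are placeholders, `run` and the two function names are hypothetical helpers, and `DRY_RUN` is a convention of this sketch (not an `az` feature) that prints the commands instead of running them.

```shell
#!/bin/sh
# Placeholder values: replace with your own before running.
STORAGE_ACCOUNT="stgccmytreabcd"   # pattern from the thread: stgcc{tre-id}{4 random characters}
RESOURCE_GROUP="rg-mytre"          # assumption: resource group holding the storage account

# Print the command when DRY_RUN is set; otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Enable "Allow Blob anonymous access" on the storage account.
allow_anonymous_blob_access() {
  run az storage account update \
    --name "$STORAGE_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" \
    --allow-blob-public-access true
}

# Create the cyclecloud container with the Container access level.
create_cyclecloud_container() {
  run az storage container create \
    --name cyclecloud \
    --account-name "$STORAGE_ACCOUNT" \
    --public-access container
}
```

Calling the two functions with `DRY_RUN=1` set prints the `az` commands for review before anything is changed. If the `cyclecloud` container already exists, `az storage container set-permission --public-access container` is the command that changes its access level.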

Thank you @tim-allen-ck!

  1. For me, the stgcc{tre-id}{4 random characters} storage account has 'Blob anonymous access' set to Enabled by default. image

  2. Also, I've changed the cyclecloud container access level to Container access level. image

  3. Then added some new rules with these values:

image

image

image


Now, when clicking on 'Start', the nodes status changes to 'acquiring', then 'preparing':

image

However, the final status of the scheduler node gets stuck on 'Error configuring software': image

Here is the error message: image


Question: Did I set up the firewall correctly, or is another method recommended?

tim-allen-ck commented 2 months ago

Hi @BiologyGeek, it looks like you have set up the firewall correctly. I created a PR (#4040) with the updated Terraform code for the firewall. What size VMs are you deploying Slurm with?

BiologyGeek commented 2 months ago

Hi @tim-allen-ck, thank you!

This is the list of VM sizes I'm using for deploying Slurm:

marrobi commented 2 months ago

I've been looking into this for another scenario. I am seeing outbound traffic blocked to locations such as azcopyvnext.azureedge.net. It looks like something has changed that means the cluster now has external dependencies. Opening this (azcopy) up in the firewall is obviously a data exfiltration risk.

Will have a dig and see if I can work out a way around it.

marrobi commented 2 months ago

Reproduced the error on the cluster itself: image

BiologyGeek commented 2 months ago

Thank you @marrobi! Even after allowing all connections through the firewall, I am still experiencing issues. Could you please take a look and confirm whether the method used here https://github.com/microsoft/AzureTRE/issues/4021#issuecomment-2238980844 is the right way to allow all connections and rule out the firewall? Or is there another method you would suggest?

marrobi commented 2 months ago

Just sharing my work in progress: I'm getting further using this as the init script, but I'm still working through issues:

#!/bin/sh
# Set these to your own TRE ID and region.
TRE_ID="mrtredemo2"
REGION="westeurope"

# Point the AlmaLinux repos at the TRE Nexus proxy instead of the public mirrors.
ls /etc/yum.repos.d/*.repo | xargs sed -i 's/mirrorlist/# mirrorlist/g'
ls /etc/yum.repos.d/*.repo | xargs sed -i "s,# baseurl=https://repo.almalinux.org/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/,g"

# Install EPEL, then point its repos at the Nexus proxy as well.
yum -y install epel-release
ls /etc/yum.repos.d/*.repo | xargs sed -i 's/metalink/# metalink/g'
ls /etc/yum.repos.d/*.repo | xargs sed -i "s,#baseurl=https://download.example/pub/epel/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/fedoraproject/pub/epel/,g"

yum -y install python3 python3-pip

# Send pip traffic through the Nexus PyPI proxy.
sudo tee /etc/pip.conf <<EOF
[global]
index = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/pypi/pypi
index-url = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/pypi/simple
trusted-host = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com
EOF

# Add the CycleCloud yum repo via Nexus (tee, so the write itself runs under sudo).
sudo tee /etc/yum.repos.d/cyclecloud.repo <<EOF
[cyclecloud]
name=cyclecloud
baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/microsoft-yumrepos/cyclecloud
gpgcheck=1
gpgkey=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/microsoft-keys/microsoft.asc
EOF

# Import the AlmaLinux GPG key through the proxy.
rpm --import https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/almalinux/RPM-GPG-KEY-AlmaLinux

marrobi commented 2 months ago

Can confirm that works:

image

Will try to get a PR with this in tomorrow.
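To see what the two AlmaLinux `sed` rewrites in the init script above do, here is a small sketch that applies the same expressions to a sample repo stanza. The stanza is an illustrative assumption about what an AlmaLinux `.repo` file looks like, `rewrite_repo` and `sample_repo` are hypothetical helpers, and the sketch reads stdin rather than editing files in place, so it does not touch the system's repo files.

```shell
#!/bin/sh
TRE_ID="mytre"      # example value from this thread
REGION="eastus"     # example value from this thread

# Same two sed expressions as the init script, applied to stdin instead of in place.
rewrite_repo() {
  sed -e 's/mirrorlist/# mirrorlist/g' \
      -e "s,# baseurl=https://repo.almalinux.org/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/,g"
}

# Illustrative sample only; real repo files may differ.
sample_repo() {
  cat <<'EOF'
[baseos]
name=AlmaLinux $releasever - BaseOS
mirrorlist=https://mirrors.almalinux.org/mirrorlist/$releasever/baseos
# baseurl=https://repo.almalinux.org/almalinux/$releasever/BaseOS/$basearch/os/
enabled=1
EOF
}

sample_repo | rewrite_repo
```

In the output the `mirrorlist` line is commented out and the `baseurl` line is uncommented and pointed at the Nexus host, so yum should then fetch packages through Nexus rather than the public mirrors.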

marrobi commented 2 months ago

@BiologyGeek I've done this from scratch today and it seemed to work fine. Can you try the new bundle in the PR and follow the steps in the docs, including the updated init script?

If all works, then @tim-allen-ck, I suggest we get this merged, as it closes off a number of issues.

Thanks.