Open BiologyGeek opened 3 months ago
Hi @BiologyGeek, this still looks like a network connection issue. The error "urllib2.URLError urlopen error errno 104 socket closed" typically corresponds to "connection reset by peer".
The first thing to check is the firewall: the request is trying to reach a URL, so is that URL allowed in the firewall to start with? After that I would look at the NSGs and routing, but I think this will be firewall related.
Thank you @Danny-Cooke-CK!
I will go through checking these items, but my quick question is: Did I select the correct subnet ID during the selection process? It was `vnet-mytre-SharedSubnet [10.1.1.0/24]`.
Select your required settings. In the Subnet ID box, choose the ServicesSubnet in the resource group and virtual network containing the 4 digit workspace ID. Click Next.
Like Marcus said, try using the `{tre}-ws-{id}-ServicesSubnet`.
Did you manage to get it working?
Thank you @marrobi for highlighting this point! I made another attempt, but the exact same error occurred.
Additional description: Here is what I did after deleting the previous Azure CycleCloud shared service and creating a new one from the TRE UI. I followed these steps carefully:
Provided user details, including the SSH key, when creating the CycleCloud server instance.
Selected the same region as my TRE deployment, left the resource group as the default "
In the Subnet ID box, chose the `{tre}-ws-{id}-ServicesSubnet`.
Under advanced settings, unchecked "Return Proxy" and "Public Head node".
In the cloud-init section, I pasted the provided script with modified variables:
TRE_ID="mytre"
REGION="eastus"
Note 1: I pasted the script in all tabs of the cloud-init section (scheduler, dynamic, hpc, htc, login, scheduler-ha). Is that the correct action?
Note 2: I did not remove the quotes `""` after `TRE_ID=` and `REGION=`. Should I do it like this instead: `TRE_ID=mytre`?
Added a second user to the cluster with 'node access permission' and the same SSH public key as used when creating the CycleCloud server instance.
Question: What methods can help diagnose this issue?
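On Note 2: in a POSIX shell, `TRE_ID="mytre"` and `TRE_ID=mytre` assign the same value, because the quotes are shell syntax and are not stored; quoting only matters when the value contains spaces or other special characters. A minimal sketch to check this locally, using the `mytre` and `eastus` values from the steps above (the `TRE_ID_UNQUOTED` and `nexus_host` names are illustrative and not part of the provided script):

```shell
#!/bin/sh
# Quoted and unquoted assignments store the same value; the quotes are shell syntax.
TRE_ID="mytre"
TRE_ID_UNQUOTED=mytre   # illustrative name, only here for the comparison
REGION="eastus"

if [ "$TRE_ID" = "$TRE_ID_UNQUOTED" ]; then
    echo "same value: $TRE_ID"
fi

# The variables expand the same way inside the URLs an init script builds.
nexus_host="nexus-$TRE_ID.$REGION.cloudapp.azure.com"
echo "$nexus_host"
```

On that basis the quotes are unlikely to be the cause of the error.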
Thank you @tim-allen-ck! I tried, but the same error messages were observed. Additionally, I still can't use Azure Bastion and see the same issue (https://github.com/microsoft/AzureTRE/issues/3933#issuecomment-2132414705) when attempting to use Bastion. I have been using a virtual machine within TRE and a browser with a private IP address to open the Azure CycleCloud page. Could this be a sign that something went wrong?
Guys, could you please check whether you can set up Slurm and verify that a node can spin up, or am I the only one experiencing this issue?
@tim-allen-ck can you please look into this ?
hi @BiologyGeek I'll have a look and give you an update by the end of the week.
Thank you so much @PoojanumN and @tim-allen-ck! In the meantime, do you have any advice or tests that might help diagnose the issue?
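One low-risk test is to probe the endpoints the cluster needs from a VM inside the TRE and look at how each request fails, since errno 104 ("connection reset by peer") suggests something on the path is dropping the connection. The sketch below is only an assumption about how one might do that with `curl`: `explain_curl_exit` and `probe` are hypothetical helper names, the exit codes are curl's documented ones, and the hints beside them are a rough interpretation rather than a diagnosis.

```shell
#!/bin/sh
# Map a curl exit status to a short hint. The numbers are curl's documented
# exit codes; the wording of each hint is an interpretation.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "connect: could not connect to host" ;;
        28) echo "timeout: no response within the time limit" ;;
        35) echo "tls: TLS handshake failed" ;;
        56) echo "reset: failure receiving network data" ;;
        *)  echo "other: curl exit code $1" ;;
    esac
}

# Request a URL, discard the body, and report how the attempt ended.
probe() {
    url="$1"
    curl --silent --output /dev/null --max-time 10 "$url"
    status=$?
    echo "$url -> $(explain_curl_exit "$status")"
}

# Example (run from a VM inside the TRE; the host uses placeholder values):
#   probe "https://nexus-mytre.eastus.cloudapp.azure.com/"
```

A `reset`, `tls`, or `timeout` result for a URL that works from outside the TRE would be consistent with the firewall or routing blocking it, which can then be compared against the firewall logs.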
Hi @BiologyGeek, I've managed to get a node deployed through much trial and error. There were a few things I did:
* Changed the VM types to ones available in the quotas for my subscription.
* Enabled `Allow Blob anonymous access` on the storage account used by the cluster.
* Created the `cyclecloud` container with `Container` access level in the storage account used by the cluster.
* For testing, opened up the firewall to allow everything in and out so that Slurm could deploy.
Please give those a try and see how they work for you.
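The two storage account changes can also be made from the Azure CLI instead of the portal. This is a hedged sketch, not the method used in the thread: `run` and `open_cyclecloud_storage` are hypothetical helpers, the account and resource group names are placeholders, and the script only prints the `az` commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print each command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: storage account used by the cluster, $2: its resource group (placeholders).
open_cyclecloud_storage() {
    account="$1"
    group="$2"
    # Allow anonymous blob access at the account level.
    run az storage account update --name "$account" --resource-group "$group" \
        --allow-blob-public-access true
    # Create the cyclecloud container with Container-level anonymous access.
    run az storage container create --name cyclecloud --account-name "$account" \
        --public-access container
}

# Placeholder names; substitute the real storage account and resource group.
open_cyclecloud_storage "stgccmytreXXXX" "rg-mytre"
```

Anonymous access widens what can read the container, so this is best treated as a troubleshooting step to revisit once the cluster deploys.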
Thank you @tim-allen-ck!
For me, the `stgcc{tre-id}{4 random characters}` storage account has 'Blob anonymous access' set to Enabled by default.
Also, I've changed the `cyclecloud` container access level to `Container`.
Then added some new rules with these values:
Now, when clicking on 'Start', the nodes status changes to 'acquiring', then 'preparing':
However, the final status of the scheduler node gets stuck on 'Error configuring software':
Here is the error message:
Question: Did I set up the firewall correctly, or is another method recommended?
Hi @BiologyGeek, it looks like you have set up the firewall correctly. I created a PR with the updated Terraform code for the firewall: #4040. What size VMs are you deploying Slurm with?
Hi @tim-allen-ck, thank you!
This is the list of VM sizes I'm using for deploying Slurm:
I've been looking into this for another scenario, and I am seeing outbound traffic blocked to locations such as `azcopyvnext.azureedge.net`. It looks like something has changed that means the cluster now has external dependencies. Opening this (azcopy) up in the firewall is obviously a data exfiltration risk.
Will have a dig and see if I can work out a way around it.
Reproduced the error on the cluster itself:
Thank you @marrobi! Even after allowing all connections through the firewall, I am still experiencing some issues. Could you please take a look and confirm if the method used here https://github.com/microsoft/AzureTRE/issues/4021#issuecomment-2238980844 is the right method to allow all connections and remove firewall impacts? Or is there another method you would suggest?
Just sharing my work in progress: I'm getting further using this as the init script, but still working through issues:
#!/bin/sh
# Set these to match your TRE deployment.
TRE_ID="mrtredemo2"
REGION="westeurope"
# Disable mirrorlist lookups and point the AlmaLinux repos at the TRE Nexus proxy.
ls /etc/yum.repos.d/*.repo | xargs sed -i 's/mirrorlist/# mirrorlist/g'
ls /etc/yum.repos.d/*.repo | xargs sed -i "s,# baseurl=https://repo.almalinux.org/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/,g"
# Install EPEL, then redirect its repos to Nexus in the same way.
yum -y install epel-release
ls /etc/yum.repos.d/*.repo | xargs sed -i 's/metalink/# metalink/g'
ls /etc/yum.repos.d/*.repo | xargs sed -i "s,#baseurl=https://download.example/pub/epel/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/fedoraproject/pub/epel/,g"
yum -y install python3 python3-pip
# Point pip at the Nexus PyPI proxy.
sudo tee /etc/pip.conf <<EOF
[global]
index = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/pypi/pypi
index-url = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/pypi/simple
trusted-host = https://nexus-$TRE_ID.$REGION.cloudapp.azure.com
EOF
# Add the CycleCloud yum repo, served through Nexus.
sudo cat > /etc/yum.repos.d/cyclecloud.repo <<EOF
[cyclecloud]
name=cyclecloud
baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/microsoft-yumrepos/cyclecloud
gpgcheck=1
gpgkey=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/microsoft-keys/microsoft.asc
EOF
# Import the AlmaLinux GPG key through Nexus.
rpm --import https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/almalinux/RPM-GPG-KEY-AlmaLinux
Can confirm that works:
Will try to get a PR with this in tomorrow.
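The first two `sed` rewrites in the init script above can be tried on sample text before running them against `/etc/yum.repos.d` on a node. A minimal sketch, with several assumptions: `rewrite_repo` and `sample_repo` are hypothetical helpers, the sample stanza is only modelled on an AlmaLinux repo file, `mytre` and `eastus` are placeholder values, and the substitutions read stdin rather than editing files in place.

```shell
#!/bin/sh
# Placeholder values for illustration; use your own TRE ID and region.
TRE_ID="mytre"
REGION="eastus"

# Apply the same two substitutions as the init script, reading stdin and
# writing stdout instead of using sed -i on the real repo files.
rewrite_repo() {
    sed -e 's/mirrorlist/# mirrorlist/g' \
        -e "s,# baseurl=https://repo.almalinux.org/,baseurl=https://nexus-$TRE_ID.$REGION.cloudapp.azure.com/repository/almalinux/,g"
}

# Sample stanza modelled on an AlmaLinux repo file (an assumption, not copied
# from a real node). The quoted heredoc keeps $releasever and $basearch literal.
sample_repo() {
    cat <<'EOF'
[baseos]
name=AlmaLinux $releasever - BaseOS
mirrorlist=https://mirrors.almalinux.org/mirrorlist/$releasever/baseos
# baseurl=https://repo.almalinux.org/almalinux/$releasever/BaseOS/$basearch/os/
enabled=1
EOF
}

sample_repo | rewrite_repo
```

In the output the `mirrorlist` line is commented out and the `baseurl` line is uncommented and pointed at the Nexus host, so package requests go through the TRE rather than directly to the internet.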
@BiologyGeek I've done this from scratch today and it seemed to work fine. Can you try the new bundle in the PR and follow the docs steps, including the updated init script?
If all works, then @tim-allen-ck, I suggest we get this merged as it closes off a number of issues.
Thanks.
Hello team,
After resolving the Azure CycleCloud connectivity issue with subscription resources (https://github.com/microsoft/AzureTRE/issues/3933), a new problem is observed when attempting to run the Scheduler VM. Here is the error message:
Hint: During the Slurm setup, this Subnet ID was selected: mytre: vnet-mytre-SharedSubnet [10.1.1.0/24]
I am wondering what could be the root cause of this issue?