microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Nexus VM gets wedged? #4074

Open TonyWildish-BH opened 2 months ago

TonyWildish-BH commented 2 months ago

Description

I'm trying to merge the current AzureTRE into my own repository to get the latest changes. The merge went smoothly, no conflicts, and now I'm testing it.

The issue I see is that the Nexus VM gets wedged after a while. I'm able to create one or two VMs, either Windows or Linux, and they work, booting to completion. However, if I deploy more VMs, they eventually get stuck, with Nexus failing to respond.

Restarting the Nexus VM clears things up for a while, but the problem recurs just a short while later, when I deploy more VMs.

I'm able to connect to the Nexus VM in the Azure portal, via the bastion, but when the problem happens, that session gets wedged too. It's a whole-VM phenomenon.

I haven't changed anything relating to any shared services in my TRE, and in particular, I haven't touched Nexus at all; the configuration there is exactly as-is in this repo. So while I can't rule out that it's something I've done, I'm wondering if anyone else has seen this, or anything like it?

Any suggestions of what to look for would be greatly appreciated.

TonyWildish-BH commented 2 months ago

Update: This is reproducible in the current HEAD of this repository, so I'd like to redefine this as a bug, not a question.

tim-allen-ck commented 2 months ago

Hey @TonyWildish-BH what version of Nexus do you have deployed?

TonyWildish-BH commented 2 months ago

I pulled the HEAD a week ago, it's whatever's there, I haven't touched the nexus code. We did a separate test, pulling this repo yesterday, and that shows the same problem. That's why I'm thinking it's not anything I've done, since that second test had no modifications whatsoever w.r.t. this repo.

tim-allen-ck commented 2 months ago

Sure. What version is the nexus template?

TonyWildish-BH commented 2 months ago

3.0.0

tim-allen-ck commented 2 months ago

Thanks. I'll take a look, see if I can reproduce.

TonyWildish-BH commented 2 months ago

hi @tim-allen-ck, did you get a chance to look at this?

What I have found since is that the Windows VMs seem not to provoke the problem, though the Linux VMs definitely do. Probably because they have so much more to update than the Windows VMs.

akolensky commented 2 months ago

Hi @tim-allen-ck, I understand it is a busy season - and wondered if this has been looked into?

marrobi commented 2 months ago

@akolensky what troubleshooting steps have you tried? It's not something we've seen elsewhere.

TonyWildish-BH commented 2 months ago

All I've managed to deduce so far is that it seems to be related to the Linux VMs doing a mass update. The load average in the nexus container goes over 40, and it stops responding, completely - which isn't surprising at that load average.

Rebooting the Nexus VM clears the issue, but a 'restart' in the portal doesn't work because the VM doesn't respond to it; you have to 'stop' and 'start', which usually takes a very long time.

The problem is repeatable, but not guaranteed. With a fresh install of nexus, it wedges about ⅔ of the time, on one of the first 2 or 3 Linux VMs - often the first. It's certainly not rare.
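To catch the load spike described above before the VM stops responding, the load average can be sampled and compared against the core count. The following is a minimal sketch, not part of AzureTRE: the function names and the threshold of 2.0 runnable tasks per core are assumptions chosen for illustration, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import time


def load_per_core(load_1min: float, cores: int) -> float:
    """Return the 1-minute load average normalised by CPU core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_1min / cores


def is_overloaded(load_1min: float, cores: int, threshold: float = 2.0) -> bool:
    """True when load per core exceeds the (hypothetical) threshold."""
    return load_per_core(load_1min, cores) > threshold


def sample(interval_s: float = 5.0, samples: int = 3) -> None:
    """Print a few timestamped readings for the machine this runs on."""
    cores = os.cpu_count() or 1
    for _ in range(samples):
        one_min, _, _ = os.getloadavg()  # Unix-only call
        flag = "OVERLOADED" if is_overloaded(one_min, cores) else "ok"
        print(f"{time.strftime('%H:%M:%S')} load1={one_min:.2f} cores={cores} {flag}")
        time.sleep(interval_s)
```

Calling `sample()` in a loop on the Nexus VM while a Linux workspace VM boots would show whether the load climbs gradually or jumps at once. With the figures quoted in this thread, a load of 40 on 8 cores is 5 runnable tasks per core, well past the illustrative threshold.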

marrobi commented 2 months ago

Have you added some custom repositories?

We've got instances running elsewhere and Nexus has been working without issue for long periods. So something must be different in your instance.

Have you tried using a larger VM?

It might be that the container needs some resource limits, so as to leave the host some resources.
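As a rough sketch of the resource-limit suggestion, the snippet below computes caps that leave a reserve for the host and builds a `docker update` command line from them. The container name `nexus`, the helper names, and the reserve sizes (1 CPU, 2 GiB) are assumptions for illustration; the `--cpus`, `--memory` and `--memory-swap` flags are standard `docker update` options, but whether they suit the AzureTRE Nexus deployment has not been verified here.

```python
def container_limits(host_cpus: int, host_mem_gib: int,
                     reserve_cpus: int = 1, reserve_mem_gib: int = 2) -> dict:
    """Work out CPU and memory caps that leave a reserve for the host OS."""
    if host_cpus <= reserve_cpus or host_mem_gib <= reserve_mem_gib:
        raise ValueError("host is too small for the requested reserve")
    return {
        "cpus": host_cpus - reserve_cpus,
        "mem_gib": host_mem_gib - reserve_mem_gib,
    }


def docker_update_command(container: str, limits: dict) -> list[str]:
    """Build (but do not run) a `docker update` command applying the caps."""
    mem = f"{limits['mem_gib']}g"
    return [
        "docker", "update",
        "--cpus", str(limits["cpus"]),
        "--memory", mem,
        "--memory-swap", mem,  # same value as --memory: no extra swap
        container,
    ]


# Example: the 8-core, 64 GB VM mentioned later in this thread.
command = docker_update_command("nexus", container_limits(8, 64))
print(" ".join(command))
```

The command is only assembled as a list so it can be reviewed first; passing it to `subprocess.run` on the VM would apply it.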

TonyWildish-BH commented 2 months ago

Marcus, this is in fresh installations, predominantly. It looks like a first-time cache-filling problem where the requests are not throttled, and the server gets overloaded. After rebooting, it tends to behave itself, but still spits the dummy every now and then.

This happens in a virgin installation, with unmodified code, as stated. No custom anything. We see it in the pure MS code base, and also in our own, where we have not touched anything relating to nexus, or to any of the core resources.

This is repeatable, three different people using three different setups have seen it, including one outside Barts. It's not our environment.

I did try using a larger VM (64 GB x 8 cores), but that didn't help.

Restricting the container isn't likely to help much, though it might let the host OS kill and restart it, at best. If the container is spawning > 40 threads, all bets are off, that's too many. My best guess is that the server needs throttling, which means either Nexus or Java VM configuration.

Do you know if Tim tried to reproduce it?
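The throttling idea raised in this comment can be illustrated with a generic concurrency cap. This is a sketch of the technique only, using a bounded semaphore in plain Python; it is not Nexus or JVM configuration, and `Throttle`, `fake_request` and `serve_burst` are hypothetical names invented for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Allow at most `limit` callers inside the guarded section at once."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency actually observed

    def run(self, work, *args):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return work(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_request(n: int) -> int:
    """Stand-in for serving one package download (hypothetical workload)."""
    time.sleep(0.01)
    return n * 2


def serve_burst(requests: int, workers: int, limit: int):
    """Fire `requests` tasks from `workers` threads through one throttle."""
    throttle = Throttle(limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda n: throttle.run(fake_request, n),
                                range(requests)))
    return results, throttle.peak
```

For example, `serve_burst(40, 40, 4)` submits 40 requests from 40 threads, yet the throttle never lets more than 4 run at the same moment; the rest wait for a free slot instead of piling onto the server.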

tim-allen-ck commented 2 months ago

Hi @TonyWildish-BH I've not been able to reproduce it. Was it only 1 or 2 VMs you'd deployed when you'd found the issue?

TonyWildish-BH commented 2 months ago

hi @tim-allen-ck, I've been able to reproduce it on the first Linux VM I boot in a new SDE. It happens about 50% of the time in that situation, more or less.

marrobi commented 1 month ago

What's the exact SKU you are using for the VM? What additional software is installed?

In the Terraform I can see it's a B series VM. If you are using the default, it might be that this isn't appropriate for your needs, given the nature of burstable CPU. I suggest you try a different SKU.

It would be useful if the SKU was a parameter.

Also, are you using VM images with packages preinstalled as recommended, or are you installing them using a startup script on the VM?

TonyWildish-BH commented 1 month ago

This has all happened with a completely unmodified installation from the HEAD of this repository. A fresh checkout of the code, with nothing changed. Not the Nexus VM, not the Linux template I'm trying to boot from it. Nothing.

I set my config.yaml at the top level and install, from scratch, following the instructions. I create a Linux VM, and with high probability, Nexus wedges.

marrobi commented 1 month ago

@akolensky are you able to help @TonyWildish-BH answer my question above? Thanks.

TonyWildish-BH commented 1 month ago

Hi Marcus,

> What's the exact SKU you are using for the VM? What additional software is installed?

SKU is 22_04-lts-gen2. As stated, there is no additional software installed. None.

> Also, are you using VM images with packages preinstalled as recommended, or are you installing them using a startup script on the VM?

As stated, I'm seeing this error on multiple installations. One is our own, with custom VMs that have nearly all the packages installed, the other is the unmodified Microsoft codebase, commit hash c3e4c8db9b8e548c2d498e34a6aa4a5796852401. That uses a cloud-init script to update the vanilla OS which comes with the TRE.

I see the issue in both these environments, therefore, this is not an issue of customisation from our side.

marrobi commented 1 month ago

That's the image SKU rather than the VM SKU. The VM SKU will be a letter followed by number(s).
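The difference between the two kinds of SKU can be sketched with a small parser. It assumes the common Azure size naming pattern of a `Standard_` or `Basic_` prefix, one or more family letters, then the vCPU count; the function names are hypothetical, and `Standard_B2ms` is used only as an example of a B series size, not as the confirmed Nexus default.

```python
import re
from typing import Optional, Tuple

# Prefix, family letters, vCPU count, e.g. "Standard_B2ms" -> ("B", 2).
_VM_SIZE = re.compile(r"^(?:Standard|Basic)_([A-Z]+)(\d+)")


def parse_vm_size(name: str) -> Optional[Tuple[str, int]]:
    """Return (family, vcpus) for a VM size name, or None if it is not one."""
    match = _VM_SIZE.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_burstable(name: str) -> bool:
    """True when the size belongs to the burstable B family."""
    parsed = parse_vm_size(name)
    return parsed is not None and parsed[0] == "B"


for candidate in ("Standard_B2ms", "Standard_D4s_v3", "22_04-lts-gen2"):
    print(candidate, parse_vm_size(candidate), is_burstable(candidate))
```

The image SKU `22_04-lts-gen2` does not match the pattern, which is how the two can be told apart. The size of a deployed VM can be read with `az vm show --resource-group <rg> --name <vm> --query hardwareProfile.vmSize`, where the resource group and VM name are placeholders.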

My thinking is you have something different going on on the VM. Antivirus maybe? That, in conjunction with the VM scripts, is causing all the credits to be used on the B series Nexus VM.

In addition, as per https://microsoft.github.io/AzureTRE/latest/tre-templates/user-resources/guacamole-linux-vm/, I suggest you use VM images in production.

TonyWildish-BH commented 1 month ago

Where do I find the VM SKU?

Whatever is happening on the VM is whatever happens out of the box, because we haven't modified it in any way at all. There is no customisation of the Nexus VM. We haven't changed anything there. We haven't installed anything extra. Nothing.

I'm aware of that recommendation, and we will indeed be using our own VM images, but I need this bug fixed before we can consider going into production.