microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

'Service Bus Namespace' Continues Running Even After `$ make tre-stop` #3953

Open BiologyGeek opened 1 month ago

BiologyGeek commented 1 month ago

Hello team,

Is it expected behavior for the 'Service Bus Namespace' to keep running even after executing the `$ make tre-stop` command?

This screenshot was captured after running the `$ make tre-stop` command: [screenshot]

Given that the Premium tier of this service is not inexpensive, is there a way to turn it off or disable it when not needed?

jonnyry commented 1 month ago

It's not possible to temporarily stop the Service Bus (or suspend the billing) without deletion. Thread below when I posed a similar question:

https://github.com/microsoft/AzureTRE/issues/3782

BiologyGeek commented 4 weeks ago

> It's not possible to temporarily stop the Service Bus (or suspend the billing) without deletion. Thread below when I posed a similar question:
>
> #3782

Thank you @jonnyry!

I deleted the 'Service Bus Namespace', but this resulted in abnormal activity in the 'Log Analytics workspace' and a lot of data ingestion, which caused a higher cost than the Service Bus Namespace itself.

Is there a way to prevent abnormal activity after removing the Service Bus Namespace? @marrobi

[screenshot]

marrobi commented 4 weeks ago

@BiologyGeek I guess the logs are coming from the API web app and resource processor VMSS.

So if you stop both of them, as per the other issue you raised, that should help - stopping the web app won't save money, but it would hopefully stop these errors being logged.

marrobi commented 4 weeks ago

It might be that someone could look at using a Standard SKU Service Bus, with a config option for users who don't require the Service Bus to be on a private network - for example, for development purposes.

jonnyry commented 4 weeks ago

Would it be possible to switch to one of the other (less expensive) queue/event-type Azure services - Queue Storage, Event Grid, Event Hubs? Are there features/characteristics of Service Bus that we specifically require?

marrobi commented 4 weeks ago

I think it's session support. @damoodamoo @tamirkamara may be able to advise.

damoodamoo commented 4 weeks ago

We do require session support for ordered delivery, unfortunately. I think that's also part of the Standard SKU though, so a 'dev' switch to allow it to be deployed as Standard would probably be the best bet...
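A 'dev' switch like that could be sketched in Terraform roughly as follows. This is a hypothetical illustration, not AzureTRE's actual config - the variable and resource names are assumptions:

```hcl
# Hypothetical flag - not an existing AzureTRE variable.
variable "enable_dev_service_bus" {
  description = "Deploy Service Bus as Standard SKU (cheaper, but no private endpoints) for dev/test."
  type        = bool
  default     = false
}

resource "azurerm_servicebus_namespace" "sb" {
  name                = "sb-example"
  location            = "uksouth"
  resource_group_name = "rg-example"

  # Standard still supports sessions (needed for ordered delivery);
  # Premium is required for private endpoints / VNet integration.
  sku      = var.enable_dev_service_bus ? "Standard" : "Premium"
  capacity = var.enable_dev_service_bus ? 0 : 1 # capacity only applies to Premium
}
```

The queues themselves would keep `requires_session = true` in either SKU, since sessions are what provide the ordered delivery guarantee.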

jonnyry commented 5 days ago

Thanks @marrobi @damoodamoo

I'm guessing some additional network configuration would be required once the Service Bus SKU was switched to Standard, since the private endpoints & VNET integration are no longer available...

Would that do it?

In terms of locking down the source IPs/subnets, am I right in thinking the following components connect to the Service Bus?:

marrobi commented 5 days ago

I think that's it.

Re the firewall rules, you can do it in an ARM template ( https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-ip-filtering#use-resource-manager-template ), so if it's not supported in Terraform, I'd think it can be done using the AzAPI provider.

marrobi commented 5 days ago

I think it can be done here: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/servicebus_namespace#ip_rules
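In the azurerm provider, `ip_rules` sits inside the namespace's `network_rule_set` block. A minimal sketch - names and CIDR are illustrative, not AzureTRE's actual config (note that Service Bus IP filtering is a Premium-tier feature):

```hcl
resource "azurerm_servicebus_namespace" "example" {
  name                = "sb-example"
  location            = "uksouth"
  resource_group_name = "rg-example"
  sku                 = "Premium" # IP filtering requires the Premium tier
  capacity            = 1

  network_rule_set {
    default_action = "Deny"
    # Only these source IPs/CIDRs may reach the namespace.
    ip_rules = ["203.0.113.0/24"]
  }
}
```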

jonnyry commented 5 days ago

Thanks.

Re opening the Azure Firewall outbound to the Service Bus FQDN - it's looking trickier than I first thought.

I attempted to use a network rule with a Service Tag instead of an FQDN:

[screenshot]

A network rule on the IP does work; however, I imagine the IP will not stay the same for long.

Not sure there's any easy solution to this one!

marrobi commented 5 days ago

Is the purpose of this PR to reduce costs when NOT in production? If so, does the IP filtering matter, as long as the non-VNet Service Bus is only enabled by a clear flag?

jonnyry commented 5 days ago

> Is the purpose of this PR to reduce costs when NOT in production?

Yes correct - to reduce the cost of dev/test instances.

Production would use premium SKUs (private endpoints/VNET integration etc).

> If so, does the IP filtering matter, as long as the non-VNet Service Bus is only enabled by a clear flag?

I suppose not (or less so, anyway). The firewall still needs opening to allow traffic out to the Service Bus public IP, and it would be preferable if it weren't open to every destination.

marrobi commented 4 days ago

We already do this for local dev:

```hcl
ip_rules = var.enable_local_debugging ? [local.myip] : null
```

I'm not sure it matters if it's open to the internet for dev purposes?

jonnyry commented 4 days ago

No I agree, not that important.

However it's opening the firewall in the outbound direction to the service bus FQDN that's tricky...

[screenshot]

It can be opened to the Service Bus IP, but that's not ideal - I don't know how frequently the IP changes.

TonyWildish-BH commented 1 day ago

adding my £0.02, I don't think we need the Service Bus at all. It's just a FIFO queue, and there are much cheaper ways to implement that than a premium tier Service Bus, especially given that the traffic is so low that performance will never be an issue.

I also don't think it's a good idea to have different architectural flavours in dev/test vs. production, that's asking for trouble.

So I'd like to see this expense removed from the production instance(s), not just dev or test.

marrobi commented 1 day ago

@TonyWildish-BH There probably are other ways, but as with everything there is a time and effort to implement vs the actual cost of using the managed offering. Maybe you can suggest a design and submit a PR?

(agree test and prod should be consistent, but for dev, less so - we often develop using local compute for the API, resource processor etc so we can debug and have a shorter dev loop)

damoodamoo commented 1 day ago

@TonyWildish-BH We use Service Bus with sessions for guaranteed ordered delivery. This is required when multiple operations stack up against a single resource, and there are multiple nodes/threads servicing those requests.

I'm not hearing that the cost is really a factor in production, so if it's a case of saving costs in a dev then implementing a switch to use Standard SKU and skip a few PEs sounds pretty reasonable to me.

It's a pain to pay so much more for private networking, but it's definitely a requirement for most prod workloads.

jonnyry commented 1 day ago

> adding my £0.02, I don't think we need the Service Bus at all. It's just a FIFO queue, and there are much cheaper ways to implement that than a premium tier Service Bus, especially given that the traffic is so low that performance will never be an issue.
>
> I also don't think it's a good idea to have different architectural flavours in dev/test vs. production, that's asking for trouble.
>
> So I'd like to see this expense removed from the production instance(s), not just dev or test.

@TonyWildish-BH yes I've recently come to that conclusion also. An enterprise message queue seems unnecessary (and costly) for tens or hundreds of messages a day.

I've parked trying to refactor the Service Bus to use a Standard SKU for dev/test, as there are too many gnarly changes required to make it work - as you say, asking for trouble when your dev/test flavour is that different from prod.

@damoodamoo unfortunately it's more than just removing a few PEs. Here are the key issues:

  1. Outbound firewall rule. Service Bus Standard runs on a public IP, so traffic requires a route out through the firewall. It's not possible to lock this down to an FQDN due to the non-443 ports involved (without re-upgrading the firewall SKU, which defeats the object). You can refactor the Service Bus client to use AMQP over WebSockets, which would allow a firewall application rule on the FQDN, but the library has a bug that causes this to fail: https://github.com/Azure/azure-sdk-for-python/issues/31067

  2. Firewall deployment catch-22. During initial deployment, the TRE (including the Service Bus) is deployed before the firewall is in place, so the resource processor initially connects out to the Service Bus without the firewall's routing rules in place. As the resource processor installs the firewall routing rules, its own Service Bus connection breaks (since it can no longer go directly to the internet). This causes all kinds of fun, with the resource processor stuck in a loop attempting to install the firewall. There's no easy way out of that without a reasonable amount of rewriting.

  3. Large messages. According to this comment, we might encounter large messages:

https://github.com/microsoft/AzureTRE/blob/9e49ed69d1ca074f7cae54647bd16001e59644a8/core/terraform/servicebus.tf#L43-L45

Service Bus Standard SKU won't cope with these.
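For what it's worth, the application-rule approach from point 1 (only viable if the AMQP-over-WebSockets SDK bug were fixed, since application rules are limited to HTTP/HTTPS ports) might look something like this Terraform sketch - all names and CIDRs are illustrative, not AzureTRE's actual config:

```hcl
# Sketch only: lock outbound traffic to the Service Bus FQDN over 443.
# This depends on clients using AMQP over WebSockets, which the linked
# SDK bug currently blocks.
resource "azurerm_firewall_application_rule_collection" "service_bus" {
  name                = "service-bus-outbound"
  azure_firewall_name = "fw-example"
  resource_group_name = "rg-example"
  priority            = 200
  action              = "Allow"

  rule {
    name             = "allow-service-bus-fqdn"
    source_addresses = ["10.0.0.0/16"]
    target_fqdns     = ["sb-example.servicebus.windows.net"]

    protocol {
      port = 443
      type = "Https"
    }
  }
}
```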

TonyWildish-BH commented 1 day ago

Thanks for the quick feedback. We've got it on our backlog to do something about the Service Bus, but it hasn't risen far enough up the stack yet - probably in a couple of months. I'll be happy to post more details here when we get there.

damoodamoo commented 1 day ago

@jonnyry - thanks for the comments there, there was a bunch of stuff I'd not realised.