microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

TRE cannot scale beyond about 32 projects #3920

Status: Closed (TonyWildish-BH closed 1 month ago)

TonyWildish-BH commented 6 months ago

Description

In my Azure TRE deployment I am trying to test the limits of scalability, since we eventually want to run with up to 150 projects at a time. Yesterday I created a large number of projects, and at about number 32 they started failing with the message: "Subscription ******* already contains 250 storage accounts with Standard Dns endpoints in location uksouth and the maximum allowed is 250."

From what I understand, if the storage accounts were to use AzureDnsZone endpoints instead of Standard, that would raise the limit to 5000, which should be enough for us?

My question is: is it sufficient to update storage.tf in various places to add dns_endpoint_type = "AzureDnsZone", or is there some reason that won't work?

Steps

The steps I have tried are:

  1. create a workspace
  2. go to 1, until failure
  3. look at the error message

Code

n/a

marrobi commented 6 months ago

@TonyWildish-BH I presume you have other things than the TRE in the subscription? I have run automated tests for 40 plus workspaces which do complete. It's not a quota I've seen others hit.

@SvenAelterman do you know anything about this? https://techcommunity.microsoft.com/t5/azure-storage-blog/public-preview-create-additional-5000-azure-storage-accounts/ba-p/3465466

TonyWildish-BH commented 6 months ago

Not much. There were 3 or 4 other workspaces with very little in them: maybe a VM, Guacamole, an ADF... They may have added to the storage account numbers, but the error message about the limit is clear enough, and we're going to hit it well before we reach production scale.

marrobi commented 6 months ago

Let us investigate. I take it the airlock is enabled on each of these workspaces? That could be the difference, as it creates a number of storage accounts. I'm not sure to what scale that has been tested.

TonyWildish-BH commented 6 months ago

Thanks. We do have the airlock enabled, and we'll have it on all our workspaces.

SvenAelterman commented 6 months ago

@marrobi: I am familiar with the (still in preview) DNS-zone based solution to exceed the 250-account limit per subscription. However, just turning this on for the account creation would not work in my estimation because several other Azure services aren't yet capable of dealing with it, including those that TRE leverages. Also, the TRE code should be inspected to determine that there are no hardcoded references to the "blob.core.windows.net" DNS namespace.
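The inspection for hardcoded DNS namespaces could be started with a small script. The following is a minimal sketch, not part of the TRE code base: the function names, the list of service prefixes, and the file suffixes are all assumptions chosen for illustration. It flags lines that mention a Standard endpoint suffix such as "blob.core.windows.net" so they can be reviewed by hand.

```python
import re
from pathlib import Path

# Standard storage endpoints end in "<service>.core.windows.net".
# The list of service prefixes here is an assumption; extend it as needed.
STANDARD_ENDPOINT = re.compile(
    r"\b(blob|file|queue|table|dfs|web)\.core\.windows\.net\b"
)


def find_hardcoded_endpoints(text):
    """Return (line_number, line) pairs that mention a Standard endpoint suffix."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if STANDARD_ENDPOINT.search(line)
    ]


def scan_tree(root, suffixes=(".tf", ".py", ".sh", ".yaml", ".yml", ".json")):
    """Yield (path, line_number, line) for every match in files under root.

    The suffix list is a hypothetical starting point, not an exhaustive
    description of the file types in the repository.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for number, line in find_hardcoded_endpoints(text):
                yield path, number, line
```

Calling scan_tree(".") from a checkout of the repository and printing each result would give a review list. A match is only a candidate: some references, such as private DNS zone names, may be intentional.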

@TonyWildish-BH: In the short term, I would recommend requesting an increase in the limit from 250 to 500 accounts per subscription per region using the process described here: https://learn.microsoft.com/azure/quotas/storage-account-quota-requests. This would then give you ~70 workspaces.
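The arithmetic behind an estimate like this can be sketched as follows. This is a rough illustration under stated assumptions, not a measured figure: it assumes the 250-account limit was reached with about 35 workspaces in the subscription (the roughly 32 test projects plus the 3 or 4 existing workspaces mentioned earlier in the thread), and that every workspace consumes about the same number of storage accounts. The function names are hypothetical.

```python
def max_workspaces(quota, accounts_observed, workspaces_observed):
    """Scale the observed workspace count linearly to a new account quota."""
    return quota * workspaces_observed // accounts_observed


def quota_needed(target_workspaces, accounts_observed, workspaces_observed):
    """Accounts needed for a target workspace count, rounded up."""
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive b.
    return -(-target_workspaces * accounts_observed // workspaces_observed)


# Assumption: 250 accounts were exhausted at about 35 workspaces.
ACCOUNTS_OBSERVED = 250
WORKSPACES_OBSERVED = 35

print(max_workspaces(500, ACCOUNTS_OBSERVED, WORKSPACES_OBSERVED))  # 70
print(quota_needed(150, ACCOUNTS_OBSERVED, WORKSPACES_OBSERVED))    # 1072
```

Under these assumptions a 500-account quota supports about 70 workspaces, while the 150 projects mentioned in the issue description would need a little over 1000 accounts in a single subscription and region.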

In the longer term, once the DNS-zone feature is GA, TRE maintainers could evaluate using DNS-zone based storage accounts instead. However (for many other governance reasons), I would advocate for deploying workspaces in different subscriptions, which would also address this issue: #1073.

TonyWildish-BH commented 5 months ago

Thanks for the reply, @SvenAelterman. A couple of follow-on questions:

I've already thrown my hat in the ring for deployment into different subscriptions. We'd like that so we can let people spend their own money on their own subscription and not have to concern ourselves with their costs. I'm not aware of any timeline for that to happen, though.

Regarding the issue of inspecting the code base for hardcoded references, that's a generic issue, in that there are many places where object names are derived from parameters instead of looked up from the resource that created the object. It would be really nice to have that cleaned up, but that's also for the future.

marrobi commented 4 months ago

@TonyWildish-BH did you manage to increase your subscription storage account limit?

TonyWildish-BH commented 4 months ago

Yes, thanks. I've not run a scaling test since, but I did manage to get the quota increased.




tim-allen-ck commented 3 months ago

Hi @TonyWildish-BH, can you confirm that you managed to increase the storage account limit?

TonyWildish-BH commented 3 months ago

Hi @tim-allen-ck. I've not run a test yet, but unless there's another limit somewhere, it should be OK. We can close this ticket; if I hit another issue I can re-open it or create a new ticket, as appropriate.

tim-allen-ck commented 3 months ago

Thanks, I'll update the docs to reference the limit and then close this ticket.