scylladb / scylla-cluster-tests

Tests for Scylla Clusters

Azure instance provisioning fails with "Operation could not be completed as it results in exceeding approved standardLSv3Family Cores quota" exception #9010

Open dimakr opened 5 days ago

dimakr commented 5 days ago

2 out of 3 builds of the rolling-upgrade-azure-image-test job for the 2024.1.11 patch release failed while provisioning DB instances, with the following error:

2024-09-29 05:18:03.824: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=e7c93c69-02c9-4cef-be46-459da8a4d466, source=UpgradeTest.SetUp()
exception=(OperationNotAllowed) Operation could not be completed as it results in exceeding approved standardLSv3Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 350, Current Usage: 344, Additional Required: 8, (Minimum) New Limit Required: 352. Setup Alerts when Quota reaches threshold. Learn more at https://aka.ms/quotamonitoringalerting . Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%226c268694-47ab-43ab-b306-3c5514bc4112%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardLSv3Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:352,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardLSv3Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests
Code: OperationNotAllowed
Message: Operation could not be completed as it results in exceeding approved standardLSv3Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 350, Current Usage: 344, Additional Required: 8, (Minimum) New Limit Required: 352. Setup Alerts when Quota reaches threshold. Learn more at https://aka.ms/quotamonitoringalerting . Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%226c268694-47ab-43ab-b306-3c5514bc4112%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardLSv3Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:352,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardLSv3Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests
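
As a side note, the remaining core quota for a VM family can be checked programmatically before provisioning. Below is a minimal sketch (not part of SCT) using the Azure Python SDK; it assumes `azure-identity` and `azure-mgmt-compute` are installed and that the subscription id is exposed via an `AZURE_SUBSCRIPTION_ID` environment variable:

```python
import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

REGION = "eastus"                 # region from the failing deployment
FAMILY = "standardLSv3Family"     # quota family reported in the error
CORES_NEEDED = 4 * 8              # 4 x Standard_L8s_v3 nodes, 8 vCPUs each

client = ComputeManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],  # assumed env var
)

# usage.list() yields per-family counters: current usage vs. approved limit
for usage in client.usage.list(REGION):
    if usage.name.value == FAMILY:
        free = usage.limit - usage.current_value
        print(f"{FAMILY}: {usage.current_value}/{usage.limit} cores used, {free} free")
        if free < CORES_NEEDED:
            print(f"Not enough quota for {CORES_NEEDED} more cores; "
                  f"a limit of at least {usage.current_value + CORES_NEEDED} is needed")
        break
```

`usage.list()` reports the same per-family counters that appear in the error message (current usage vs. limit), so a check like this could flag a tight quota before the deployment is attempted.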

Installation details

Cluster size: 4 nodes (Standard_L8s_v3)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: /CommunityGalleries/scylladb-7e8d8a04-23db-487d-87ec-0e175c0615bb/Images/scylla-enterprise-2023.1/Versions/2023.1.11 (azure: undefined_region)

Test: rolling-upgrade-azure-image-test

Test id: a2350a9d-188d-4cfa-856b-716dea14cf91

Test name: enterprise-2024.1/rolling-upgrade/rolling-upgrade-azure-image-test

Test method: upgrade_test.UpgradeTest.test_rolling_upgrade

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor a2350a9d-188d-4cfa-856b-716dea14cf91`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a2350a9d-188d-4cfa-856b-716dea14cf91)
- Show all stored logs command: `$ hydra investigate show-logs a2350a9d-188d-4cfa-856b-716dea14cf91`

## Logs:

- **sct-runner-events-a2350a9d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a2350a9d-188d-4cfa-856b-716dea14cf91/20240929_051823/sct-runner-events-a2350a9d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a2350a9d-188d-4cfa-856b-716dea14cf91/20240929_051823/sct-runner-events-a2350a9d.tar.gz)
- **sct-a2350a9d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a2350a9d-188d-4cfa-856b-716dea14cf91/20240929_051823/sct-a2350a9d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a2350a9d-188d-4cfa-856b-716dea14cf91/20240929_051823/sct-a2350a9d.log.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.1/job/rolling-upgrade/job/rolling-upgrade-azure-image-test/22/)
[Argus](https://argus.scylladb.com/test/553d2d9e-b170-4ccd-b5c6-44f0673ea2d1/runs?additionalRuns[]=a2350a9d-188d-4cfa-856b-716dea14cf91)
dimakr commented 5 days ago

I did a quick check in the cloud usage reports of how many QA instances we had running in Azure during a few recent 2024.1 patch release testing phases (I only checked dates of 2024.1 releases, since the issue occurred during 2024.1.11 testing).

Usually this testing runs on Sunday and we have 15-20 QA instances running (at the time the cloud usage report is collected), but on 2024-09-29, when this issue occurred, we had 31 QA instances running (and 43 instances total). So maybe it is just a coincidence that so many instances were running at the same time. Or maybe some new tests/configs were added for Azure and we need to increase the core quota.
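
For reference on the core math, based on the numbers in the error above: each Standard_L8s_v3 node has 8 vCPUs, so this 4-node cluster needs 4 × 8 = 32 standardLSv3Family cores. At failure time, 344 of the 350-core limit in eastus were already in use, leaving only 6 cores free, which is why even the first node could not be placed (Additional Required: 8, New Limit Required: 352). Fitting the whole cluster at that usage level would need a limit of at least 344 + 32 = 376 cores.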