scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
55 stars 93 forks source link

[Azure] Provision errors not propagated as events #6415

Open soyacz opened 1 year ago

soyacz commented 1 year ago

Issue description

Capacity error during provisioning stage was not published as Event:

06:09:14  sdcm.provision.provisioner.ProvisionError: (ZonalAllocationFailed) Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance
06:09:14  Code: ZonalAllocationFailed
06:09:14  Message: Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance
06:09:14  Error when waiting for VM sct-runner-1.6-instance-46d8a2c3: (ZonalAllocationFailed) Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance
06:09:14  Code: ZonalAllocationFailed
06:09:14  Message: Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance

Impact

making investigation necessary and taking our time

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 4 nodes (Standard_L8s_v3)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (NO RUNNER: NO RUNNER)

Test: rolling-upgrade-azure-image-test Test id: 46d8a2c3-b64c-420c-bc13-ea9426ffe49c Test name: scylla-master/rolling-upgrade/rolling-upgrade-azure-image-test Test config file(s):

Logs and commands - Restore Monitor Stack command: `$ hydra investigate show-monitor 46d8a2c3-b64c-420c-bc13-ea9426ffe49c` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=46d8a2c3-b64c-420c-bc13-ea9426ffe49c) - Show all stored logs command: `$ hydra investigate show-logs 46d8a2c3-b64c-420c-bc13-ea9426ffe49c` ## Logs: *No logs captured during this run.* [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-azure-image-test/30/) [Argus](https://argus.scylladb.com/test/80412f0e-69be-4414-9913-9e124760b7b1/runs?additionalRuns[]=46d8a2c3-b64c-420c-bc13-ea9426ffe49c)
fruch commented 1 year ago

we don't have events system working in the provision step or the create runner step.

regardless we should do the call to argus to update about the failure, which seems we are not doing now.