scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Azure 3h longevity - NIC NotFound #5198

Open fruch opened 2 years ago

fruch commented 2 years ago

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.0-dev-x86_64-2022-08-29T11-23-01Z (azure: eastus)

Test: longevity-10gb-3h-azure-test
Test id: e4aa60b7-3636-4cb0-a9db-17e76e8497fc
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):

Issue description

During test setup, the NIC could not be found:

2022-08-29 12:28:10.907: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=300eafb8-afb4-460c-a0ed-8644e453bcb3, source=LongevityTest.SetUp()
exception=(NotFound) Resource /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCT-e4aa60b7-3636-4cb0-a9db-17e76e8497fc-eastus/providers/Microsoft.Network/networkInterfaces/longevity-10gb-3h-master-db-node-eastus-5-nic not found.
Code: NotFound
Message: Resource /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCT-e4aa60b7-3636-4cb0-a9db-17e76e8497fc-eastus/providers/Microsoft.Network/networkInterfaces/longevity-10gb-3h-master-db-node-eastus-5-nic not found.

Logs:

Jenkins job URL

soyacz commented 2 years ago

There was an error even before that:

14:13:17  Error when waiting for VM longevity-10gb-3h-master-db-node-eastus-2: (OperationPreempted) Operation execution has been preempted by a more recent operation.
14:13:17  Code: OperationPreempted

I'm not sure whether this is an SCT issue or rather an Azure platform/driver problem. There isn't much information about the OperationPreempted error, but from what I saw, it looks like resources got deleted during creation.

From https://docs.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-operationpreempted:

This error usually occurs when an in-progress create operation is interrupted by a subsequent delete operation that was issued before the create cluster operation is finished.

However, that page is about azure-kubernetes, and I'm not sure the same applies here. It could be an early spot termination. Possibly, now that we set an AZ, we have a lower chance of deploying resources successfully. Let's wait and see if this reproduces.
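If OperationPreempted does turn out to be transient, one possible mitigation is to retry only that error code and let everything else (for example NotFound) fail immediately. The following is a minimal sketch under stated assumptions, not SCT code: `ProvisionError`, `RETRYABLE_CODES` and `provision_with_retry` are hypothetical names, and `ProvisionError` is only a stand-in for whatever exception the cloud SDK raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: only this code is treated as transient; NotFound is not.
RETRYABLE_CODES = {"OperationPreempted"}


class ProvisionError(Exception):
    """Hypothetical stand-in for an SDK error that carries an error code."""

    def __init__(self, code: str, message: str = ""):
        super().__init__(f"({code}) {message}")
        self.code = code


def provision_with_retry(
    create: Callable[[], T],
    attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `create` until it succeeds, retrying only retryable error codes."""
    for attempt in range(1, attempts + 1):
        try:
            return create()
        except ProvisionError as exc:
            # Re-raise non-retryable errors, and the last retryable one.
            if exc.code not in RETRYABLE_CODES or attempt == attempts:
                raise
            # Linear backoff: wait longer after each failed attempt.
            sleep(delay * attempt)
    raise ValueError("attempts must be >= 1")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting. Whether a retry is actually safe depends on the preempted resources having been cleaned up first, which this sketch does not check.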

fruch commented 2 years ago

@soyacz it reproduced again, on the same master job: https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-10gb-3h-azure-test/71/

soyacz commented 2 years ago

From the Azure audit logs, it looks like we hit spot evictions while the VMs were being created. This has become more frequent recently because we limit resources to a single availability zone. We can try running this job in a different AZ to see whether it has more capacity and fewer spot evictions. We could also add the ability to run without specifying AZs, as we did before, to get better test stability (at an increased theoretical cost of $0.01/GB for transfer between AZs, although we still use the Azure free subscription). Another improvement would be to switch from creating machines one by one to using Azure Scale Sets. That won't increase stability, but it will fail faster when there aren't enough machines in the zone.
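The idea of trying a different AZ, and of optionally running without one, can be sketched as an ordered fallback. This is only an illustration under assumptions: `CapacityError`, `provision_with_zone_fallback` and the `create_in_zone` callable are hypothetical names, not SCT or Azure SDK APIs, and `None` stands for "no AZ specified".

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


class CapacityError(Exception):
    """Hypothetical: raised when a zone cannot supply the requested VMs."""


def provision_with_zone_fallback(
    create_in_zone: Callable[[Optional[str]], T],
    zones: Sequence[str],
    allow_no_zone: bool = True,
) -> Tuple[Optional[str], T]:
    """Try each zone in order; optionally fall back to no zone at all."""
    candidates: List[Optional[str]] = list(zones)
    if allow_no_zone:
        # None means "let the platform place the VMs in any zone".
        candidates.append(None)

    last_error: Optional[Exception] = None
    for zone in candidates:
        try:
            return zone, create_in_zone(zone)
        except CapacityError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    raise RuntimeError(f"no capacity in any of {candidates}") from last_error
```

Returning the zone that succeeded alongside the result lets the caller record where the cluster actually landed, which matters if cross-AZ transfer cost is a concern.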

fruch commented 2 months ago

Happened again on a master run.

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.2.0-dev-x86_64-2024-08-06T01-54-47 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test
Test id: 4358ba9b-82d1-49d7-a299-a27540c43109
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 4358ba9b-82d1-49d7-a299-a27540c43109`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=4358ba9b-82d1-49d7-a299-a27540c43109)
- Show all stored logs command: `$ hydra investigate show-logs 4358ba9b-82d1-49d7-a299-a27540c43109`

## Logs:

- **db-cluster-4358ba9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/db-cluster-4358ba9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/db-cluster-4358ba9b.tar.gz)
- **sct-runner-events-4358ba9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/sct-runner-events-4358ba9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/sct-runner-events-4358ba9b.tar.gz)
- **sct-4358ba9b.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/sct-4358ba9b.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/sct-4358ba9b.log.tar.gz)
- **loader-set-4358ba9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/loader-set-4358ba9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/loader-set-4358ba9b.tar.gz)
- **monitor-set-4358ba9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/monitor-set-4358ba9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/monitor-set-4358ba9b.tar.gz)
- **parallel-timelines-report-4358ba9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/parallel-timelines-report-4358ba9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/4358ba9b-82d1-49d7-a299-a27540c43109/20240805_234329/parallel-timelines-report-4358ba9b.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-10gb-3h-azure-test/434/)
[Argus](https://argus.scylladb.com/test/aa05fe67-89bd-4e30-88db-5e2b2e3d986d/runs?additionalRuns[]=4358ba9b-82d1-49d7-a299-a27540c43109)