Open fruch opened 2 years ago
There was an error even before:
14:13:17 Error when waiting for VM longevity-10gb-3h-master-db-node-eastus-2: (OperationPreempted) Operation execution has been preempted by a more recent operation.
14:13:17 Code: OperationPreempted
I'm not sure whether this is an SCT issue or rather an Azure platform/driver problem. There's not much information about the OperationPreempted
error, but from what I saw it looks like the resources got deleted during creation.
From https://docs.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-operationpreempted:
This error usually occurs when an in-progress create operation is interrupted by a subsequent delete operation that was issued before the create cluster operation is finished.
However, this relates to azure-kubernetes, and I'm not sure it's the same case.
It could be an early spot termination. Now that we set the AZ, we possibly have a lower chance of deploying the resources successfully.
Let's wait and see if this reproduces.
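To make it easier to tell whether this reproduces, the failing VM and the Azure error code can be pulled out of the job log automatically. Below is a minimal sketch that assumes the log line format quoted above; the function name `find_vm_errors` is hypothetical and is not part of SCT.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
#   "14:13:17 Error when waiting for VM <name>: (OperationPreempted) ..."
# The pattern is an assumption based on the log excerpt in this issue.
VM_ERROR_RE = re.compile(
    r"Error when waiting for VM (?P<vm>[^\s:]+): \((?P<code>\w+)\)"
)


def find_vm_errors(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (vm_name, error_code) pairs for every matching log line."""
    found = []
    for line in lines:
        match = VM_ERROR_RE.search(line)
        if match:
            found.append((match.group("vm"), match.group("code")))
    return found
```

Filtering the result for `OperationPreempted` across several job runs would show how often the failure recurs and whether it hits the same nodes.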
@soyacz it reproduced again, on the same master job: https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-10gb-3h-azure-test/71/
From the Azure audit logs, it looks like we hit a spot eviction during creation of the VMs. This has become more frequent recently because we limit resources to a single availability zone. We can try running this job in a different AZ to see whether it has more resources and less frequent spot evictions. We could also add the ability to run without specifying AZs, as we did before, to get better test stability (at an increased theoretical cost of $0.01/GB for transfer between AZs, although we still use the Azure free subscription). Another improvement would be to switch from creating machines one by one to using Azure Scale Sets; this won't increase stability, it will just fail faster when there aren't enough machines in the zone.
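The idea of trying a different AZ, or falling back to no AZ at all, could be sketched roughly as follows. This is only an illustration under stated assumptions: `ProvisionError`, `RETRYABLE_CODES`, `provision_with_zone_fallback`, and the `create_vm` callable are all hypothetical names, not SCT or Azure SDK APIs, and a zone value of `None` is assumed to mean "do not pin an availability zone".

```python
from typing import Callable, Optional, Sequence, TypeVar

VM = TypeVar("VM")


class ProvisionError(Exception):
    """Hypothetical error carrying an Azure-style error code."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"({code}) {message}")
        self.code = code


# Assumption: only preemption is worth retrying in another zone.
RETRYABLE_CODES = {"OperationPreempted"}


def provision_with_zone_fallback(
    create_vm: Callable[[str, Optional[str]], VM],
    name: str,
    zones: Sequence[Optional[str]],
) -> VM:
    """Try each zone in order; move on only for retryable error codes."""
    if not zones:
        raise ValueError("at least one zone (or None) is required")
    last_error: Optional[ProvisionError] = None
    for zone in zones:
        try:
            return create_vm(name, zone)
        except ProvisionError as exc:
            if exc.code not in RETRYABLE_CODES:
                raise  # unrelated failure: do not mask it
            last_error = exc
    assert last_error is not None
    raise last_error
```

Calling it with, for example, `zones=["1", "2", None]` would try the pinned zones first and only then fall back to an unpinned deployment, which keeps cross-AZ traffic as a last resort.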
Happened again on a master run.
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.2.0-dev-x86_64-2024-08-06T01-54-47
(azure: undefined_region)
Test: longevity-10gb-3h-azure-test
Test id: 4358ba9b-82d1-49d7-a299-a27540c43109
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Installation details
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.0-dev-x86_64-2022-08-29T11-23-01Z
(azure: eastus)
Test: longevity-10gb-3h-azure-test
Test id: e4aa60b7-3636-4cb0-a9db-17e76e8497fc
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
Issue description
During test setup, the NIC could not be found.
$ hydra investigate show-monitor e4aa60b7-3636-4cb0-a9db-17e76e8497fc
$ hydra investigate show-logs e4aa60b7-3636-4cb0-a9db-17e76e8497fc
Logs:
Jenkins job URL