scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

AMI artifact test fallback_provision_type, fails to find AMI on retry #5645

Open fruch opened 1 year ago

fruch commented 1 year ago

Issue description

It seems that the retry on a different AZ, introduced in https://github.com/scylladb/scylla-cluster-tests/pull/5123, isn't working as expected; it looks like the region is somehow lost:

08:51:33  < t:2023-01-05 06:51:32,489 f:cluster_aws.py  l:278  c:sdcm.cluster         p:ERROR > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Cannot create spot_low_price instance(s): Failed to get spot instances: capacity-not-available
08:51:33  < t:2023-01-05 06:51:32,491 f:cluster_aws.py  l:258  c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create on_demand instance(s)
08:51:33  < t:2023-01-05 06:51:32,623 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2023-01-05 06:51:32.621: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=17106724-750f-4ef1-bed5-f61cd951c0e0, source=ArtifactsTest.SetUp()
08:51:33  < t:2023-01-05 06:51:32,623 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > exception=An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-02643ff1876ab4973]' does not exist
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR > Exception in setUp. Will call tearDown
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR > Traceback (most recent call last):
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/tester.py", line 151, in wrapper
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     return method(*args, **kwargs)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/utils/decorators.py", line 116, in inner
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     res = func(*args, **kwargs)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/tester.py", line 739, in setUp
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     self.init_resources()
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/tester.py", line 1696, in init_resources
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     self.get_cluster_aws(loader_info=loader_info, db_info=db_info,
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/tester.py", line 1202, in get_cluster_aws
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     self.db_cluster = create_cluster(db_type)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/tester.py", line 1164, in create_cluster
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     return _create_auto_zone_scylla_aws_cluster()
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/tester.py", line 1139, in _create_auto_zone_scylla_aws_cluster
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     return ScyllaAWSCluster(
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 769, in __init__
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     super().__init__(
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster.py", line 3727, in __init__
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     super().__init__(*args, **kwargs)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 105, in __init__
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     super().__init__(cluster_uuid=cluster_uuid,
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster.py", line 3132, in __init__
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     self.add_nodes(num, dc_idx=dc_idx, enable_auto_bootstrap=self.auto_bootstrap)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 807, in add_nodes
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     added_nodes = super().add_nodes(
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 378, in add_nodes
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     instances = self._create_or_find_instances(count=count, ec2_user_data=ec2_user_data, dc_idx=dc_idx)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 372, in _create_or_find_instances
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     return self._create_instances(count, ec2_user_data, dc_idx)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 221, in _create_instances
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     instances = self.fallback_provision_type(count, interfaces, ec2_user_data, dc_idx)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 260, in fallback_provision_type
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     instances = self._create_on_demand_instances(count, interfaces, ec2_user_data, dc_idx)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/home/jenkins/slave/workspace/scylla-master/artifacts/artifacts-ami-test/scylla-cluster-tests/sdcm/cluster_aws.py", line 146, in _create_on_demand_instances
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     instances = self._ec2_services[dc_idx].create_instances(**params)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/boto3/resources/factory.py", line 520, in do_action
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     response = action(self, *args, **kwargs)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/boto3/resources/action.py", line 83, in __call__
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     response = getattr(parent.meta.client, operation_name)(*args, **params)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 386, in _api_call
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     return self._make_api_call(operation_name, kwargs)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 705, in _make_api_call
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR >     raise error_class(parsed_response, operation_name)
08:51:33  < t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR > botocore.exceptions.ClientError: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-02643ff1876ab4973]' does not exist

Installation details

Cluster size: 1 nodes (i3.2xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: ami-02643ff1876ab4973 (aws: us-east-1)

Test: artifacts-ami-test

Test id: c478a021-95e5-46ab-9646-641e1c831f47

Test name: scylla-master/artifacts/artifacts-ami-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor c478a021-95e5-46ab-9646-641e1c831f47`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=c478a021-95e5-46ab-9646-641e1c831f47)
- Show all stored logs command: `$ hydra investigate show-logs c478a021-95e5-46ab-9646-641e1c831f47`

## Logs:

- **sct-runner-c478a021.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c478a021-95e5-46ab-9646-641e1c831f47/20230105_065220/sct-runner-c478a021.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c478a021-95e5-46ab-9646-641e1c831f47/20230105_065220/sct-runner-c478a021.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/artifacts/job/artifacts-ami-test/993/)

fgelcer commented 1 year ago

@fruch, an AMI should be available in every AZ of the same region, right? With that in mind, does it mean the retry is changing the region, so the AMI is not found?

yarongilor commented 1 year ago

@fruch, @fgelcer, the issue doesn't look related to auto AZ. The test only looked at the AZs of one region, us-east-1, tried to get a spot and then an on-demand instance, and then failed with InvalidAMIID.NotFound. I reran the same job and it now runs OK: https://jenkins.scylladb.com/job/scylla-master/job/artifacts/job/artifacts-ami-test/994/

So I'm not sure why we got InvalidAMIID.NotFound. Could it be a race condition, where the test was triggered before the AMI reached the ready state, or some other AWS hiccup?

The logs had:

< t:2023-01-05 06:51:26,660 f:ec2_client.py   l:100  c:sdcm.ec2_client      p:INFO  > Sending spot request with params: {
...
, 'AvailabilityZoneGroup': 'us-east-1a'}
...
< t:2023-01-05 06:51:32,623 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > exception=An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-02643ff1876ab4973]' does not exist

The AMI details are:


Name | AMI ID | AMI name | Source | Owner | Visibility | Status | Creation date
-- | -- | -- | -- | -- | -- | -- | --
scylla-5.2.0-dev-x86_64-2023-01-05T08-26-06 | [ami-02643ff1876ab4973](https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#ImageDetails:imageId=ami-02643ff1876ab4973) | scylla-5.2.0-dev-x86_64-2023-01-05T08-26-06 | 797456418907/scylla-5.2.0-dev-x86_64-2023-01-05T08-26-06 | 797456418907 | Private | Available | 2023/01/05 08:32 GMT+2

fruch commented 1 year ago

> @fruch, @fgelcer, the issue doesn't look related to auto AZ. The test only looked at the AZs of one region, us-east-1, tried to get a spot and then an on-demand instance, and then failed with InvalidAMIID.NotFound. I reran the same job and it now runs OK: https://jenkins.scylladb.com/job/scylla-master/job/artifacts/job/artifacts-ami-test/994/

This doesn't prove anything; there was no capacity issue when you ran it.

> So I'm not sure why we got InvalidAMIID.NotFound. Could it be a race condition, where the test was triggered before the AMI reached the ready state, or some other AWS hiccup?

My guess is that the retry is not configuring the region correctly; I don't know why. The image exists in the region for sure, and the first attempt worked (it failed only because of capacity). This happens on multiple jobs, it's not a one-off.

fruch commented 1 year ago

> @fruch, an AMI should be available in every AZ of the same region, right? With that in mind, does it mean the retry is changing the region, so the AMI is not found?

I think so, though I don't know why... maybe it's using None and falling back to the worker's default region.
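
A minimal sketch of the failure mode guessed at above, assuming the retry path really ends up with no region. When no region is passed, boto3 resolves one from the environment or config of the machine it runs on (for example `AWS_DEFAULT_REGION`), so a request could silently target a region where the AMI ID does not exist. `resolve_region` and `require_region` are hypothetical helpers written for illustration, not SCT code, and the worker default below is made up:

```python
import os


def resolve_region(explicit_region, env=None):
    """Simplified model of a missing region falling back to the worker's default."""
    env = os.environ if env is None else env
    # Assumption: only AWS_DEFAULT_REGION is modelled; real resolution also reads config files.
    return explicit_region or env.get("AWS_DEFAULT_REGION")


def require_region(explicit_region):
    """Fail loudly instead of silently inheriting the worker's default region."""
    if not explicit_region:
        raise ValueError("region must be passed explicitly on every provisioning attempt")
    return explicit_region


worker_env = {"AWS_DEFAULT_REGION": "eu-west-1"}  # made-up worker default
first_attempt = resolve_region("us-east-1", worker_env)
retry_attempt = resolve_region(None, worker_env)
print(first_attempt, retry_attempt)
```

A guard like `require_region` in the retry path would turn a silently wrong region into an immediate, readable error.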

yarongilor commented 1 year ago

> My guess is that the retry is not configuring the region correctly; I don't know why. The image exists in the region for sure, and the first attempt worked (it failed only because of capacity). This happens on multiple jobs, it's not a one-off.

@fruch, not true: there was no AZ retry at all, only the provision_type retry. Log:

yarongilor@yarongilor:~/Downloads/logs/sct-runner-c478a021$ egrep -i "Cluster artifacts-ami-jenkins-db-cluster-c478a021|botocore.exceptions.ClientError:" sct.log 
< t:2023-01-05 06:51:19,355 f:cluster.py      l:3110 c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Init nodes
< t:2023-01-05 06:51:20,399 f:cluster_aws.py  l:371  c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Found no provisioned instances. Provision them.
< t:2023-01-05 06:51:20,399 f:cluster_aws.py  l:200  c:sdcm.cluster         p:DEBUG > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Passing user_data 'Content-Type: multipart/mixed; boundary="===============3446273212289870019=="
< t:2023-01-05 06:51:20,399 f:cluster_aws.py  l:213  c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create spot instance(s)
< t:2023-01-05 06:51:20,399 f:cluster_aws.py  l:242  c:sdcm.cluster         p:DEBUG > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Instances provision fallbacks : ['spot_duration', 'spot_low_price', 'on_demand']
< t:2023-01-05 06:51:20,399 f:cluster_aws.py  l:258  c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create spot_duration instance(s)
< t:2023-01-05 06:51:26,530 f:cluster_aws.py  l:278  c:sdcm.cluster         p:ERROR > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Cannot create spot_duration instance(s): Failed to get spot instances: capacity-not-available
< t:2023-01-05 06:51:26,531 f:cluster_aws.py  l:258  c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create spot_low_price instance(s)
< t:2023-01-05 06:51:32,489 f:cluster_aws.py  l:278  c:sdcm.cluster         p:ERROR > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Cannot create spot_low_price instance(s): Failed to get spot instances: capacity-not-available
< t:2023-01-05 06:51:32,491 f:cluster_aws.py  l:258  c:sdcm.cluster         p:INFO  > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create on_demand instance(s)
< t:2023-01-05 06:51:32,491 f:cluster_aws.py  l:134  c:sdcm.cluster         p:DEBUG > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Creating 1 on-demand instances using AMI id 'ami-02643ff1876ab4973'... 
< t:2023-01-05 06:51:32,622 f:tester.py       l:158  c:sdcm.tester          p:ERROR > botocore.exceptions.ClientError: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-02643ff1876ab4973]' does not exist
~/Downloads/logs/sct-runner-c478a021$ grep "Failed creating a Scylla AWS cluster" sct.log
~/Downloads/logs/sct-runner-c478a021$ 

So if this is a newly introduced SCT code issue, it should be around fallback_provision_type or similar.
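
The provision-type fallback seen in the log (`spot_duration`, then `spot_low_price`, then `on_demand`) can be sketched roughly as below. This is an illustration under assumptions, not the actual `fallback_provision_type` implementation: `fallback_provision`, `CapacityError`, and `fake_create` are hypothetical names, and the fake provisioner stands in for the real EC2 calls. The point is that every attempt receives and logs the same region and AMI ID, so a change between attempts would be visible.

```python
import logging

LOGGER = logging.getLogger("provision_sketch")


class CapacityError(Exception):
    """Stand-in for a 'capacity-not-available' spot failure."""


def fallback_provision(provision_types, create_instances, region, ami_id):
    """Try each provision type in order; return (type, instances) for the first success."""
    last_error = None
    for provision_type in provision_types:
        # Log the inputs of every attempt so a changed region or AMI id shows up in the log.
        LOGGER.info("Create %s instance(s): region=%s ami=%s", provision_type, region, ami_id)
        try:
            return provision_type, create_instances(provision_type, region, ami_id)
        except CapacityError as exc:
            LOGGER.error("Cannot create %s instance(s): %s", provision_type, exc)
            last_error = exc
    raise RuntimeError("all provision types failed") from last_error


def fake_create(provision_type, region, ami_id):
    """Fake provisioner: spot requests have no capacity, on-demand succeeds."""
    if provision_type.startswith("spot"):
        raise CapacityError("capacity-not-available")
    return [f"{provision_type}-instance-in-{region}"]


result = fallback_provision(
    ["spot_duration", "spot_low_price", "on_demand"],
    fake_create,
    "us-east-1",
    "ami-02643ff1876ab4973",
)
print(result)
```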

fruch commented 1 year ago

> @fruch, not true: there was no AZ retry at all, only the provision_type retry.

> So if this is a newly introduced SCT code issue, it should be around fallback_provision_type or similar.

You are correct. I saw _create_auto_zone_scylla_aws_cluster in the call stack and didn't have time to look into it further.

Anyhow, someone still needs to investigate and fix it.

fgelcer commented 1 year ago

My suggestion at this point is to add some prints, so that once we reproduce the issue we will have more information. It looks like the fallback either removed the AMI id, or the region was changed while falling back. I confirmed with @yaronkaikov that it is not possible that the AMI was not available, as the packer requires the AMI to be available before moving on (and calling the test).

@yarongilor, please add some extra logs if you still haven't found the root cause of this issue.
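
One possible shape for such extra diagnostics, as a sketch only: before the on-demand attempt, ask the same EC2 client whether it can see the AMI. `describe_images(ImageIds=[...])` is a real boto3 EC2 client call; everything else here (`ami_visible`, `FakeEC2Client`, `FakeNotFound`) is hypothetical and exists only so the example runs without AWS access. The assumption is that a client pointed at the wrong region would report `InvalidAMIID.NotFound`, which would separate a region problem from an AMI problem.

```python
def ami_visible(ec2_client, ami_id):
    """Return True if describe_images finds ami_id through this client."""
    try:
        response = ec2_client.describe_images(ImageIds=[ami_id])
    except Exception as exc:  # botocore raises ClientError; kept generic to avoid the dependency
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "InvalidAMIID.NotFound":
            return False
        raise
    return bool(response.get("Images"))


class FakeNotFound(Exception):
    """Mimics the error payload shape of botocore's ClientError."""

    def __init__(self):
        super().__init__("InvalidAMIID.NotFound")
        self.response = {"Error": {"Code": "InvalidAMIID.NotFound"}}


class FakeEC2Client:
    """Test double that only knows the AMIs of one region."""

    def __init__(self, known_amis):
        self.known_amis = set(known_amis)

    def describe_images(self, ImageIds):
        if not set(ImageIds) <= self.known_amis:
            raise FakeNotFound()
        return {"Images": [{"ImageId": image_id} for image_id in ImageIds]}


AMI_ID = "ami-02643ff1876ab4973"
us_east_1 = FakeEC2Client([AMI_ID])
other_region = FakeEC2Client([])
print(ami_visible(us_east_1, AMI_ID), ami_visible(other_region, AMI_ID))
```

Logging this result next to the client's configured region on each fallback step would show whether the on-demand attempt talks to the same region as the spot attempts.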

fgelcer commented 1 year ago

Logs were added (#5653), but AFAICT the issue did not reproduce. Moving this to on hold until we find a way to reproduce it.