Open fruch opened 1 year ago
@fruch , an AMI should be available across all AZs in the same region... right? With that in mind, it means the retry is changing the region, so the AMI is not found?
@fruch , @fgelcer , the issue doesn't look related to auto AZ. The test only targeted a specific region (us-east-1) and its AZs, tried to get either a spot or an on-demand instance, and then failed with InvalidAMIID.NotFound. I reran the same job and it now runs OK: https://jenkins.scylladb.com/job/scylla-master/job/artifacts/job/artifacts-ami-test/994/
So I'm not sure why the InvalidAMIID.NotFound happened. Is it some kind of race condition, with the AMI not reaching the ready state before the test is triggered, or some other AWS hiccup?
The logs had:
```
< t:2023-01-05 06:51:26,660 f:ec2_client.py l:100 c:sdcm.ec2_client p:INFO > Sending spot request with params: {
...
, 'AvailabilityZoneGroup': 'us-east-1a'}
...
< t:2023-01-05 06:51:32,623 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:ERROR > exception=An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-02643ff1876ab4973]' does not exist
```
The AMI details are:
| Name | AMI ID | AMI name | Source | Owner | Visibility | Status | Creation date |
| -- | -- | -- | -- | -- | -- | -- | -- |
| scylla-5.2.0-dev-x86_64-2023-01-05T08-26-06 | [ami-02643ff1876ab4973](https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#ImageDetails:imageId=ami-02643ff1876ab4973) | scylla-5.2.0-dev-x86_64-2023-01-05T08-26-06 | 797456418907/scylla-5.2.0-dev-x86_64-2023-01-05T08-26-06 | 797456418907 | Private | Available | 2023/01/05 08:32 GMT+2 |
> the issue doesn't look related to auto AZ. [...] I reran the same job and it now runs OK in: https://jenkins.scylladb.com/job/scylla-master/job/artifacts/job/artifacts-ami-test/994/

This doesn't prove anything; there was no capacity issue when you ran it.

> so I'm not sure why the InvalidAMIID.NotFound. [...]

My guess is that the retry is not configuring the region correctly. I don't know why, but the image definitely exists in the region, and the first attempt worked (it failed because of capacity). This has happened on multiple jobs; it's not a one-off.
> an AMI should be available across all AZs in the same region... right? With that in mind, it means the retry is changing the region, so the AMI is not found?

I think yes; I don't know why... maybe it's using None and falling back to the worker's default region.
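If the retry really does pass `None` as the region, the client would silently resolve the region from the worker's environment, sending RunInstances to a region where the AMI does not exist. A minimal pure-Python sketch of that suspected failure mode (all names here are hypothetical, not SCT's actual code):

```python
# Hypothetical sketch: a client factory that silently falls back to a
# default region when None is passed, mimicking how boto3 resolves
# region_name=None from the worker's environment/config.
AMIS_BY_REGION = {"us-east-1": {"ami-02643ff1876ab4973"}}  # AMI only exists here
WORKER_DEFAULT_REGION = "eu-west-1"                        # worker's fallback

def resolve_region(region=None):
    # With region=None, the effective region is whatever the worker defaults to.
    return region or WORKER_DEFAULT_REGION

def run_instances(ami_id, region=None):
    effective_region = resolve_region(region)
    if ami_id not in AMIS_BY_REGION.get(effective_region, set()):
        raise LookupError(
            f"InvalidAMIID.NotFound: '{ami_id}' does not exist in {effective_region}"
        )
    return f"launched {ami_id} in {effective_region}"

# First attempt: region threaded through correctly -> works.
print(run_instances("ami-02643ff1876ab4973", region="us-east-1"))

# Retry path that "forgot" the region -> same error as in sct.log.
try:
    run_instances("ami-02643ff1876ab4973")  # region=None, falls back
except LookupError as err:
    print(err)
```

This reproduces the symptom exactly: the AMI exists and is Available, yet the call fails with InvalidAMIID.NotFound because the lookup lands in the wrong region.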
> my guess is that the retry is not configuring the region correctly [...] this has happened on multiple jobs, it's not a one-off

@fruch , not true, there was no retry at all for AZs, only the provision_type retry. Log:
```
yarongilor@yarongilor:~/Downloads/logs/sct-runner-c478a021$ egrep -i "Cluster artifacts-ami-jenkins-db-cluster-c478a021|botocore.exceptions.ClientError:" sct.log
< t:2023-01-05 06:51:19,355 f:cluster.py l:3110 c:sdcm.cluster p:INFO > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Init nodes
< t:2023-01-05 06:51:20,399 f:cluster_aws.py l:371 c:sdcm.cluster p:INFO > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Found no provisioned instances. Provision them.
< t:2023-01-05 06:51:20,399 f:cluster_aws.py l:200 c:sdcm.cluster p:DEBUG > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Passing user_data 'Content-Type: multipart/mixed; boundary="===============3446273212289870019=="
< t:2023-01-05 06:51:20,399 f:cluster_aws.py l:213 c:sdcm.cluster p:INFO > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create spot instance(s)
< t:2023-01-05 06:51:20,399 f:cluster_aws.py l:242 c:sdcm.cluster p:DEBUG > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Instances provision fallbacks : ['spot_duration', 'spot_low_price', 'on_demand']
< t:2023-01-05 06:51:20,399 f:cluster_aws.py l:258 c:sdcm.cluster p:INFO > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create spot_duration instance(s)
< t:2023-01-05 06:51:26,530 f:cluster_aws.py l:278 c:sdcm.cluster p:ERROR > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Cannot create spot_duration instance(s): Failed to get spot instances: capacity-not-available
< t:2023-01-05 06:51:26,531 f:cluster_aws.py l:258 c:sdcm.cluster p:INFO > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create spot_low_price instance(s)
< t:2023-01-05 06:51:32,489 f:cluster_aws.py l:278 c:sdcm.cluster p:ERROR > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Cannot create spot_low_price instance(s): Failed to get spot instances: capacity-not-available
< t:2023-01-05 06:51:32,491 f:cluster_aws.py l:258 c:sdcm.cluster p:INFO > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Create on_demand instance(s)
< t:2023-01-05 06:51:32,491 f:cluster_aws.py l:134 c:sdcm.cluster p:DEBUG > Cluster artifacts-ami-jenkins-db-cluster-c478a021 (AMI: ['ami-02643ff1876ab4973'] Type: i3.2xlarge): Creating 1 on-demand instances using AMI id 'ami-02643ff1876ab4973'...
< t:2023-01-05 06:51:32,622 f:tester.py l:158 c:sdcm.tester p:ERROR > botocore.exceptions.ClientError: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-02643ff1876ab4973]' does not exist
~/Downloads/logs/sct-runner-c478a021$ grep "Failed creating a Scylla AWS cluster" sct.log
~/Downloads/logs/sct-runner-c478a021$
```
So if it is a matter of a newly introduced SCT code issue, it should be around `fallback_provision_type` or similar.
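The log sequence above matches a provision-type fallback loop: spot_duration and spot_low_price both fail on capacity, then the on_demand request actually reaches AWS and dies on InvalidAMIID.NotFound. A sketch of such a loop (hypothetical names, not the actual cluster_aws.py code) shows where the parameters must be threaded through each fallback:

```python
class CapacityError(Exception):
    """Stands in for 'Failed to get spot instances: capacity-not-available'."""

def create_instances(provision_type, params):
    # Simulate what sct.log shows: both spot flavours hit capacity,
    # then the on_demand request is actually sent to AWS.
    if provision_type in ("spot_duration", "spot_low_price"):
        raise CapacityError(f"{provision_type}: capacity-not-available")
    return f"on_demand launched with AMI {params['ami_id']} in {params['region']}"

def provision_with_fallbacks(params,
                             fallbacks=("spot_duration", "spot_low_price", "on_demand")):
    errors = []
    for provision_type in fallbacks:
        try:
            # Key point: every fallback must receive the *same* params
            # (AMI id and region). If a fallback rebuilds params and drops
            # the region, the request lands in the wrong region and AWS
            # answers with InvalidAMIID.NotFound even for a valid AMI.
            return create_instances(provision_type, params)
        except CapacityError as err:
            errors.append(str(err))
    raise RuntimeError(f"all provision types failed: {errors}")

print(provision_with_fallbacks({"ami_id": "ami-02643ff1876ab4973",
                                "region": "us-east-1"}))
```

In this sketch the params dict survives every iteration, so the final on_demand attempt succeeds; the observed failure suggests the real fallback path loses or mangles one of these fields between attempts.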
> not true, there was no retry at all for AZs, only the provision_type retry [...] So if it is a matter of a new sct code issue introduced, it should be around fallback_provision_type or similar.
You are correct; I saw `_create_auto_zone_scylla_aws_cluster` in the call stack, but didn't have time to look into it further.
Anyhow, someone still needs to investigate/fix it.
My suggestion at this point is to add some prints, and once we reproduce the issue we will have more information... it looks like the fallback either dropped the AMI id, or the region was changed while falling back...
I confirmed with @yaronkaikov that it is not possible that the AMI was unavailable, as packer requires the AMI to be available before moving on (and calling the test).
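That "wait until the AMI is Available before proceeding" behavior can be expressed as a simple poll loop (boto3 also ships a ready-made `image_available` waiter for exactly this). The `describe_image_state` callable below is a hypothetical stand-in for a real DescribeImages call:

```python
import time

def wait_for_ami(describe_image_state, ami_id, timeout=600, poll_interval=15):
    """Poll until the AMI reaches the 'available' state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if describe_image_state(ami_id) == "available":
            return True
        time.sleep(poll_interval)
    raise TimeoutError(f"AMI {ami_id} not available after {timeout}s")

# Fake DescribeImages backend: 'pending' for the first two calls, then 'available'.
states = iter(["pending", "pending", "available"])
print(wait_for_ami(lambda ami: next(states), "ami-02643ff1876ab4973", poll_interval=0.01))
```

Since packer already gates on this state before the test is triggered, a race on AMI readiness is effectively ruled out, which points the investigation back at the region handling in the fallback path.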
@yarongilor , please add some extra logs if you still haven't found the root cause of this issue.
Logs were added (#5653), but AFAICT the issue did not reproduce. Moving this to on-hold until we find a way to reproduce it.
## Issue description
It seems the retry on a different AZ introduced in https://github.com/scylladb/scylla-cluster-tests/pull/5123 isn't working as expected; it seems to be missing the region somehow:
## Installation details
- Cluster size: 1 node (i3.2xlarge)
- Scylla Nodes used in this run: no resources left at the end of the run
- OS / Image: `ami-02643ff1876ab4973` (aws: us-east-1)
- Test: `artifacts-ami-test`
- Test id: `c478a021-95e5-46ab-9646-641e1c831f47`
- Test name: `scylla-master/artifacts/artifacts-ami-test`
- Test config file(s):
## Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor c478a021-95e5-46ab-9646-641e1c831f47`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=c478a021-95e5-46ab-9646-641e1c831f47)
- Show all stored logs command: `$ hydra investigate show-logs c478a021-95e5-46ab-9646-641e1c831f47`

## Logs:
- **sct-runner-c478a021.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c478a021-95e5-46ab-9646-641e1c831f47/20230105_065220/sct-runner-c478a021.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c478a021-95e5-46ab-9646-641e1c831f47/20230105_065220/sct-runner-c478a021.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/artifacts/job/artifacts-ami-test/993/)