redhat-developer / mapt

Multi Architecture Provisioning Tool
Apache License 2.0

[BUG] Spot instance type recommendation not available on recommended region #313

Open adrianriobo opened 1 month ago

adrianriobo commented 1 month ago

Some checks are missing when looking for the best spot price / machine type:

We can see:

DEBU Based on avg prices for instance types [m4.large m5.large m5a.large m5ad.large m5d.large m5dn.large m5n.large m5zn.large m6a.large m6i.large m6id.large m6idn.large m6in.large m7a.large m7i-flex.large] is az eu-west-2b, current avg price is 0.05 and max price is 0.05 with a score of 9 
INFO @ updating.............                      
INFO  +  rh:qe:aws:bso main-bso-bso creating (0s)  
INFO  +  rh:qe:aws:bso main-bso-bso created       
INFO  +  pulumi:pulumi:Stack debug-fedora-spotOption-debug-fedora created (10s)  
INFO Outputs:                                     
INFO     avg   : 0.0452                           
INFO     az    : "eu-west-2b"                     
INFO     max   : 0.0452                           
INFO     region: "eu-west-2"                      
INFO     score : 9                                

But when we try to use the recommended machine type we get:

Diagnostics:
  aws:ec2:Eip (eip-publicmain-afd-net):
    warning: urn:pulumi:stackFedoraBaremetal-debug-fedora::debug-fedora::aws:ec2/eip:Eip::eip-publicmain-afd-net verification warning: use domain attribute instead

  pulumi:pulumi:Stack (debug-fedora-stackFedoraBaremetal-debug-fedora):
    error: update failed

  aws:autoscaling:Group (main-afd-asg):
    error: 1 error occurred:
        * creating Auto Scaling Group (main-afd-asg-89ce230): operation error Auto Scaling: CreateAutoScalingGroup, https response error StatusCode: 400, RequestID: 0175d099-0523-431f-9dee-a39975af885d, api error ValidationError: The specified instance type m5zn.large is not valid

Resources:
    + 20 created

Duration: 2m45s
adrianriobo commented 1 month ago

This is partially fixed by https://github.com/redhat-developer/mapt/pull/314/commits/77281e5c39060077c64f53ca8bf3914bccf09c6b, but there are still inconsistencies across machine types and regions (e.g. Windows on AWS does not use it at all).

There is an option to use metadata (machine specs instead of actual types) for spot price searches and for autoscaling groups. We may need to consider whether we can make use of it on one side, or even on both.
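
Independent of the metadata option, one possible interim check is to drop candidate types that the chosen availability zone does not offer before computing the score. The following is a minimal sketch in Go, assuming the per-zone offerings have already been fetched (for example through the EC2 `DescribeInstanceTypeOfferings` API). The helper `filterOffered` and the sample offerings are hypothetical and are not taken from mapt.

```go
package main

import "fmt"

// filterOffered keeps only the candidate instance types present in the set
// of types offered in the target availability zone.
// Hypothetical helper: mapt's real code may be structured differently.
func filterOffered(candidates []string, offered map[string]bool) []string {
	valid := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if offered[c] {
			valid = append(valid, c)
		}
	}
	return valid
}

func main() {
	candidates := []string{"m5.large", "m5zn.large", "m6i.large"}
	// Illustrative data only, not the real offerings of any zone.
	offered := map[string]bool{"m5.large": true, "m6i.large": true}
	fmt.Println(filterOffered(candidates, offered))
}
```

With a check like this, the average price and score would be computed only over types that can actually be launched in the zone.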

adrianriobo commented 3 weeks ago

Running some pipelines trying to provision Fedora with arm64 on Azure, I got a similar issue:

DEBU Best spot price option found: &{standard_d16ps_v5 westus 0.095433} 
INFO @ updating....                               
INFO  +  azure-native:resources:ResourceGroup fedora-als-rg creating (0s)  
INFO  +  tls:index:PrivateKey fedora-als-privatekey-user creating (0s)  
INFO @ updating.....                              
INFO  +  azure-native:resources:ResourceGroup fedora-als-rg created (1s)  
INFO  +  azure-native:network:PublicIPAddress fedora-als-pip creating (0s)  
INFO  +  azure-native:network:VirtualNetwork fedora-als-vn creating (0s)  
INFO @ updating....                               
INFO  +  tls:index:PrivateKey fedora-als-privatekey-user created (2s)  
INFO @ updating.....                              
INFO  +  azure-native:network:PublicIPAddress fedora-als-pip created (3s)  
INFO @ updating......                             
INFO  +  azure-native:network:VirtualNetwork fedora-als-vn created (5s)  
INFO  +  azure-native:network:Subnet fedora-als-sn creating (0s)  
INFO @ updating.......                            
INFO  +  azure-native:network:Subnet fedora-als-sn created (4s)  
INFO  +  azure-native:network:NetworkInterface fedora-als-ni creating (0s)  
INFO @ updating.....                              
INFO  +  azure-native:network:NetworkInterface fedora-als-ni created (2s)  
INFO  +  azure-native:compute:VirtualMachine fedora-als-vm creating (0s)  
INFO @ updating.................................... 
INFO  +  azure-native:compute:VirtualMachine fedora-als-vm creating (32s) error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in westus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference" 
INFO  +  azure-native:compute:VirtualMachine fedora-als-vm **creating failed** error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in westus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference" 
INFO  +  pulumi:pulumi:Stack fedora-stackAzureLinux-fedora creating (48s) error: update failed 
INFO  +  pulumi:pulumi:Stack fedora-stackAzureLinux-fedora **creating failed (48s)** 1 error 
INFO Diagnostics:                                 
INFO   azure-native:compute:VirtualMachine (fedora-als-vm): 
INFO     error: Code="GalleryImageNotFound" Message="\"The gallery image /CommunityGalleries/Fedora-5e266ba4-2250-406d-adad-5d73860d958f/Images/Fedora-Cloud-40-Arm64/Versions/latest is not available in westus region. Please contact image owner to replicate to this region, or change your requested region.\"" Target="imageReference" 
INFO                                              
INFO   pulumi:pulumi:Stack (fedora-stackAzureLinux-fedora): 
anjannath commented 1 week ago

So for AWS we need to check the spot prices based on the VM specs instead of directly using EC2-specific VM type names. For Azure I think we'll need another filter: first we find the VM candidates, then we filter them again based on whether the image for the requested OS is available in the selected location, and finally we return the result.
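
The two-stage Azure filter described above could be sketched roughly as follows. This is only an illustration in Go: the `spotOption` type mirrors the three fields printed in the debug log (VM size, location, price), while `bestWithImage`, the `eastus` candidate and the image availability set are hypothetical. It assumes the locations where the image is replicated have been looked up beforehand.

```go
package main

import (
	"fmt"
	"sort"
)

// spotOption is a hypothetical type holding the fields shown in the log.
type spotOption struct {
	VMSize   string
	Location string
	Price    float64
}

// bestWithImage returns the cheapest candidate whose location has the
// requested image. imageLocations is assumed to be fetched beforehand.
func bestWithImage(candidates []spotOption, imageLocations map[string]bool) (spotOption, bool) {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]spotOption(nil), candidates...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Price < sorted[j].Price })
	for _, c := range sorted {
		if imageLocations[c.Location] {
			return c, true
		}
	}
	return spotOption{}, false
}

func main() {
	candidates := []spotOption{
		{"standard_d16ps_v5", "westus", 0.095433}, // from the log above
		{"standard_d16ps_v5", "eastus", 0.11},     // made-up price
	}
	// Illustrative availability only.
	imageLocations := map[string]bool{"eastus": true}
	if best, ok := bestWithImage(candidates, imageLocations); ok {
		fmt.Println(best.VMSize, best.Location, best.Price)
	}
}
```

Returning a boolean alongside the option lets the caller report clearly when no location satisfies both the price search and the image requirement.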

For Azure I also noticed that some regions don't support Resource Groups, which is a hard requirement for mapt as of now, so by default we can limit the spot search to the regions that do support Resource Groups:

Diagnostics:
  azure-native:resources:ResourceGroup (az-ghrunner-awd-rg):
    error: autorest/azure: Service returned an error. Status=400 Code="LocationNotAvailableForResourceGroup" Message="The provided location 'southafricawest' is not available for resource group. List of available regions is 'eastasia,southeastasia,australiaeast,australiasoutheast,brazilsouth,canadacentral,canadaeast,switzerlandnorth,germanywestcentral,eastus2,eastus,centralus,northcentralus,francecentral,uksouth,ukwest,centralindia,southindia,jioindiawest,italynorth,japaneast,japanwest,koreacentral,koreasouth,mexicocentral,northeurope,norwayeast,polandcentral,qatarcentral,spaincentral,swedencentral,uaenorth,westcentralus,westeurope,westus2,westus,southcentralus,westus3,southafricanorth,australiacentral,australiacentral2,israelcentral,westindia,newzealandnorth'."
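
Limiting the spot search to regions that accept resource groups could look roughly like the sketch below. It is written in Go under stated assumptions: the supported regions are given as a comma-separated list in the same form as the error message (shortened here to three entries from it), and `parseRegions` and `limitToResourceGroupRegions` are hypothetical names, not mapt functions.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRegions turns a comma-separated region list into a set.
func parseRegions(list string) map[string]bool {
	set := make(map[string]bool)
	for _, r := range strings.Split(list, ",") {
		if r = strings.TrimSpace(r); r != "" {
			set[r] = true
		}
	}
	return set
}

// limitToResourceGroupRegions drops regions where a resource group
// cannot be created, preserving the input order.
func limitToResourceGroupRegions(regions []string, rgRegions map[string]bool) []string {
	kept := make([]string, 0, len(regions))
	for _, r := range regions {
		if rgRegions[r] {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	// Shortened excerpt of the list reported in the error message.
	rgRegions := parseRegions("eastus,westus,uksouth")
	search := []string{"westus", "southafricawest", "uksouth"}
	fmt.Println(limitToResourceGroupRegions(search, rgRegions))
}
```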
adrianriobo commented 1 week ago

Yeah, seems reasonable. In my partial fix for AWS I applied the approach you suggest for Azure, so if we want to change everything there (meaning using the specs directly instead of checking spot by machine type), we can track it as a separate issue (enhancement).

On the Azure side, we should definitely apply the second filter and remove the regions that do not support Resource Groups.

Just as a side note on this last point: I do not want to complicate things, but if I am not wrong, nothing prevents the resource group from being in a different region than the resources it groups.
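
If that option were explored, the choice of resource group location could be decoupled from the VM location along these lines. This is a sketch in Go only: `resourceGroupLocation`, the fallback region and the sample set are hypothetical, and whether mapt should behave this way is an open design question.

```go
package main

import "fmt"

// resourceGroupLocation returns the VM location when it can host a resource
// group, and a fallback location otherwise. Hypothetical helper.
func resourceGroupLocation(vmLocation, fallback string, rgRegions map[string]bool) string {
	if rgRegions[vmLocation] {
		return vmLocation
	}
	return fallback
}

func main() {
	// Illustrative subset of regions that accept resource groups.
	rgRegions := map[string]bool{"southafricanorth": true, "westus": true}
	fmt.Println(resourceGroupLocation("southafricawest", "southafricanorth", rgRegions))
	fmt.Println(resourceGroupLocation("westus", "southafricanorth", rgRegions))
}
```

The trade-off is that the resource group metadata would live in a different region than the VM, which keeps more regions available for the spot search at the cost of a slightly more complex setup.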