openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

OCPBUGS-44925: aws: add ec2:AllocateAddress perm requirement. #9234

Open r4f4 opened 3 days ago

r4f4 commented 3 days ago

It's needed by CAPA when Ipv4Pools are supplied.

openshift-ci-robot commented 3 days ago

@r4f4: This pull request references Jira Issue OCPBUGS-44925, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.18.0) matches configured target version for branch (4.18.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci[bot] commented 3 days ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from r4f4. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
- **[pkg/asset/installconfig/aws/OWNERS](https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

r4f4 commented 3 days ago

/label platform/aws

r4f4 commented 2 days ago

/hold

We are still missing another permission: ec2:AssociateAddress

time="2024-11-22T22:27:02Z" level=debug msg="E1122 22:27:02.785017     333 awsmachine_controller.go:543] \"Failed to reconcile BYO Public IPv4\" err=<"
time="2024-11-22T22:27:02Z" level=debug msg="\tfailed to reconcile Elastic IP: failed to associate Elastic IP \"eipalloc-0d9343e1bba507e66\" to instance \"i-0c664692a59d18dc5\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-ty3fb51s-e9af7-minimal-perm is not authorized to perform: ec2:AssociateAddress on resource: arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-0d9343e1bba507e66 because no identity-based policy allows the ec2:AssociateAddress action. Encoded authorization failure message: jAhD8NZ_EO7bmq89o1YFJPdSyOJL0KSBMQh9r8DDvmX1GOASHHe1scQITr_dIA5P2rKWAPT-a54UTIind4Pqh4z4x-vXRLRk-k0Vq4u61G2CalS22C-Vw_oQhmiITgr9llWVtLP0SwsKYMT0uWxOlvlfqmwZ8BNw3bcgzP8W2N8wZnwB6pDW5BoPg7Zx-OgPd3rth36YPMawV8RW1B-LUY4aVsfWUmZfwfQXChsDesd39LClcPExlFh__cV8hwF4TYHJDruc6vqtwSdFhTyCq3ibWNAlutg-3ptOEM7zRx33USs4uTqLxdYLj4n-AaPdtj-ishlFEh0aZiyl6QmBvaecUTq4v2hUwyAssKdlwZIpjv7zoRYBw59qrBiksPkTQDOP-3cnLxIix6ZwX0nkDwCR3qG5ZwppzRAPpMYgOU03Uo9r3RMbB_pr9h0b6amdBBOilkYmnHIAk8_vWBvhBoBXblPc4LgbUv-ZB62g0oKM0GqwNJPp8JOaFMMSrL82cf2hxZ_a1Bv4sf2WwIoE7HY23Su7dE_KE8jwmhchRMPmb4nRVlyED-Vb39Tn14CZeWt4WFYZb2F6XBRXixuqCvcC-vxf2StrnUvlfczQA_bw1GqV8_0_6kvxAQvxOU7zCId4lQ3-cpCcfGh5Qeh3UwX5D1dDzeKCpqXbCnjT5mhn35Ani7CK7XpGTOWzK5VZu7unuau_n2L5292OQu2xbNPwgJTYpf_7nFwPRYVjE6RM_ZCU65TAJ_umlRpKbERYoahrBEpcJCVB2Z3WSzaaMfHvUYvOY8fKv6SglOCJjphoyLn70jkfZjLY5FuvxBwrlUfCqbPMFWn514b1ZY01o--5-v77NAQtIHmLaAvLv_pU2wKKa9g9qsjxnCPUMLIFxmgCUyKDT7nbBK9PUPipa4I8EEzWnw"
time="2024-11-22T22:27:02Z" level=debug msg="\t\tstatus code: 403, request id: b2c73727-f7fb-4ffd-b812-9484cae2ac11"
time="2024-11-22T22:27:02Z" level=debug msg=" >"
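
When working through a minimal-permission failure like the one above, it can help to pull the denied action out of the error text programmatically. The following is only a sketch: the function name `missingPermission` and the regular expression are assumptions made for illustration, not part of the installer or CAPA, and the pattern relies on the "is not authorized to perform: <action>" wording seen in the log above.

```go
package main

import (
	"fmt"
	"regexp"
)

// deniedAction matches the "service:Action" token that follows the phrase
// "is not authorized to perform:" in an AWS UnauthorizedOperation message.
// The pattern is an assumption based on the log lines quoted in this thread.
var deniedAction = regexp.MustCompile(`is not authorized to perform: ([A-Za-z0-9-]+:[A-Za-z0-9*]+)`)

// missingPermission (hypothetical helper) returns the IAM action named in msg
// and whether one was found.
func missingPermission(msg string) (string, bool) {
	m := deniedAction.FindStringSubmatch(msg)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Shortened version of the error message from the log above.
	msg := "User: arn:aws:iam::460538899914:user/ci-op-ty3fb51s-e9af7-minimal-perm " +
		"is not authorized to perform: ec2:AssociateAddress on resource: " +
		"arn:aws:ec2:us-east-1:460538899914:elastic-ip/eipalloc-0d9343e1bba507e66"
	if action, ok := missingPermission(msg); ok {
		fmt.Println("missing permission:", action)
	}
}
```

Running such a helper over the debug log would list each action that has to be added to the permission group, one failure at a time.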
openshift-ci[bot] commented 2 days ago

@r4f4: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-shared-vpc-edge-zones | 85617f66dae2cf18b5887de5eea42b0250386c5e | link | false | `/test e2e-aws-ovn-shared-vpc-edge-zones` |
| ci/prow/e2e-aws-ovn-edge-zones | 85617f66dae2cf18b5887de5eea42b0250386c5e | link | false | `/test e2e-aws-ovn-edge-zones` |
| ci/prow/okd-scos-e2e-aws-ovn | 85617f66dae2cf18b5887de5eea42b0250386c5e | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-external-aws-ccm | 85617f66dae2cf18b5887de5eea42b0250386c5e | link | false | `/test e2e-external-aws-ccm` |

r4f4 commented 2 days ago

@mtulio any idea why this is happening?

mtulio commented 11 hours ago

> @mtulio any idea why this is happening?

@r4f4 there is a problem in the machine manifest: the instance type added to the MachineSet manifest, m6i.xlarge, is not supported in the zone:

$ aws ec2 describe-instance-type-offerings --location-type availability-zone \
--filters Name=location,Values=us-west-2-wl1-sfo-wlz-1 \
 --region us-west-2 --query 'InstanceTypeOfferings[].InstanceType'
[
    "t3.xlarge",
    "g4dn.2xlarge",
    "t3.medium",
    "r5.2xlarge"
]

This is happening because the permission ec2:DescribeInstanceTypeOfferings is missing:

level=warning msg=unable to select instanceType on the zone[us-west-2-lax-1b] from the preferred \
list: [m6i.xlarge m5.xlarge r5.xlarge c5.2xlarge m5.2xlarge c5d.2xlarge r5.2xlarge]. \
You must update the MachineSet manifest: UnauthorizedOperation: You are not authorized to perform this operation. \
User: arn:aws:iam::460538899914:user/ci-op-nrkwfijt-e9af7-minimal-perm is not authorized to perform: \
ec2:DescribeInstanceTypeOfferings because no identity-based policy allows the \
ec2:DescribeInstanceTypeOfferings action
r4f4 commented 10 hours ago

@mtulio that should've been added by https://github.com/openshift/installer/pull/9114

edit: is that permission always needed when specifying edge machine pools? If so, we should add it to the edge permission group in https://github.com/openshift/installer/pull/9230

mtulio commented 10 hours ago

> @mtulio that should've been added by #9114
>
> edit: is that permission always needed when specifying edge machine pools? If so, we should add it to the edge permission group in #9230

@r4f4 ec2:DescribeInstanceTypeOfferings is used by default when no instance type is set for any machine pool (CP, worker, or edge): the installer discovers the "best" supported instance type for the pool based on the target region (for general pools) or zone (for edge zones), using filters of that API. It is not an edge-specific feature.
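
To make the discovery step concrete, here is a minimal sketch of picking the first preferred instance type that a region or zone actually offers. It is an illustration only, not the installer's code: `pickInstanceType` is a hypothetical name, and the two slices are hard-coded from the preferred list and the `describe-instance-type-offerings` output quoted earlier in this thread, where a real implementation would obtain the offerings from the EC2 API.

```go
package main

import "fmt"

// pickInstanceType (hypothetical helper) returns the first entry of preferred
// that also appears in offered, and reports whether any match was found.
// Order in preferred expresses priority; order in offered is irrelevant.
func pickInstanceType(preferred, offered []string) (string, bool) {
	available := make(map[string]struct{}, len(offered))
	for _, t := range offered {
		available[t] = struct{}{}
	}
	for _, t := range preferred {
		if _, ok := available[t]; ok {
			return t, true
		}
	}
	return "", false
}

func main() {
	// Preferred list as printed in the installer warning quoted in this thread.
	preferred := []string{
		"m6i.xlarge", "m5.xlarge", "r5.xlarge", "c5.2xlarge",
		"m5.2xlarge", "c5d.2xlarge", "r5.2xlarge",
	}
	// Offerings as returned by the aws CLI query quoted in this thread.
	offered := []string{"t3.xlarge", "g4dn.2xlarge", "t3.medium", "r5.2xlarge"}

	if t, ok := pickInstanceType(preferred, offered); ok {
		fmt.Println("selected:", t)
	} else {
		fmt.Println("no preferred instance type is offered here")
	}
}
```

With these inputs the only overlap is r5.2xlarge, which shows why the lookup matters in zones that do not offer m6i.xlarge.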

r4f4 commented 7 hours ago

> > @mtulio that should've been added by #9114
> >
> > edit: is that permission always needed when specifying edge machine pools? If so, we should add it to the edge permission group in #9230
>
> @r4f4 ec2:DescribeInstanceTypeOfferings is used by default when no instance type is set for any machine pool (CP, worker, or edge): the installer discovers the "best" supported instance type for the pool based on the target region (for general pools) or zone (for edge zones), using filters of that API. It is not an edge-specific feature.

@mtulio that perm is not required in the non-edge case and we just display a warning that we could not find a preferred instance type. If the edge node cannot work with the default instance type, there should be a better default or further validation.

mtulio commented 5 hours ago

> @mtulio that perm is not required in the non-edge case and we just display a warning that we could not find a preferred instance type.

@r4f4 I am interpreting this warning (which, imo, could be treated as a failure in certain situations, like the CP or worker pools, to prevent a later failure) as meaning the permission is required for control plane and worker pools. The installer will always call getInstanceTypeZoneInfo() when no instance type is set in the pool (master, worker), as this is the default path for IPI, right? Am I missing something? Do we have a CI test with this scenario (default install, without setting custom instances)?

r4f4 commented 4 hours ago

> > @mtulio that perm is not required in the non-edge case and we just display a warning that we could not find a preferred instance type.
>
> @r4f4 I am interpreting this warning (which, imo, could be treated as a failure in certain situations, like the CP or worker pools, to prevent a later failure) as meaning the permission is required for control plane and worker pools. The installer will always call getInstanceTypeZoneInfo() when no instance type is set in the pool (master, worker), as this is the default path for IPI, right? Am I missing something? Do we have a CI test with this scenario (default install, without setting custom instances)?

It's not required, it's optional. If this call fails, we proceed with the hardcoded default instance types in the installer (master, worker).

r4f4 commented 4 hours ago

> Do we have a CI test with this scenario (default install, without setting custom instances)?

AFAIK we do not: the way the steps are written, we always set an instance type in the install-config.yaml.

mtulio commented 4 hours ago

> > > @mtulio that perm is not required in the non-edge case and we just display a warning that we could not find a preferred instance type.
> >
> > @r4f4 I am interpreting this warning (which, imo, could be treated as a failure in certain situations, like the CP or worker pools, to prevent a later failure) as meaning the permission is required for control plane and worker pools. The installer will always call getInstanceTypeZoneInfo() when no instance type is set in the pool (master, worker), as this is the default path for IPI, right? Am I missing something? Do we have a CI test with this scenario (default install, without setting custom instances)?
>
> It's not required, it's optional. If this call fails, we proceed with the hardcoded default instance types in the installer (master, worker).

My interpretation is that this is required since, afaik, we don't expect the default path to fail :)

Furthermore, this function was introduced a long time ago, even before edge zones, to pick the best instance type in most regions while still covering regions where AWS takes time to roll out new instance generations. For example, m6i.xlarge took some time to become available in eu-west-2, which supported only 5th generation instances. Should most users be penalized with more expensive and slower instance types in most regions just because some regions do not support the newer ones?

r4f4 commented 4 hours ago

> > > > @mtulio that perm is not required in the non-edge case and we just display a warning that we could not find a preferred instance type.
> > >
> > > @r4f4 I am interpreting this warning (which, imo, could be treated as a failure in certain situations, like the CP or worker pools, to prevent a later failure) as meaning the permission is required for control plane and worker pools. The installer will always call getInstanceTypeZoneInfo() when no instance type is set in the pool (master, worker), as this is the default path for IPI, right? Am I missing something? Do we have a CI test with this scenario (default install, without setting custom instances)?
> >
> > It's not required, it's optional. If this call fails, we proceed with the hardcoded default instance types in the installer (master, worker).
>
> My interpretation is that this is required since, afaik, we don't expect the default path to fail :)

If we want it to be required, we have to remove the warning and actually fail the install. But that's not the case today and the warning was a design choice to make the permission optional.