terraform-aws-modules / terraform-aws-eks

Terraform module to create Amazon Elastic Kubernetes (EKS) resources πŸ‡ΊπŸ‡¦
https://registry.terraform.io/modules/terraform-aws-modules/eks/aws
Apache License 2.0

Pods suddenly won't deploy randomly on certain nodes; failing to pull pause sandbox image from ECR #2984

Closed: ForbiddenEra closed this issue 5 months ago

ForbiddenEra commented 5 months ago

Description

I have no idea if this is related to the module; I've come up empty in all my searching and can't think of anything else to correlate it with, so it's quite possibly unrelated.

Since updating to v20 and switching to access entries, I've had an issue where, occasionally, nodes cannot start new pods. The error I get on the affected pods is:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

This has happened on self-managed node groups (running the ubuntu-jammy AMI, probably serial 20240229, though I'm updating right now), and I'm pretty sure also with EKS-managed node groups using the AL2 AMI and, eventually, the AL2023 AMI (which is one of the main reasons I pushed through and updated to v20).

It's an inconsistent issue; the last time it manifested, it only affected about 50-60% of my nodes while the rest kept working fine. Affected and unaffected nodes were in the same node groups, running the same AMI and the same instance type, and AZ was not a correlated factor either.

I was unable to determine a root cause after much searching and digging, but rotating (terminating and re-creating) all my instances seemed to 'solve' the issue: once the new instances were up and had joined, pods created without error. Everything had been working perfectly since I last cycled the nodes, then the problem suddenly resurfaced today.

I'd hoped the issue wouldn't resurface and/or that recent AMI/EKS updates might have solved it, but I got a ping tonight from a dev who was unable to spawn pods, and upon checking I saw the same error again.

I am 99% sure I never saw this before updating to v20 and switching to access entries. I followed the directions: I used the terraform-aws-eks-v20-migrate version of the module to perform the initial v20 migration from the last v19 release, then after migration switched to the first v20 release, and finally updated to 20.8.3, where it currently sits.
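For reference, the upgrade path above basically amounts to stepping the module's version pin; a rough sketch of where it sits now (the surrounding configuration is omitted, and only the 20.8.3 pin is taken from what I described):

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.8.3" # final pin; earlier steps used the v19 -> v20 migration release, then the first v20 release

  # ... existing cluster configuration ...
}
```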

I don't recall any major issues during the update beyond being a bit bewildered trying to wrap my head around the new system without enough time to really dig into it. The only hiccup I can think of is that I initially tried to keep using the aws-auth ConfigMap alongside access entries with API_AND_CONFIG_MAP, but as soon as I switched off the migration version of the module, the ConfigMap was removed (it was originally created and managed by the module anyway), so I figured I'd just suck it up and commit to access entries since they're the way forward anyway.

My Terraform config is still set to API_AND_CONFIG_MAP: I haven't had time to fully dig into the changes, and since I'd read that I wouldn't be able to revert to the ConfigMap after switching to just API, I kept it that way at least until I knew everything was functional and I was a bit more read up on it. Admittedly, I'm probably fine to switch it to API now that the ConfigMap is gone, but everything seemed to be working that night, it was late, and I haven't touched the config since. I can't think of any negative side effects from leaving it set that way based on what I've read; feel free to correct me if I'm wrong on any of this. Again, I haven't had the chance to fully read up on the access entry stuff, just enough to be barely comfortable with updating, heh.
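For context, the relevant v20 module inputs look roughly like this; a hedged sketch rather than my actual config, with the entry name and principal ARN being placeholders:

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  # Transitional mode described above; "API" would drop aws-auth ConfigMap support entirely.
  authentication_mode = "API_AND_CONFIG_MAP"

  # Example access entry standing in for what the aws-auth ConfigMap used to grant.
  access_entries = {
    cluster_admins = {
      principal_arn = "arn:aws:iam::111122223333:role/platform-admin" # placeholder

      policy_associations = {
        admin = {
          policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
          access_scope = {
            type = "cluster"
          }
        }
      }
    }
  }
}
```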

I do have arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly under iam_role_additional_policies in both eks_managed_node_group_defaults and self_managed_node_group_defaults, along with some other policies. I haven't read anything suggesting it's no longer required and, for this issue, every deployed node group should have that policy attached and thus shouldn't have any trouble reading from the container registry..?
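Roughly, that part of the config looks like the sketch below (the map keys are arbitrary and the other policies are omitted; this is illustrative, not a copy of my actual file):

```hcl
eks_managed_node_group_defaults = {
  iam_role_additional_policies = {
    # Grants the node role read access to ECR, where the eks/pause sandbox image lives.
    ecr_read_only = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
    # ... other additional policies ...
  }
}

self_managed_node_group_defaults = {
  iam_role_additional_policies = {
    ecr_read_only = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
    # ... other additional policies ...
  }
}
```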

As I said, this might be unrelated to the TF module and might be an AWS-side thing, but I haven't been able to find much information, and the only correlation I can think of is the recent switch to access entries as part of updating to v20 of the module. I also hadn't run any Terraform operations with the module recently that could be related, which further suggests it's unrelated; still, switching to access entries is the only change I can point to. If anyone has any insight, it'd be much appreciated.

Versions

Reproduction Code [Required]

Not immediately reproducible; it was at least a week, if not two, of smooth operation after the last node recycle before the issue resurfaced.

bryantbiggs commented 5 months ago

please open an AWS support case for this - it is not related to the module, but it looks like it's something service related

ForbiddenEra commented 5 months ago

please open an AWS support case for this - it is not related to the module, but it looks like it's something service related

Figured that was the case; I just wanted to check here as well on the off chance it was related in any way (and, as I originally stated, that was quite unlikely), or in case others had seen or run into the issue. Of the people I'm aware of (which isn't many), you're the one I figured would be deep enough into AWS/EKS to have come across it, given the amount of work you've clearly put into these modules. If it's not something you've seen or are familiar with, that not only essentially eliminates the possibility of the modules being remotely culpable, it further reinforces that it's a weird AWS issue and probably not an overly common one.

In my searching, I saw a few threads over the years where people hit a similar or identical error, but I'm pretty sure each one had an unrelated cause. I've had a bunch of random issues since going to 1.29 on EKS anyway; I sure hope the most recent update solves this one as well. There seem to have been more updates early in 1.29 than I remember for 1.28, 1.27, or 1.26, so at least they seem aware and are trying to fix things up. It makes me want to consider waiting a bit when 1.30 comes out, though I had zero issues with the last few upgrades, so I dunno.. πŸ€·β€β™‚οΈ

I appreciate the reassurance that it shouldn't be related to any module configuration and that it's seemingly not something you've seen. I was pretty convinced it was unrelated, but the only things I really changed were updating the module, the required changes for the update like access entries, and switching my managed nodes to AL2023; that said, it affected both EKS-managed AL nodes and self-managed Ubuntu nodes.

I'll definitely open a ticket if it pops up again, hopefully it doesn't.

TL;DR: regardless, thanks for taking a sec to chime in, and for your contributions and shared hard work otherwise (and other contributors too, of course!)

github-actions[bot] commented 4 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.