pangeo-forge / pangeo-forge-cloud-federation

Infrastructure for running pangeo-forge across multiple bakeries
Apache License 2.0

duplicate Security Group issue #3

Closed rsignell-usgs closed 1 year ago

rsignell-usgs commented 1 year ago

@yuvipanda, a group of us here at SciPy 2023 (@jbusecke, @amsnyder, @thodson-usgs, @alaws-usgs, @kjdoore, @pnorton-usgs, @mwengren) sprinted on trying to get a pangeo-forge Beam runner running on AWS.

We changed the bucket name and ran terraform plan, which produced no errors.

We then ran terraform apply, and it said that 42 resources would be deployed. When we confirmed terraform apply, it deployed 41 resources, but no cluster was created, and the process ended with these messages:

│ Error: [WARN] A duplicate Security Group rule was found on (sg-0be717e95562150a8). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: sg-0b26f58a9a66a2208, TCP, from port: 9443, to port: 9443, ALLOW" already exists
│       status code: 400, request id: 30732ad6-fcda-47e6-896a-05e0d2f79e22
│
│   with module.eks.aws_security_group_rule.node["ingress_flink_operator_webhook_tcp"],
│   on .terraform/modules/eks/node_groups.tf line 207, in resource "aws_security_group_rule" "node":
│  207: resource "aws_security_group_rule" "node" {
│
╵
╷
│ Error: [WARN] A duplicate Security Group rule was found on (sg-0be717e95562150a8). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: sg-0b26f58a9a66a2208, TCP, from port: 8443, to port: 8443, ALLOW" already exists
│       status code: 400, request id: ba62c5b0-d0ad-4b0c-88bd-aaee506e2ecd
│
│   with module.eks.aws_security_group_rule.node["ingress_nginx_ingress_webhook_tcp"],
│   on .terraform/modules/eks/node_groups.tf line 207, in resource "aws_security_group_rule" "node":
│  207: resource "aws_security_group_rule" "node" {

We then tried commenting out the two chunks of code in cluster.tf that define ingress_flink_operator_webhook_tcp (port 9443) and ingress_nginx_ingress_webhook_tcp (port 8443).
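
As a rough illustration of why AWS rejects these rules: the `InvalidPermission.Duplicate` message identifies a rule only by its peer group, protocol, and port range, so two Terraform definitions with different map keys but the same values collide. The Python sketch below groups rule definitions by that identity. The two `ingress_*` keys and the peer group ID come from the error output above; the `hypothetical_*` keys are placeholders, since the thread does not say which other definition created the matching rules.

```python
from collections import defaultdict
from typing import NamedTuple


class Rule(NamedTuple):
    """The fields shown in the InvalidPermission.Duplicate message."""
    peer: str
    protocol: str
    from_port: int
    to_port: int


def find_collisions(rules: dict[str, Rule]) -> dict[Rule, list[str]]:
    """Group map keys by rule identity and keep groups with more than one key."""
    by_identity = defaultdict(list)
    for key, rule in rules.items():
        by_identity[rule].append(key)
    return {rule: keys for rule, keys in by_identity.items() if len(keys) > 1}


PEER = "sg-0b26f58a9a66a2208"  # peer group ID copied from the error output

rules = {
    # Keys taken from the error output.
    "ingress_flink_operator_webhook_tcp": Rule(PEER, "tcp", 9443, 9443),
    "ingress_nginx_ingress_webhook_tcp": Rule(PEER, "tcp", 8443, 8443),
    # Hypothetical keys standing in for whatever else defined the same rules.
    "hypothetical_other_rule_9443": Rule(PEER, "tcp", 9443, 9443),
    "hypothetical_other_rule_8443": Rule(PEER, "tcp", 8443, 8443),
}

collisions = find_collisions(rules)
for rule, keys in collisions.items():
    print(f"port {rule.from_port}: {keys}")
```

Under these assumptions, each port ends up with two keys pointing at the same rule, which is the situation the second create request would hit. Removing one definition per port, as we did in cluster.tf, leaves a single owner for each rule.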

That allowed a cluster to be created, but then we ran into other problems, so we quit.

Any ideas?

rsignell-usgs commented 1 year ago

@cisaacstern or @thodson-usgs did you get a chance to discuss this during the pangeo-forge meeting yesterday?

Is this still the recommended way to execute Beam pipelines on AWS?

thodson-usgs commented 1 year ago

No time to discuss it. @cisaacstern is occupied with their next release, so perhaps we wait a few more days in case @yuvipanda gets back.

yuvipanda commented 1 year ago

Can you tell me what the other problems were, @rsignell-usgs?

yuvipanda commented 1 year ago

I'm deleting and recreating this now to see what happens.

mwengren commented 1 year ago

I believe the listener didn't fully start when we commented out some of the security groups in the terraform module.

The cluster was created (it took 20-25 minutes or so to complete), but there may have been an error message at the end, or we may have gotten an error when submitting a job; I don't recall. @rsignell-usgs would probably know better since it was deployed on his laptop, but he is at the ESIP meeting at the moment.

Thanks!

yuvipanda commented 1 year ago

I can reproduce this now, even with the most up-to-date AWS terraform module. It looks to be this bug: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1409#issuecomment-1308120151. Another frustrating loss for autolock bots, as the bug clearly still exists.

cisaacstern commented 1 year ago

Wow thanks for getting to the bottom of this so quickly, Yuvi!

yuvipanda commented 1 year ago

I'm actually just rewriting the terraform to not depend on that module, hang on.

yuvipanda commented 1 year ago

@cisaacstern @rsignell-usgs @mwengren so I've rewritten the terraform code to not use the third-party EKS module it was using before, but instead just use the terraform AWS provider. This error is fixed now with https://github.com/yuvipanda/pangeo-forge-cloud-federation/pull/4 and the infrastructure itself is fully set up. I haven't tried running a recipe on it yet, but it all applies cleanly.

Thanks for trying it out and reporting it.