Closed rsignell-usgs closed 1 year ago
@cisaacstern or @thodson-usgs did you get a chance to discuss this during the pangeo-forge meeting yesterday?
Is this still the recommended way to execute Beam pipelines on AWS?
We had no time to discuss it. @cisaacstern is occupied with their next release, so perhaps we wait a few more days in case @yuvipanda gets back.
Can you tell me what the other problems were, @rsignell-usgs?
I'm deleting and recreating this now to see what happens.
I believe the listener didn't fully start when we commented out some of the security groups in the terraform module.
The cluster was created, which took 20-25 minutes or so, but perhaps with an error message at the end, or we may have gotten an error when submitting a job; I don't recall. @rsignell-usgs would probably know better since it was deployed on his laptop, but he is at the ESIP meeting at the moment.
Thanks!
I can reproduce this now, even with the most up-to-date AWS terraform module. Looks to be this bug: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1409#issuecomment-1308120151. Another frustrating loss for autolock bots, as the bug clearly still exists.
Wow thanks for getting to the bottom of this so quickly, Yuvi!
I'm actually just rewriting the terraform to not depend on that module, hang on.
@cisaacstern @rsignell-usgs @mwengren so I've rewritten the terraform code to not use the third-party EKS module it was using before, but instead just use the terraform AWS provider. This error is fixed now with https://github.com/yuvipanda/pangeo-forge-cloud-federation/pull/4 and the infrastructure itself is fully set up. I haven't tried to run a recipe on it yet, but it all applies cleanly.
Thanks for trying it out and reporting it.
@yuvipanda, a group of us here at SciPy 2023 (@jbusecke, @amsnyder, @thodson-usgs, @alaws-usgs, @kjdoore, @pnorton-usgs, @mwengren) sprinted on trying to get a pangeo-forge beam runner running on AWS.
We changed the bucket name and ran `terraform plan`, which produced no errors.
We then ran `terraform apply` and it said that 42 resources would be deployed. When we confirmed `terraform apply`, it deployed 41 resources, but no cluster was created, and the process ended with these messages:
We then tried commenting out two chunks of code in `cluster.tf` that specify `ingress_flink_operator_webhook_tcp` for port 9443 and 8443.
That allowed a cluster to be created, but then we ran into other problems, so we quit.
Any ideas?