int128 / terraform-aws-nat-instance

Terraform module to provision a NAT Instance using an Auto Scaling Group and Spot Instance from $1/month
https://registry.terraform.io/modules/int128/nat-instance/aws/
Apache License 2.0

Instance stuck if ENI wasn't attached properly #57

Open LiranV opened 1 year ago

LiranV commented 1 year ago

Hello, I've encountered the following issue:

  1. The NAT ec2 instance needs to be replaced due to failure or spot termination.
  2. The original instance is removed and the ASG is spawning a new one.
  3. In the meantime the ENI that was used by the instance is still not available for reattachment.
  4. The new instance starts but fails to attach the ENI and gets stuck in a loop while not forwarding traffic.

This happens because the aws ec2 attach-network-interface command in the runonce.sh script fails, but the script still moves on to starting the snat service.

In the snat.sh script (run by snat.service) we have the following loop:

while ! ip link show dev eth1; do
  sleep 1
done

This loop will run forever, because the eth1 interface never becomes available.

Possible solutions:

  1. Add a check after aws ec2 attach-network-interface to verify that the interface was actually attached (or check the return code), and fail if it was not.
  2. Bound the loop so it cannot run forever, so that users of the module can add their own script to detect this case and handle it however they see fit.
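
The two options above could be sketched roughly as follows. This is only an illustration, not the module's actual code: attach_eni and wait_for_link are hypothetical helper names, the attempt count, delay and timeout are arbitrary parameters, device index 1 is assumed to correspond to eth1, and the AWS CLI is assumed to already have a region and credentials configured.

```shell
#!/bin/sh
# Sketch only: attach_eni and wait_for_link are hypothetical helpers,
# not functions taken from the module's runonce.sh or snat.sh.

# Option 1: retry the attach and report failure through the exit status.
# Arguments: ENI ID, instance ID, max attempts, seconds between attempts.
attach_eni() {
  eni_id="$1"; instance_id="$2"; attempts="$3"; delay="$4"
  n=1
  while ! aws ec2 attach-network-interface \
      --network-interface-id "$eni_id" \
      --instance-id "$instance_id" \
      --device-index 1; do
    if [ "$n" -ge "$attempts" ]; then
      echo "ENI attach failed after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Option 2: wait for the interface, but give up after a timeout (seconds).
# Arguments: interface name, timeout.
wait_for_link() {
  dev="$1"; timeout="$2"
  elapsed=0
  while ! ip link show dev "$dev" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "$dev did not appear within ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

A caller could then run something like `attach_eni "$eni_id" "$instance_id" 10 6 && wait_for_link eth1 60` and treat a non-zero exit status as the signal to stop instead of starting the snat service (the variable names and numbers here are placeholders).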

hnryjms commented 8 months ago

Can we just terminate the instance if the aws ec2 attach-network-interface command fails? Presumably the ENI will free up after a minute or two, and the second or third EC2 box launched by the Auto-Scaling Group would succeed in attaching the ENI.

Edit: PR #72 seems pretty good also .. how come it's not merged?
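
For illustration, the terminate-on-failure idea might look something like the sketch below. It is not taken from the module or from PR #72: attach_or_terminate is a hypothetical helper, device index 1 is assumed to map to eth1, and the instance role is assumed to be allowed to call ec2:TerminateInstances on itself (which the module may not grant by default).

```shell
#!/bin/sh
# Sketch only: attach_or_terminate is a hypothetical helper,
# not part of the module's runonce.sh.

# Try to attach the ENI; if that fails, terminate this instance so the
# Auto Scaling Group launches a replacement that can try again.
# Arguments: ENI ID, instance ID.
attach_or_terminate() {
  eni_id="$1"; instance_id="$2"
  if aws ec2 attach-network-interface \
      --network-interface-id "$eni_id" \
      --instance-id "$instance_id" \
      --device-index 1; then
    return 0
  fi
  echo "ENI attach failed; terminating $instance_id" >&2
  aws ec2 terminate-instances --instance-ids "$instance_id"
  return 1
}
```

One trade-off with this approach is that, if the ENI stays unavailable for longer than expected, the group keeps cycling through instances, so a short retry before terminating may be worth considering.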