nelg opened 3 years ago
To get my one working, I ended up making the changes as per https://github.com/nelg/terraform-aws-nat-instance/commit/e4a0b3310d32d038a89d1c6a7d43498819a40b3e
Hi nelg
If I just want to set up an Amazon Linux 2 NAT instance, and don't want IaC to provision all the other infra, which commands should I run to get an Amazon Linux 2 NAT working?
Thanks in advance.
@nelg I am experiencing the same issue as you did and your fix seems to solve the problem. Could you provide @int128 with a PR that could be tested, merged and published to Terraform registry, so the whole module would be operational again?
Sure, will do
Here is the PR https://github.com/int128/terraform-aws-nat-instance/pull/37
This issue and your fix solved 5+ hours of debugging work for me. Thank you and I hope it gets merged soon.
It seems NAT connection is lost after the NAT instance is rebooted.
The command
ip route del default dev eth0
is needed to change the default route to eth1 to fix the source IP, because the EIP of eth0 will change when the instance is recreated by the Auto Scaling Group.
I noticed the route table is broken after reboot as follows:
## When an instance is created
[ssm-user@ip-172-18-138-43 bin]$ ip ro
default via 172.18.128.1 dev eth1 metric 10001
169.254.169.254 dev eth0
172.18.128.0/20 dev eth0 proto kernel scope link src 172.18.138.43
172.18.128.0/20 dev eth1 proto kernel scope link src 172.18.132.145
[ssm-user@ip-172-18-138-43 bin]$ sudo reboot
## After reboot
[ssm-user@ip-172-18-138-43 bin]$ ip ro
default via 172.18.128.1 dev eth0
default via 172.18.128.1 dev eth1 metric 10001
169.254.169.254 dev eth0
172.18.128.0/20 dev eth0 proto kernel scope link src 172.18.138.43
172.18.128.0/20 dev eth1 proto kernel scope link src 172.18.132.145
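For what it's worth, the broken state is easy to detect mechanically. A small sketch that counts stray default routes on eth0, using the post-reboot output above as sample input (an assumption for illustration; on a live instance you'd substitute `routes=$(ip route)`):

```shell
# Sample input: the post-reboot route table quoted above.
routes='default via 172.18.128.1 dev eth0
default via 172.18.128.1 dev eth1 metric 10001
169.254.169.254 dev eth0
172.18.128.0/20 dev eth0 proto kernel scope link src 172.18.138.43
172.18.128.0/20 dev eth1 proto kernel scope link src 172.18.132.145'

# Count default routes that go out via eth0; anything above zero means the
# reboot has reintroduced the route that breaks source-IP pinning via eth1.
stray=$(printf '%s\n' "$routes" | grep -c '^default via .* dev eth0')
echo "stray default routes via eth0: $stray"
```

A check like this could run from cron or a boot script to flag the regression before traffic silently breaks.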
Finally, I fixed this problem by removing the config of eth0:
sudo rm /etc/sysconfig/network-scripts/ifcfg-eth0
I will add it to the script.
I think #42 resolved the issue. Please let me know if the issue still occurs.
I have tested the version 2.0.1 release on the Terraform registry, and it doesn't work: the instance still has eth0 as the default route, so it can't send traffic to the internet.
Which version should I test?
I'm quite keen to get a working version of this published on the registry. Rather than me publishing a copy of yours, can we work together to get it working, if you have time sometime in the next couple of weeks?
My solution is working for us, but it's not perfect: it ends up with two default routes and two interfaces in the same subnet. Of the two attached ENIs, one has a public IP and a private IP; the other has just a private IP. We have to route out via the one with the public IP to reach the internet.
Yeah, this latest fix is bogus.
I built this module from the example in README.md and this is my NAT instance's networking details after a reboot:
sh-4.2$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 06:e8:a4:c9:de:f6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
link/ether 06:33:c7:03:41:92 brd ff:ff:ff:ff:ff:ff
inet 10.0.128.88/24 brd 10.0.128.255 scope global dynamic eth1
valid_lft 3401sec preferred_lft 3401sec
inet6 fe80::433:c7ff:fe03:4192/64 scope link
valid_lft forever preferred_lft forever
sh-4.2$ ip route
default via 10.0.128.1 dev eth1 metric 10001
10.0.128.0/24 dev eth1 proto kernel scope link src 10.0.128.88
sh-4.2$ sudo iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE all -- anywhere anywhere
Long story short, it appears this module is broken. I tried downgrading to 2.0.0, but after that I couldn't even connect to the EC2 instance via SSM to debug this.
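For reference, the lone MASQUERADE rule in the output above is the core of the SNAT setup. A dry-run sketch of the equivalent configuration (`run` only prints each command here; remove it to execute for real, which requires root; the eth1 interface name follows the output above):

```shell
# Dry-run wrapper: prints the command instead of executing it.
run() { echo "+ $*"; }

# Enable IP forwarding so the instance will route packets between interfaces.
run sysctl -w net.ipv4.ip_forward=1

# Masquerade (SNAT) everything leaving via eth1 behind the instance's address.
run iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE
```

On a real instance these settings need to persist across reboots (this module does that via its snat service), which is exactly the part this thread shows failing.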
When you tried using it, did you have an EIP assigned to the NAT instance? It has to be created externally to the module, then passed in.
I had problems when I didn't assign an EIP.
Yep, I had to reorganise stuff to do so, but I did have an EIP on the NAT instance when I did my first round of testing with version 2.0.1.
My initial testing of this module failed to produce a working internet connection on the NAT instance or an instance on a private subnet, so it looks like something's misconfigured or missing. For the record, it's possible that the "something missing" is entirely my fault.
My understanding is that NAT gateways work like this: private host -> network interface -> NAT -> route table -> internet. How we get to the internet shouldn't matter, which makes deleting the eth0 configuration script (and therefore leaving that interface unconfigured after a reboot) seem bogus. That said, all my previous hacking has used separate interfaces for the input and output sides of the NAT gateway, so it's quite possible it'll all work on one interface and leaving eth0 unconfigured is correct.
I suspect that I've made a mistake somewhere here, but I also know that the NAT gateway should have had internet access in my testing, and the fact that it doesn't is concerning. I'm going to try a couple of other options and then maybe return to this depending on the outcome. fck-nat seems promising if I can figure out a simple way to Terraformise its setup.
(Another thing that stood out is that the ENI handling needs to be smarter: we should be able to detect whether it's already attached or somehow still in use (e.g. after an instance is terminated) and respond appropriately.)
I've been thinking about this over the past couple of days and worked out why deconfiguring eth0 and requiring an EIP felt so wrong to me, and what I did wrong to break my instance of this module.
Essentially, the bit I was missing here is that we need a public IP address so we can send traffic through an internet gateway, and the floating ENI (eth1) doesn't get one by default, so we need to assign an EIP to it so it has a public IP and can connect out; otherwise we kill our internet connection when we deconfigure eth0.
This makes sense with the current use cases: eth1 has a public IP, so all our connectivity can be done on eth1 and we therefore don't need eth0.
The reason it wasn't working for me initially is that if the EIP isn't available before the EC2 instance starts, the instance doesn't get the routes it needs and is therefore cut off from the internet.
I'd really like this module to work without an EIP, so I'm going to hack together a patch to always use eth0 for output, which should make this more reliable and drop the EIP requirement unless people are doing DNAT. (DNAT should still work even if we're using eth0 for our default route.)
Ok: use eth0 for the upstream connection: #52 (note that this is on top of the previous change).
@int128 these changes are probably overkill and I haven't tested DNAT, but they Work For Me so they should be mergeable.
This module uses eth1 with the EIP to pin the source IP address. If eth0 is used, the source IP address may fluctuate.
I think your change breaks the fixed-IP feature. What do you think?
This is what I have been using, which seems to be OK; at least I haven't had problems with it.
module "nat" {
  source                      = "github.com/int128/terraform-aws-nat-instance?ref=5a3d3f41568d8af145e291067f1e6e9d71fb36fd"
  enabled                     = var.nat_gw ? false : true
  name                        = "natgw"
  vpc_id                      = module.vpc.vpc_id
  public_subnet               = module.vpc.public_subnets[0]
  private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks
  private_route_table_ids     = var.nat_gw ? [] : module.vpc.private_route_table_ids
}
resource "aws_eip" "nat" {
  network_interface = module.nat.eni_id
  tags = {
    "Name" = "nat-instance-main"
  }
}
This module uses eth1 with the EIP to pin the source IP address. If eth0 is used, the source IP address may fluctuate.
I think your change breaks the fixed-IP feature. What do you think?
I guess it depends on your use case.
If you need all your NATed traffic to come from a constant IP, then yeah, this breaks that. But that should be a pretty niche use case, and NAT instances should be pretty long-running and therefore have a relatively constant IP address, just not one known in advance.
If DNAT port forwarding is enabled, it should still work as long as the services inside the private subnet aren't expecting to be able to tell remote services something like "hey, connect to whatever my IP is, but on port 1234", where port 1234 has previously been opened using DNAT. Again, this should be a pretty niche use case, and I think most common services that do this, e.g. active FTP, already have special-case handling in Linux.
I guess that in my opinion, a constant source IP address isn't required for well over 90% of use cases, so this will be fine and removes the need for an EIP, reducing costs and resource usage.
But yeah, we can't ignore those niche cases, so maybe this should be switchable then? No EIP required for the common use cases, and tell the module it'll have an EIP if you absolutely need certainty about the source IP.
If you need all your NATed traffic to come from a constant IP, then yeah, this breaks that. But that should be a pretty niche use case, and NAT instances should be pretty long-running and therefore have a relatively constant IP address, just not one known in advance.
I think this case needs to be supported; it's not that uncommon to have a whitelisted external IP.
I guess that in my opinion, a constant source IP address isn't required for well over 90% of use cases, so this will be fine and removes the need for an EIP, reducing costs and resource usage.
As per the AWS docs, an Elastic IP address doesn't incur charges as long as all the following conditions are true:
- The Elastic IP address is associated with an EC2 instance.
- The instance associated with the Elastic IP address is running.
- The instance has only one Elastic IP address attached to it.
- The Elastic IP address is associated with an attached network interface.
So I don't think having the Elastic IP is adding any costs, because the NAT instance exists all of the time.
I think this case needs to be supported; it's not that uncommon to have a whitelisted external IP.
I agree that there are situations where it's needed, so I'll make it configurable.
As per the AWS docs: An Elastic IP address doesn’t incur charges as long as all the following conditions are true:
- The Elastic IP address is associated with an EC2 instance.
- The instance associated with the Elastic IP address is running.
- The instance has only one Elastic IP address attached to it.
- The Elastic IP address is associated with an attached network interface. For more information, see Network interface basics.
So I don't think having the Elastic IP is adding any costs, because the NAT instance exists all of the time.
True, but you're limited to 5 of them without jumping through hoops with AWS support. I had to change how I was doing things in my VPC because I was using all 5 before I deployed this. So for people whose allocation is mostly used up by things they can't free, or who want more than 5 VPCs with NAT instances, it'd be nice not to require one.
I am using this module and the EIP is not attaching to the NAT instance, and the snat service is failing.
When I analyzed the repo I found that this NAT module has a runonce.sh script and a snat.sh script. When the launch template is created, the user data section contains a command to execute this runonce.sh script.
This runonce.sh script is responsible for attaching the ENI to the NAT instance and then starting the snat service, which in turn calls the /opt/nat/snat.sh script that applies the NAT configuration.
But this is not working as expected: runonce.sh is not getting executed.
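A hedged debugging sketch for this symptom: check on the instance whether user data ran at all. The log path is the Amazon Linux 2 cloud-init default, which is an assumption here; other AMIs may log elsewhere.

```shell
# Look for evidence that the user-data script (runonce.sh) was executed by
# cloud-init. Path is the Amazon Linux 2 default (an assumption).
log=/var/log/cloud-init-output.log
if [ -f "$log" ] && grep -qi runonce "$log"; then
  status="runonce.sh appears in cloud-init output"
else
  status="no trace of runonce.sh; user data likely did not run"
fi
echo "diagnosis: $status"
```

If user data never ran, the next things to check would be the launch template's user_data field and whether the instance was launched from that template at all.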
Hi,
I've had issues with this not working, although it used to work.
It seems that when it deletes the default route (ip route del default dev eth0), the NAT instance then loses all internet connectivity.
Does this still work for you?