opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Wrong outgoing source IP (0.0.0.0) #6036

Closed Monke202 closed 1 year ago

Monke202 commented 2 years ago

Describe the bug

Outgoing packets originating from the firewall itself on the WAN interface have the source IP address 0.0.0.0 instead of the configured IP address of the corresponding interface.

Last known working version: 22.7.3_2

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade version 22.7 -> 22.7.3
  2. Upgrade version 22.7_4 -> 22.7.3_2
  3. Update to version 22.7.4 fails because the network is not functional anymore

Expected behavior

The firewall can reach the external network.

Describe alternatives you considered

A source NAT rule for 0.0.0.0/32 on the WAN interface solves the problem.
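
In pf terms, the workaround rule would look roughly like the sketch below (the WAN interface name em1 and the port range are assumptions here; the rule that actually ended up in rules.debug is quoted later in the thread):

  # Hypothetical manual outbound NAT rule: rewrite packets leaving the WAN
  # interface with source 0.0.0.0 to the WAN interface's first address.
  nat on em1 inet from 0.0.0.0/32 to any -> (em1:0) port 1024:65535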

Screenshots

[Screenshot: nat_rule — outbound NAT rule for source 0.0.0.0/32 on the WAN interface]

Relevant log files

Error configd.py Timeout (120) executing : firmware remote

Additional context

Environment

Multi-WAN Setup

Firewall: OPNsense 22.7.4-amd64, FreeBSD 13.1-RELEASE-p2, OpenSSL 1.1.1q 5 Jul 2022
Server: Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz (4 cores, 8 threads)
Network: Intel® I210, Intel® I350

AdSchellevis commented 2 years ago

best check the local routing table first (netstat -nr)

Monke202 commented 2 years ago

Thanks for the reply. We looked into the routing table and couldn't find any suspicious entries. We also compared it with a working OPNsense setup at another location (single-WAN setup).

tcpdump on the WAN interface shows that the outgoing packets have 0.0.0.0 as their source address. When we configure outbound NAT rules for source 0.0.0.0, tcpdump shows the correct source IP for the WAN gateway.
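
For anyone trying to reproduce this, a capture along these lines is enough to spot the bad source address (em1 is a placeholder for the actual WAN interface):

  # Show outbound packets on the WAN interface that carry 0.0.0.0 as source.
  tcpdump -ni em1 'src host 0.0.0.0'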

putt1ck commented 1 year ago

We have the same issue on a firewall during the upgrade to 22.7.4. Other firewalls with a near-identical config in the same organisation completed the upgrade successfully; the primary differences between the affected firewall and the others are that it has an SSN interface and, as the primary site, a more complex set of rules for NAT and the firewall generally.

Testing at the firewall CLI with the host command and a specified external nameserver (the initial symptom noted was failing DNS resolution) while watching the firewall logs, you can see that all traffic originating from the firewall gets 0.0.0.0 as its source IP. Testing with ping to an IP similarly shows the source as 0.0.0.0.
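
The tests boil down to something like the following from the firewall shell (the resolver 9.9.9.9 matches the later comments; the ping target is just an example):

  # Query an external resolver directly, bypassing local Unbound.
  host -t A google.de 9.9.9.9
  # Plain ICMP test; the live firewall log shows which source address was used.
  ping -c 3 9.9.9.9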

putt1ck commented 1 year ago

NB: the suggested workaround of adding a NAT rule for 0.0.0.0/32 worked, thanks @Monke202

fichtner commented 1 year ago

@putt1ck still unsure where this traffic originates from and why it gets a 0.0.0.0 source address. Do you still have a setup to reproduce? Is "0.0.0.0" found in ifconfig output or in the file /tmp/rules.debug?

Cheers, Franco

putt1ck commented 1 year ago

So, logged into the firewall over SSH and running the host command:

  # host -t A google.de 9.9.9.9
  ;; connection timed out; no servers could be reached

Viewing the logs via the web UI, you see

  0.0.0.0:34147   9.9.9.9:53   udp   let out anything from firewall host itself

where on an install without the issue the firewall IP is shown instead of 0.0.0.0. The workaround is in place on that firewall, so I assume removing the rule will allow further tests (but that would need to wait until out of hours).

ifconfig doesn't show an interface with 0.0.0.0

/tmp/rules.debug has only the rule added as a workaround, i.e.

  nat on em1 inet from 0.0.0.0/32 to any -> (em1:0) port 1024:65535 # Workaround for internal NAT issue

fichtner commented 1 year ago

@putt1ck just to be on the safe side can you disable "Use shared forwarding between packet filter, traffic shaper and captive portal" under Firewall: Settings: Advanced and see if the issue persists? If yes it's a routing table issue in FreeBSD 13... netstat -nr4 might reveal something in that case.

fichtner commented 1 year ago

Very old bug report, not sure if still applies (and not resolved) https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=159103

putt1ck commented 1 year ago

@fichtner Shared Forwarding was not enabled. I tried enabling it just for fun, but either way (without the manual rule workaround) it doesn't resolve the issue. Regarding that bug, I can't see a "network_interfaces" conf line that looks like the one described in the workaround.

fichtner commented 1 year ago

@putt1ck the bug was for FreeBSD, which uses rc.conf syntax to init devices so that part doesn't apply for us. It suggests a problem with loopback devices. Do you have additional loopback devices configured?

Cheers, Franco

putt1ck commented 1 year ago

I completed the upgrade and the issue still exists if you want to try more fixes.

Running the upgrade from the console reports a probably unrelated issue:

py37-markupsafe has a missing dependency: python37
py37-markupsafe has a missing dependency: py37-setuptools
py37-markupsafe is missing a required shared library: libpython3.7m.so.1.0

>>> Missing package dependencies were detected.
>>> Found 2 issue(s) in the package database.

pkg-static: No packages available to install matching 'python37' have been found in the repositories
pkg-static: No packages available to install matching 'py37-setuptools' have been found in the repositories
>>> Summary of actions performed:

python37 dependency failed to be fixed
py37-setuptools dependency failed to be fixed

fichtner commented 1 year ago

  # pkg remove py37-markupsafe

Long unused... introduced by a bug in the package manager while renaming the package from mixed-case to lower-case letters.

putt1ck commented 1 year ago

Only lo0 is listed, and the same set of interfaces is listed on a branch-site firewall with identical hardware that updated without this issue arising. The only obvious difference for network interfaces is the SSN interface, which does not exist on the branch one.

Looking at the NAT configs: NPTv6 is the same (unused). Outbound is basically the same, except the main office (the one with the issue) has more manual rules (larger office, more internal subnets), and some of those rules have specified external addresses (the connection has a /29) where the branch only uses "interface address". One-to-one has an entry at main but none at branch. Port forward at branch has only a few entries (2 + anti-lockout) while main has many (~25), including one "loopback" rule for capturing NTP queries (Android, why does it ignore DHCP conf?): ! LAN address 123 (NTP) -> firewall internal interface address 123 (NTP)

encbladexp commented 1 year ago

I had a similar issue today; the only thing that fixed it was a reboot of the OPNsense appliance.

Before this happened, I noticed strange DNS issues and started to debug Unbound DNS as well as IPv6, which didn't solve my issues. Systems behind the appliance worked well (except DNS, due to side effects of this bug), but everything on the firewall itself used 0.0.0.0 as the source address.

putt1ck commented 1 year ago

What version was the affected firewall running? Did it start on an upgrade or maybe some other recent change?

encbladexp commented 1 year ago

Currently it is OPNsense 22.7.9_3-amd64, I am unsure if it started on the upgrade, but what I did shortly before:

I am still searching for a correlation. I noticed the bug only because Unbound DNS didn't work anymore, which has some impact on my network for sure. A quick check over an SSH session showed that all packets sent directly from the appliance itself use 0.0.0.0 as the source IP; all forwarded packets work as expected.

AdSchellevis commented 1 year ago

Just had a similar issue; for some reason my routing table changed from:

  Internet:
  Destination        Gateway            Flags     Netif Expire
  default            xxx.xxx.xxx.1        UGS        igc1
  xxx.xxx.xxx.0/24   link#2               U          igc1
  xxx.xxx.xxx.121    link#2               UHS         lo0

to

  Internet:
  Destination        Gateway            Flags     Netif Expire
  default            92.108.79.1        UGS        igc1
  xxx.xxx.xxx.0/24     link#2             U          igc1
  xxx.xxx.xxx.1        link#2             UHS        igc1      <<<---- ??
  xxx.xxx.xxx.121      link#2             UHS         lo0

If anyone has the same 0.0.0.0 outbound issue, it might be worth checking whether the gateway is configured on a link (which would explain the behaviour, although I don't know where it came from).
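
A quick way to check, sketched with a placeholder address: compare how the default route and the gateway address itself are resolved; in the broken table above the gateway has its own UHS host route on the link instead of being reached through the connected network route.

  # Show the IPv4 routing table and how the default gateway is resolved
  # (replace 192.0.2.1 with your actual gateway address).
  netstat -rn -4
  route -n get 192.0.2.1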

AdSchellevis commented 1 year ago

might be https://github.com/opnsense/core/commit/a230326d7fe165e597cd2d5a30b064e0b3a1c58c as well

fichtner commented 1 year ago

Not really, this has been happening since 22.1 (FreeBSD 13).

mimugmail commented 1 year ago

I'm just in a Teams session with a customer with the same phenomenon. At their site they had disabled NAT; we set it to manual without adding any rules, and then 0.0.0.0 was replaced with the original WAN address.

Maybe it helps :)

AdSchellevis commented 1 year ago

@mimugmail what does the routing table look like? mine missed a link, which is why it looked similar to https://github.com/opnsense/core/commit/a230326d7fe165e597cd2d5a30b064e0b3a1c58c (but maybe something completely different)

mimugmail commented 1 year ago

Routing table is ok... we were able to log in via WAN with SSH and the UI; only locally generated packets (DNS queries) were using 0.0.0.0 as the source. Sadly I already hopped off the session.

Pinoir commented 1 year ago

I have what I think is the same issue. I've been chasing this for a long time, but I'm not a network specialist so was assuming it was just my lack of expertise.

In case it's useful, I can ping using a specified source address (-S) from OPNsense, but a regular ping times out. LAN traffic is routed correctly. It's just local system traffic that doesn't go anywhere.

Something else I observed when the issue started: when a VPN tunnel was up, it would route traffic from clients over the VPN, but "normal" traffic stopped working. Taking the tunnel down again got traffic flowing again, in case that's related.

I'm running 22.7.4 on ESXi. Happy to do troubleshooting!

AdSchellevis commented 1 year ago

The cases I have seen relate to missing link addresses in the routing table, netstat -nr -4 would easily tell you if that's the case. If for some reason the address is removed but not added again, it would explain the behavior (https://github.com/opnsense/core/commit/a230326d7fe165e597cd2d5a30b064e0b3a1c58c caused that, but there might be other reasons like a dhcp client not playing nicely).
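
As a concrete check (a sketch; 203.0.113.121 is a placeholder for the WAN interface address): a healthy table has a UHS host route for the interface's own address via lo0, like the xxx.xxx.xxx.121 entry shown earlier. If that entry was removed and never re-added, it would match the behaviour described here.

  # Dump the IPv4 routing table and look for the interface address entry
  # (flags UHS, Netif lo0). Replace 203.0.113.121 with your WAN address.
  netstat -nr -4
  netstat -nr -4 | grep 203.0.113.121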

Pinoir commented 1 year ago

This is what I have. vmx0 is WAN, vmx1 is LAN.

Routing tables

  Internet:
  Destination        Gateway            Flags     Netif Expire
  default            192.168.0.1        UGS       vmx0
  8.8.4.4            192.168.0.1        UGHS      vmx0
  8.8.8.8            192.168.0.1        UGHS      vmx0
  127.0.0.1          link#4             UH        lo0
  192.168.0.0/24     link#1             U         vmx0
  192.168.0.1        link#1             UHS       vmx0
  192.168.0.250      link#1             UHS       lo0
  192.168.10.0/23    link#2             U         vmx1
  192.168.11.250     link#2             UHS       lo0

AdSchellevis commented 1 year ago

what does route show 8.8.8.8 return?

Pinoir commented 1 year ago

  route to: dns.google
  destination: dns.google
  gateway: 192.168.0.1
  fib: 0
  interface: vmx0
  flags: <UP,GATEWAY,HOST,DONE,STATIC>
  recvpipe  sendpipe  ssthresh  rtt,msec  mtu   weight  expire
         0         0         0         0  1500       1       0

AdSchellevis commented 1 year ago

ok, that's good, doesn't look like a routing issue then. Next question is about the nat rules, what do they look like:

  grep '^nat on' /tmp/rules.debug

Pinoir commented 1 year ago

  nat on vmx0 inet from (vmx1:network) to any port 500 -> (vmx0:0) static-port # Automatic outbound rule
  nat on vmx0 inet from (lo0:network) to any port 500 -> (vmx0:0) static-port # Automatic outbound rule
  nat on vmx0 inet from 127.0.0.0/8 to any port 500 -> (vmx0:0) static-port # Automatic outbound rule
  nat on vmx0 inet from (vmx1:network) to any -> (vmx0:0) port 1024:65535 # Automatic outbound rule
  nat on vmx0 inet from (lo0:network) to any -> (vmx0:0) port 1024:65535 # Automatic outbound rule
  nat on vmx0 inet from 127.0.0.0/8 to any -> (vmx0:0) port 1024:65535 # Automatic outbound rule

AdSchellevis commented 1 year ago

Just to be sure, you are not able to ping 8.8.8.8 from this machine? If that's the case, it's probably a good idea to capture some traffic first. So far all looks normal; I don't expect your machine is sending out traffic with address 0.0.0.0.

Pinoir commented 1 year ago

Actually, I can.

  $ ping 8.8.8.8
  PING 8.8.8.8 (8.8.8.8): 56 data bytes
  64 bytes from 8.8.8.8: icmp_seq=0 ttl=119 time=6.735 ms
  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=7.303 ms
  64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=6.777 ms
  64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=7.069 ms
  ^C
  --- 8.8.8.8 ping statistics ---
  4 packets transmitted, 4 packets received, 0.0% packet loss
  round-trip min/avg/max/stddev = 6.735/6.971/7.303/0.231 ms

For comparison, here are pings to www.bbc.co.uk:

  $ ping 212.58.233.253
  PING 212.58.233.253 (212.58.233.253): 56 data bytes
  ^C
  --- 212.58.233.253 ping statistics ---
  6 packets transmitted, 0 packets received, 100.0% packet loss

  $ ping -S 192.168.11.250 212.58.233.253
  PING 212.58.233.253 (212.58.233.253) from 192.168.11.250: 56 data bytes
  64 bytes from 212.58.233.253: icmp_seq=0 ttl=55 time=7.155 ms
  64 bytes from 212.58.233.253: icmp_seq=1 ttl=55 time=6.673 ms
  64 bytes from 212.58.233.253: icmp_seq=2 ttl=55 time=6.705 ms
  ^C
  --- 212.58.233.253 ping statistics ---
  3 packets transmitted, 3 packets received, 0.0% packet loss
  round-trip min/avg/max/stddev = 6.673/6.844/7.155/0.220 ms

  $ ping -S 192.168.0.250 212.58.233.253
  PING 212.58.233.253 (212.58.233.253) from 192.168.0.250: 56 data bytes
  64 bytes from 212.58.233.253: icmp_seq=0 ttl=55 time=6.626 ms
  64 bytes from 212.58.233.253: icmp_seq=1 ttl=55 time=6.877 ms
  64 bytes from 212.58.233.253: icmp_seq=2 ttl=55 time=7.155 ms
  64 bytes from 212.58.233.253: icmp_seq=3 ttl=55 time=6.679 ms
  ^C
  --- 212.58.233.253 ping statistics ---
  4 packets transmitted, 4 packets received, 0.0% packet loss
  round-trip min/avg/max/stddev = 6.626/6.834/7.155/0.208 ms

AdSchellevis commented 1 year ago

ok, doesn't sound like the same problem to me.

AdSchellevis commented 1 year ago

I had a similar case with @mimugmail yesterday; I was able to reproduce it, and it seemed to be caused by default gateway switching. @fichtner added a patch in https://github.com/opnsense/core/commit/0e286b3a34cf366efa88a2616e1e1a0c22b8180c; if the issue returns after 23.1.1 we can always reopen this ticket.