sempervictus closed this issue 10 months ago.
@fichtner - thanks, sorry for the incorrect placement.
Spooky bug - guessing it's driver-level given the symptoms observed, but it might be something in the BSD networking stack itself (I'm no BSD/PF guru), some malformation between the virtual adapter and the packet-processing layer, or an interface problem of some sort. The fact that I can tcpdump the inbound packets makes it that much more confusing - clearly they're well-formed enough to be parsed and written to STDOUT. However, I'm not seeing any denies in the firewall log for these packets (or allow actions when logging is forced on for all rules), despite the adapter seeing them as L2 frames which tcpdump reassembles into L3+ packets without throwing errors.
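For reference, a minimal sketch of that kind of capture, assuming vtnet1 is the affected internal-facing adapter (the interface name here is an assumption):

# print link-level headers (-e), skip name resolution (-n), limit to ICMP
tcpdump -eni vtnet1 icmp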
Can you wait for the 13-based beta? It's only two more weeks, I think.
'Course - the only real breakage is that RADIUS can't work like this, since most of the traffic those servers handle is routed.
To make this even more messed up: we're seeing this on one of the older OpenStack instances, and on only one of the interfaces facing internal networks - all interfaces show ARP entries, but only one can actually communicate with its neighbors.
Another fun bit: the GUI-based "HW offload" selectors don't work, as the hw.vtnet sysctls are separate and need to be created manually in the Tunables section:
# sysctl -a | grep hw.vtnet
hw.vtnet.rx_process_limit: 512
hw.vtnet.mq_max_pairs: 8
hw.vtnet.mq_disable: 1
hw.vtnet.lro_disable: 1
hw.vtnet.tso_disable: 1
hw.vtnet.csum_disable: 1
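These are boot-time loader tunables rather than runtime sysctls, so they only take effect after a reboot - roughly what the Tunables entries amount to, shown here as a loader.conf-style sketch (the file path is illustrative; the GUI manages the actual file):

# boot-time loader tunables (e.g. /boot/loader.conf.local); reboot required
hw.vtnet.csum_disable="1"
hw.vtnet.tso_disable="1"
hw.vtnet.lro_disable="1"
hw.vtnet.mq_disable="1"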
Unfortunately, this too has no effect. One instance accepts local traffic but will not route/NAT it out; the other instance routes just fine but won't accept local traffic. There's no good way to change interface types on the fly in a cloud environment (at least in OpenStack). This might actually be a bigger problem than I thought... @fichtner - any thoughts on potential workarounds we could use while the FreeBSD 13 beta is cooked up?
Found some suggestions to boot the VM as a q35 instance, but that's not flying too well - bootloader failure, and multiboot can't find a usable volume (the nano image is being used in the private cloud - need to figure out how to build proper cloud images on ZFS instead), so I had to go back to the default pc machine type.
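In case anyone wants to try the same thing without hand-editing libvirt XML, the machine type can usually be requested per-image - a sketch assuming the libvirt driver honours the hw_machine_type image property (the image name is illustrative):

# ask Nova's libvirt driver to build q35 machines for instances booted from this image
openstack image set --property hw_machine_type=q35 opnsense-nano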
@AdSchellevis, @fichtner - I think I've found the upstream bug in FreeBSD. They talk about disabling csum, which I've done inside the firewalls themselves in various ways. However, the piece about doing it at the hypervisor level won't fly, even for us where we control the entire cloud. The way Nova works, it regenerates the libvirt XML for a domain (instance/machine/whatever) on an ongoing basis. This is also why swapping the NICs out for e1000s won't fly - unless there's a Nova driver interface to define those parameters, it'll discard them on XML regeneration. I can, for now, change the database-assigned NIC type so that Nova generates an e1000 the next time the instance boots, but these things sit atop an 80Gbit fabric with a raw 10Gbit WAN uplink, and an e1000 won't get anywhere near those speeds.
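For the record, a less invasive route than poking the Nova database would be an image property, assuming the deployment honours hw_vif_model for NIC model selection (image name illustrative):

# have Nova attach an e1000 instead of virtio to instances booted from this image
openstack image set --property hw_vif_model=e1000 opnsense-nano

Same 10Gbit caveat applies, of course - it just avoids the direct DB edit.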
I have not yet tracked down any patches for this, but I'm somewhat of an out-of-towner in BSD-world (pun intended), so it's kind of a slow process. If I do find any such patches, do those go here as PRs, or how is that handled?
@sempervictus if you find patches in any BSD that might be related, that would help. It's sort of tricky to "find" "the issue" in the code without any reference to the code itself. The workarounds are probably just that. I wouldn't be surprised if a larger blob of code was missing in FreeBSD to glue this in correctly.
@fichtner: looks like the betas are up now - what's the process for testing off the beta/development channel on a system deployed from community?
Change to the development channel, then check and install. Check and install again after the major upgrade. It will keep asking you to do a major upgrade, but you only need to do it once.
I can confirm a similar issue with OPNsense 22.7.9_3-amd64, based on FreeBSD 13.1-RELEASE-p5, using QEMU with the q35 machine type and a virtio NIC. Everything works fine with the QEMU i440FX machine type.
Steps to reproduce:
Closing stale issue. Not much we can do.
Describe the bug
Traffic routed over an OPNsense instance to a Windows VM from a remote subnet works correctly (and vice versa); traffic between OPNsense and the Windows instance in its own subnet, however, does not. Pinging from the Windows host shows the inbound ICMP packets on tcpdump inside the firewall - the traffic makes it to the interface but there is no response. ARP tables are updated (obviously, or routing would fail), just no L3 response whatsoever. The NIC type on both VMs is virtio and the underlying host kernels are 5.10. We're seeing this inside OpenStack, even with all port-security functions disabled (if port security were the issue, we wouldn't be seeing those inbound packets). This has apparently been going on for some time, probably since around the 21.7 release. Changing hardware offload options does nothing.

To Reproduce
Steps to reproduce the behavior:
Expected behavior
L3 connectivity works within the local subnet
Environment
Software version used and hardware type if relevant, e.g.:
OPNsense 21.7.4-amd64
FreeBSD 12.1-RELEASE-p20-HBSD
OpenSSL 1.1.1l 24 Aug 2021
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (2 cores)
VirtIO NIC (host kernel 5.10.x)