opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Possible configd problem: Sections of config.xml related to NAT are not evaluated anymore #7562

Closed noseshimself closed 2 months ago

noseshimself commented 2 months ago


Describe the bug

After an involuntary update of the master node of an HA cluster from OPNsense 24.1.5_3-amd64 to OPNsense 24.1.9_4-amd64, all 1:1 NAT settings were gone on the running system (gone, as shown in the attached screenshot), although the relevant parts of the configuration file are still in the correct positions and completely intact:

   <onetoone>
      <external>217.7.50.198</external>
      <category/>
      <descr>nextcloud.gerstel.com</descr>
      <interface>lan</interface>
      <type>binat</type>
      <source>
        <address>192.168.111.8</address>
      </source>
      <destination>
        <any>1</any>
      </destination>
    </onetoone>

The slave system running OPNsense 24.1.5_3-amd64 is still working after the configuration was pushed over to it (see the attached screenshot), so I'm assuming that the syntax was still sufficiently correct even after the upgrade.

Tip: to validate your setup was working with the previous version, use opnsense-revert (https://docs.opnsense.org/manual/opnsense_tools.html#opnsense-revert)
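
For reference, a minimal sketch of what such a revert would look like, assuming the -r flag of opnsense-revert still selects the release as described in the linked documentation (verify the exact syntax against the docs before running):

# revert the core package to the 24.1.5 series mentioned above
opnsense-revert -r 24.1.5 opnsense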

As the bug is stopping production for a large number of internet-facing servers, we had to demote the master to slave and are now using the backup system as master via CARP.

With Python having been upgraded from 3.9 to 3.11 along the way, reverting seems to be an extremely impractical solution, too.

To Reproduce

Steps to reproduce the behavior:

  1. Set up (1:1?) NAT rules
  2. Verify them to be working
  3. Upgrade
  4. See error

Expected behavior

NAT rules are applied to the running system when it boots.

Describe alternatives you considered

Crying loudly. Then seeing that the rules were still available on the backup system, stopping the crying and failing over.

Relevant log files

The log file got considerably larger after the update but I can't seem to find anything relevant.

-rw-------   1 root  wheel   510405 Jun 27 01:33 configd_20240627.log
-rw-------   1 root  wheel  5386277 Jun 26 23:59 configd_20240626.log
-rw-------   1 root  wheel   264174 Jun 25 23:59 configd_20240625.log
-rw-------   1 root  wheel   263266 Jun 24 23:59 configd_20240624.log
-rw-------   1 root  wheel   264333 Jun 23 23:59 configd_20240623.log

Environment

Software version used and hardware type if relevant, e.g.:

User-Agent: Mozilla/5.0 (X11; CrOS x86_64 14541.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
FreeBSD: 13.2-RELEASE-p11 stable/24.1-n255023-99a14409566 SMP amd64
OPNsense: 24.1.9_4 908aac04e
Plugins: os-OPNProxy-1.0.5_1 os-OPNWAF-1.1 os-OPNcentral-1.7 os-dmidecode-1.1_1 os-dyndns-1.27_3 os-lldpd-1.1_2 os-maltrail-1.10 os-nextcloud-backup-1.0_1 os-redis-1.1_2 os-rfc2136-1.8_2 os-shadowsocks-1.1 os-squid-1.0_2 os-sunnyvalley-1.4_3 os-theme-cicada-1.35 os-theme-rebellion-1.8.10 os-theme-tukan-1.27_1 os-theme-vicuna-1.45_1 os-vnstat-1.3_1
Time: Thu, 27 Jun 2024 01:08:27 +0000
OpenSSL: 3.0.14
Python: 3.11.9
PHP: 8.2.20

fichtner commented 2 months ago

You need to upgrade the slave as well.

noseshimself commented 2 months ago

Sorry, no.

If you were right, turning off the slave and only running the updated master node should fix the problem.

  1. Turn off both routers.
  2. Turn on gw-ext-1.
  3. No 1:1 NAT.

fichtner commented 2 months ago

I don’t have your setup nor a way to support you through community support to assess your current config.xml state.

noseshimself commented 2 months ago

> I don’t have your setup nor a way to support you through community support to assess your current config.xml state.

I could of course send the current config.xml to you. I have a serious problem testing it myself, as I do not have an identical device I can take offline and test with the current configuration; otherwise I would already have done that for verification.

All I really need is someone to put that config.xml into a current OPNsense at factory defaults and see whether the 1:1 NAT mappings are there, to verify whether this is a configuration problem or a firmware problem.

Or tell me how I can opnsense-revert down to OPNsense 24.1.5_3-amd64 without the Python downgrade killing me on the way...

noseshimself commented 2 months ago

Thank you for getting me several steps ahead...

The issue title probably needs changing; it is a migration problem.

One part is negligible: the Shadowsocks migration is failing; I'll split that one off, see https://github.com/opnsense/core/issues/7578

The problem referred to here is logged as:

<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="14"] [OPNsense\Firewall\Filter:npt.rule.d8addb07-6908-4e73-84c2-3ff93be2af91.destination_net] Please specify a valid network segment or IP address.{2003:4e:6010::b:217.7.50.193/128}
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="15"] Model OPNsense\Firewall\Filter can't be saved, skip ( OPNsense\Base\ValidationException: [OPNsense\Firewall\Filter:npt.rule.d8addb07-6908-4e73-84c2-3ff93be2af91.destination_net] Please specify a valid network segment or IP address.{2003:4e:6010::b:217.7.50.193/128}
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="16"]  in /usr/local/opnsense/mvc/app/models/OPNsense/Base/BaseModel.php:649
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="17"] Stack trace:
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="18"] #0 /usr/local/opnsense/mvc/app/models/OPNsense/Base/BaseModel.php(774): OPNsense\Base\BaseModel->serializeToConfig()
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="19"] #1 /usr/local/opnsense/mvc/script/run_migrations.php(54): OPNsense\Base\BaseModel->runMigrations()
<147>1 2024-06-28T13:58:30+00:00 gw-ext-1.gerstel.com config 15771 - [meta sequenceId="20"] #2 {main} )

The config.xml has been cleaned; all lines with passwords are gone, so HA/CARP migration might fail. Don't worry, that part is working. Attachment: config.txt (sorry for naming it .txt; I needed to pass it through the security checkpoint and forgot to change it back).

AdSchellevis commented 2 months ago

@noseshimself if you remove the entry with 2003:4e:6010::b:217.7.50.193/128 the issue might be solved; it is indeed bad input data. At first glance I don't expect a bug here yet.

If you fetch the old overview page using:

curl -o /usr/local/www/firewall_nat_1to1.php https://raw.githubusercontent.com/opnsense/core/stable/23.7/src/www/firewall_nat_1to1.php

You should then be able to remove the item via the now-available firewall_nat_1to1.php page on the box; next, trigger the migration again.

Don't forget to remove the old file when you're done.
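
Put together, the suggested sequence looks roughly like this (a sketch; the curl command is the one given above, and the migration script path is the one used later in this thread):

# 1. fetch the legacy 1:1 NAT overview page from the stable/23.7 branch
curl -o /usr/local/www/firewall_nat_1to1.php https://raw.githubusercontent.com/opnsense/core/stable/23.7/src/www/firewall_nat_1to1.php

# 2. remove the offending 1:1 entry via the restored firewall_nat_1to1.php page in the GUI

# 3. re-run the model migrations
/usr/local/opnsense/mvc/script/run_migrations.php

# 4. clean up the legacy page again
rm /usr/local/www/firewall_nat_1to1.php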

noseshimself commented 2 months ago

The only occurrences of "2003:4e:6010" in config.xml are

      <gateway_item uuid="e7d69e13-3aac-4503-8f28-4a99bf68838e">
        <disabled>0</disabled>
        <name>GW_TBusinessConnect_IPv6</name>
        <descr>Default-Router im T-BusinessConnect (IPv6)</descr>
        <interface>lan</interface>
        <ipprotocol>inet6</ipprotocol>
        <gateway>2003:4e:6010::1</gateway>
        <defaultgw>1</defaultgw>
        <fargw>0</fargw>
        <monitor_disable>1</monitor_disable>
        <monitor_noroute>0</monitor_noroute>
        <monitor/>
        <force_down>0</force_down>
        <priority>255</priority>
        <weight>1</weight>
        <latencylow/>
        <latencyhigh/>
        <losslow/>
        <losshigh/>
        <interval/>
        <time_period/>
        <loss_interval/>
        <data_length/>
      </gateway_item>
    <LAN>
      <if>igb3</if>
      <descr>SYNC</descr>
      <enable>1</enable>
      <lock>1</lock>
      <spoofmac/>
      <ipaddr>192.168.2.1</ipaddr>
      <subnet>30</subnet>
    </LAN>
    <lan>
      <if>igb1_vlan11</if>
      <descr>TBusinessConnect</descr>
      <enable>1</enable>
      <lock>1</lock>
      <spoofmac/>
      <blockpriv>1</blockpriv>
      <blockbogons>1</blockbogons>
      <ipaddr>217.7.50.229</ipaddr>
      <subnet>29</subnet>
      <ipaddrv6>2003:4E:6010::B:217.7.50.229</ipaddrv6>
      <subnetv6>48</subnetv6>
      <gatewayv6>GW_TBusinessConnect_IPv6</gatewayv6>
    </lan>

And to be honest this is irritating me already, because I can't find any log entries by system administrators in the ticket system telling me where the static IPv6 address came from, why the interfaces are tagged as "LAN" and "lan", or why the sync connection is the one called "LAN"...

The slave router doesn't have this in its configuration.

There is nothing referring to it in the NAT section at all:

    <onetoone>
      <external>217.7.50.193</external>
      <descr>proxy.gerstel.com</descr>
      <interface>lan</interface>
      <type>binat</type>
      <source>
        <address>192.168.111.30</address>
      </source>
      <destination>
        <any>1</any>
      </destination>
    </onetoone>

and using the old version of the page does not show anything IPv6-related that could be fixed at all.

Monviech commented 2 months ago

Sorry for butting in, but theoretically this is a valid IPv6 address.

It is called an IPv6 address with an embedded IPv4 address: 2003:4e:6010::b:217.7.50.193

The last 32 bits are allowed to be written like this.

https://datatracker.ietf.org/doc/html/rfc4291#section-2.2 check the third example.
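
As a quick check of that notation (a sketch using the Python 3 interpreter that ships with OPNsense; the address is the one from the log above), the mixed form expands to a plain eight-group IPv6 address:

# should print 2003:004e:6010:0000:0000:000b:d907:32c1
python3 -c 'import ipaddress; print(ipaddress.IPv6Address("2003:4e:6010::b:217.7.50.193").exploded)'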

Edit: Oops this is about 1:1 NAT, sorry xD. Just realized.

noseshimself commented 2 months ago

You did not read the problem description: nobody set a static IPv6 address on that interface (I can't find any documentation related to it), and even if somebody had done so, there should not be an implicit, un-migratable NAT rule showing up in a section that did not contain any rules before the migration.

Besides that: this is the outward-facing interface, and it does not seem to be a good idea to add an IPv6 address there that was not assigned by the provider (and I know which IPv6 block is assigned there).

noseshimself commented 2 months ago

> @noseshimself if you remove the entry with 2003:4e:6010::b:217.7.50.193/128 the issue might be solved; it is indeed bad input data. At first glance I don't expect a bug here yet.

I was looking in the wrong place, but I guess this has to be added to https://github.com/opnsense/core/issues/7578 and I should have read the error message.

If I run the migration script on the configuration of the (still working) slave, something (the Shadowsocks migration?) is adding this:

    <npt>
      <category/>
      <descr>proxy.gerstel.com</descr>
      <interface>lan</interface>
      <source>
        <address>FC47:5253:544C::6F:192.168.111.30</address>
      </source>
      <destination>
        <address>2003:4E:6010::B:217.7.50.193</address>
      </destination>
    </npt>

and as we did not do anything with IPv6 there, I never expected rules to show up in that section. I just checked all configuration backups back to 2022 and found the first daily change where the IPv6 entries started showing up, but I can't ask the responsible admin anymore -- he left.

After removing the npt-related section from config.xml and rerunning the migrations:

root@gw-ext-1:/conf # /usr/local/opnsense/mvc/script/run_migrations.php
*** OPNsense\Shadowsocks\Local Migration failed, check log for details
Migrated OPNsense\Firewall\Filter from 0.0.0 to 1.0.4

If I reapply run_migrations.php to the config file from before the upgrade, I get the same error messages again, so the root cause of the problem is the npt entry, which never caused any problem before (e.g. no annoying messages like "hey, I was told to set up an IPv6 NAT rule for an address that is nowhere to be found").
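
For anyone retracing this, the repair cycle was roughly the following (a sketch; paths assume a stock install, and the name of the backup copy is arbitrary):

# keep a copy of the live configuration before hand-editing it
cp /conf/config.xml /conf/config.xml.before-npt-fix

# remove the offending <npt> block shown above with any editor
vi /conf/config.xml

# re-run the model migrations and check the output
/usr/local/opnsense/mvc/script/run_migrations.php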

AdSchellevis commented 2 months ago

@noseshimself case closed then?

noseshimself commented 2 months ago

I would say so, but someone has to find out where the npt entry came from. The routers in question were not doing anything with the IPv6 addresses I found.

AdSchellevis commented 2 months ago

Let's close this then; tracking the origins of local configuration changes is not something we can assist with in community time here.

noseshimself commented 2 months ago

(I'm digging through four years of nightly backups to find out how this npt mapping was created and why it is not on the slave -- should I find proof that it was automatically created after installing the exit part of shadowsocks on the gateway, I'll open a new issue.)
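
A sketch of one way to do that digging. The path below points at the local config history under /conf/backup/ and is an assumption; four years of nightly backups may well live somewhere else, but the same grep applies to any directory of plain config XML files:

# list backups that already contain an <npt> section, oldest first,
# to narrow down the day the mapping first appeared
grep -l '<npt>' /conf/backup/config-*.xml | sort | head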