opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Silent config rollbacks #6565

Closed Daggolin closed 1 year ago

Daggolin commented 1 year ago

Describe the bug

Our OPNsense instance silently reverted to an old config from several months ago.

What happened (username and hostname changed; ids removed)

In order to add a new rule for a device on the network I first added a new alias (Firewall > Aliases) by cloning an existing alias and pressing the "Apply" button below. When trying to create the actual firewall rule (Firewall > Rules > Interface) the alias did not show up. As I was certain I had created the alias, I checked the config history to see what actually happened. To my surprise the latest change, attributed to daggolin@172.18.0.220: /api/firewall/alias/set, had deleted several certificates, users and DHCP entries, changed gateways, etc.

After comparing the new config with a few backups it was clear that the config of the system had been reverted to a backup from early February.

The audit log on our syslog server showed the following entries for the day:

May 12 12:07:21 fw01 audit[78246]: user daggolin authenticated successfully for WebGui [using OPNsense\Auth\Services\WebGui + OPNsense\Auth\Local]
May 12 12:07:21 fw01 audit[78246]: /index.php: Successful login for user 'daggolin' from: 172.18.0.220
May 12 12:17:05 fw01 audit[35697]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683886625.1227.xml in /api/firewall/alias/setItem/<item id>
May 12 12:17:09 fw01 audit[78246]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683886629.2737.xml in /api/firewall/alias/set
May 12 12:34:12 fw01 audit[35697]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683887652.5442.xml in /vpn_ipsec_phase2.php?ikeid=2 [/vpn_ipsec_phase2.php made changes]
May 12 12:34:38 fw01 audit[35697]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683887678.3447.xml in /vpn_ipsec_phase2.php?dup=<dup> [/vpn_ipsec_phase2.php made changes]
May 12 12:39:01 fw01 audit[26141]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683887941.0836.xml in /vpn_ipsec_phase1.php?p1index=1 [/vpn_ipsec_phase1.php made changes]
May 12 12:40:31 fw01 audit[26979]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683888031.6838.xml in /vpn_ipsec_phase1.php?p1index=1 [/vpn_ipsec_phase1.php made changes]
May 12 12:41:29 fw01 audit[63509]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683888089.22.xml in /api/firewall/alias/addItem/
May 12 12:41:29 fw01 audit[35610]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683888089.3336.xml in /api/firewall/alias/addItem/
May 12 12:41:36 fw01 audit[26979]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683888096.1641.xml in /api/firewall/alias/set
May 12 12:43:03 fw01 audit[26141]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683888183.6452.xml in /api/firewall/alias/addItem/
May 12 12:43:11 fw01 audit[91444]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683888191.1208.xml in /api/firewall/alias/set
May 12 13:31:44 fw01 audit[91444]: user daggolin@172.18.0.220 changed configuration to /conf/backup/config-1683891103.9442.xml in /diag_backup.php [/diag_backup.php made changes]
May 12 13:36:06 fw01 audit[27529]: user daggolin authenticated successfully for WebGui [using OPNsense\Auth\Services\WebGui + OPNsense\Auth\Local]
May 12 13:36:06 fw01 audit[27529]: /index.php: Successful login for user 'daggolin' from: 172.18.0.220

The entries sticking out are the two at May 12 12:41:29, because they indicate the attempt to clone the alias was triggered twice. The config history in the WebGui indicates the revert happened at May 12 12:41:36.

Checking the current source of the repository I have formed a working theory on what may have happened. Within Config::init (https://github.com/opnsense/core/blob/c568629c01cb046d0905b26d3b3de1e3293b1b4b/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L310-L344) the Config class tries to revert to the previous config if loading of the current one fails. If loading the reverted config fails as well, it tries again and again until it has run out of configs and attempts a factory reset.
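
For reference, a simplified sketch of that fallback loop as I read it (the helper names are made up for illustration; the real logic lives in Config::init and Config::restoreBackup):

<?php
    // Simplified sketch (not the actual Config.php code) of the fallback
    // behaviour described above: if config.xml does not parse, walk the
    // backups from newest to oldest and keep the first one that does.
    function loadConfig($path)
    {
        // simplexml_load_file() returns false on parse errors
        $xml = @simplexml_load_file($path);
        if ($xml === false) {
            throw new \Exception("invalid config: " . $path);
        }
        return $xml;
    }

    function initConfig($configPath, $backups)
    {
        try {
            loadConfig($configPath);
            return;
        } catch (\Exception $e) {
            syslog(LOG_ERR, "No valid config.xml found, attempting last known config restore.");
        }

        // walk the backups from newest to oldest until one of them parses
        foreach ($backups as $backup) {
            try {
                copy($backup, $configPath);
                loadConfig($configPath);
                return; // an older config is now silently active
            } catch (\Exception $e) {
                syslog(LOG_ERR, "failed restoring " . $backup);
            }
        }

        // out of backups: the real code would now fall back to factory defaults
    }
?>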

Searching for No valid config.xml on our syslog server yields the following results:

May 12 12:41:29 fw01 config[63509]: [2023-05-12T12:41:29+02:00][ERROR] No valid config.xml found, attempting last known config restore.
May 12 12:41:29 fw01 config[26979]: [2023-05-12T12:41:29+02:00][ERROR] No valid config.xml found, attempting last known config restore.
May 12 12:41:29 fw01 config[29599]: [2023-05-12T12:41:29+02:00][ERROR] No valid config.xml found, attempting last known config restore.

The following entries for the same time show that the system did indeed try to restore backup after backup:

May 12 12:41:29 fw01 config[63509]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683888089.3336.xml
May 12 12:41:29 fw01 config[26979]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683888089.3336.xml
May 12 12:41:29 fw01 config[63509]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683888089.22.xml
May 12 12:41:29 fw01 config[26979]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683888089.22.xml
May 12 12:41:29 fw01 config[63509]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683888031.6838.xml
May 12 12:41:29 fw01 config[26979]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683888031.6838.xml
May 12 12:41:29 fw01 config[63509]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683887941.0836.xml
May 12 12:41:29 fw01 config[26979]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683887941.0836.xml
May 12 12:41:29 fw01 config[63509]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683887678.3447.xml
May 12 12:41:29 fw01 config[26979]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683887678.3447.xml
May 12 12:41:29 fw01 config[63509]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683887652.5442.xml
May 12 12:41:29 fw01 config[26979]: [2023-05-12T12:41:29+02:00][ERROR] failed restoring /conf/backup/config-1683887652.5442.xml
[..]

The time matches that of the duplicated alias/addItem call when creating the alias. The WebGui probably attributes all the changes to alias/set, because it was the first change to occur after the rollback. After all, the restore action itself doesn't create a new config that would show up in the history.

It looks to me as if the Config::load function is missing a shared lock when reading the config, leading Config::init to incorrectly conclude that the current config is broken and a rollback is required. The duplicated call to alias/addItem may have caused this in our case.

What doesn't quite add up is that our logs show two processes simultaneously attempting to restore the config, with a third one chiming in a bit later within the same second. However, as Config::init is the function performing the restore attempts, it seems likely that once the first process is held up, the chance of multiple processes running into this at the same time is high.

Searching the archive of our syslog server we found a few other instances of OPNsense trying to restore an old config in 2022, with only one or two failed restore attempts, which caused it to fly under our radar.

Expected behavior

We expected 3 things to behave differently:

To Reproduce

We could artificially reproduce the same bug through the WebGui on another instance. The rollback failed only once or twice, not 70 times in a row, but the logs showed the same behavior as the rollback above.

Steps to reproduce the behavior:

  1. Go to Firewall > Aliases
  2. Open the menu to add, edit or clone an alias
  3. Fill the form with valid data
  4. Select the save button (using TAB)
  5. Press enter a couple of times as fast as possible

Describe alternatives you considered

In our case the system went back by about 70 backups. With a smaller number of backups stored on the system (like the default setting of 10) the restore attempts, as well as the factory reset, would probably have failed and we might not even have noticed. As this is a timing issue, reducing the number of backups may generally decrease the chance of a big rollback like this occurring, but there is always the risk of randomly hitting a factory reset.

Environment

Versions:

CPU type:

AdSchellevis commented 1 year ago

I've never seen this before unless very weird things happen with the machine (disk full, data corruption, ..), but it might make sense to log the actual error in the try {} catch (\Exception $e) block to know why it can't do this.
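
Roughly something like this (just a sketch; the exact code around the try/catch in Config::init and the logging call may differ):

try {
    $this->load();
} catch (\Exception $e) {
    // include the underlying parse/read error instead of only the generic message
    syslog(LOG_ERR, sprintf(
        "No valid config.xml found (%s), attempting last known config restore.",
        $e->getMessage()
    ));
    // ... the existing fallback via $this->getBackups() continues here
}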

The rationale for falling back is quite simple: without a config, nothing will work, which would also prevent an administrator from making any other choices. In most cases the situation before the change (which should be the first record in $this->getBackups()) is the most sensible option.

If you can trigger this quite easily, I can offer you a patch which logs the actual error. Furthermore, the contents of /conf/backups (why does it fall back to February?) and the output of df would be helpful.

Daggolin commented 1 year ago

I do not believe this to be an issue on our system, because we can reproduce this behavior on multiple systems using the steps provided in the original post.

I cannot offer you the contents of our backups, because they contain sensitive information, but their content should not matter anyway. We have several hundred backups in /conf/backups and the config files that failed to restore according to the logfile are all valid config files that can be restored manually. It's not the config at fault, it's the code.

Here is the output of df. As you can see, the drive is not full:

$ df
Filesystem               1K-blocks    Used     Avail Capacity  Mounted on
/dev/gpt/rootfs          227374316 8130664 201053708     4%    /
devfs                            1       1         0   100%    /dev
devfs                            1       1         0   100%    /var/dhcpd/dev
devfs                            1       1         0   100%    /var/unbound/dev
devfs                            1       1         0   100%    /var/unbound/dev
/usr/local/lib/python3.9 227374316 8130664 201053708     4%    /var/unbound/usr/local/lib/python3.9

As I previously wrote, I believe the issue to be a missing lock when reading the config. The Config::save method locks the config file for writing (exclusive lock) and truncates it (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L672-L674). It then writes the string representation of the current config into the file (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L675), which implicitly calls the magic method Config::__toString that crafts the XML representation of the config before writing it to the file. During the execution of the __toString method the config file on disk is empty. Depending on the size of the config string and the inner workings of the fwrite method the string may, at times, be partially written to the file. After all is done, the Config::save method releases the lock again (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L695). Between the locking and unlocking of the file the state of the file is undefined. No other process may read the file while it's locked or the outcome of the read operation is undefined.

The Config::load method opens the config file using fopen (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L403-L410) and passes the obtained stream to Config::loadFromStream, which uses fseek to get to the beginning of the stream and stream_get_contents to read it (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L366-L367). Neither Config::load nor Config::loadFromStream try to obtain a lock for reading (shared lock). As they're not locking the file they ignore the existing exclusive lock of a concurrent Config::save call, which means the attempt to load the file would yield an empty or incomplete config. As this is going to throw an exception the Config::init method is going to catch the Exception and try to restore a previous config (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L315-L343). This attempt to restore the config calls Config::restoreBackup, which then tries to restore the given config backup by calling copy on the backup path and the config path (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L605). My tests indicate that the copy method disregards the existing lock of Config::save and overwrites the config. When that happens the existing handle in Config::save remains valid, causing parts of the config to be overwritten whenever fwrite flushes to the file, leading to a broken config file. When Config::restoreBackup calls Config::load it is possible for it to fail again due to the possibility of a partially overwritten config (https://github.com/opnsense/core/blob/bde52467de6f0e777a207d68a5a4f4eb20824d05/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L606). As the content of the config depends on timing this can lead to a race for the config that possibly cycles through many old configs in the process.
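
To illustrate what a shared lock on the read path would change, here is a minimal standalone sketch (simplified, not the actual Config::loadFromStream, which keeps the handle open for later use):

<?php
    // Sketch: read config.xml under a shared lock so the read blocks until a
    // concurrent writer holding LOCK_EX (like Config::save) has finished.
    $config_file = "/conf/config.xml";
    $handle = fopen($config_file, "r");

    if ($handle !== false && flock($handle, LOCK_SH)) {
        // we only get here once no exclusive lock is held anymore
        fseek($handle, 0);
        $contents = stream_get_contents($handle);

        flock($handle, LOCK_UN);
        fclose($handle);

        if ($contents === false || strlen($contents) === 0) {
            // with the lock in place this should no longer happen mid-save
            echo("Error: empty or unreadable config\n");
        } else {
            echo("Read " . strlen($contents) . " bytes of config\n");
        }
    } else {
        echo("Error: failed to open or lock config\n");
    }
?>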

Here is a simple script test1.php that opens the config, locks the file and truncates it, similar to Config::save. It then sleeps for 30 seconds before restoring the contents of the config. If an instance of the WebGui is opened during the 30 seconds sleep the config is going to get restored by Config::init, disregarding the lock. This shows that Config::load doesn't obtain a shared lock nor does copy respect the existing exclusive lock, leading to a risky race condition:

<?php
    // Open the config read- and writeable, like `Config::load`
    $config_file = "/conf/config.xml";
    $config_file_handle = fopen($config_file, "r+");

    // Debug output
    echo("Got config handle\n");

    // Obtain an exclusive lock, like `Config::save`
    if (flock($config_file_handle, LOCK_EX)) {

        // Debug output
        echo("Obtained lock\n");

        // Seek the beginning
        fseek($config_file_handle, 0);

        // Read the contents of the config, because we want to restore it afterwards
        $contents = stream_get_contents($config_file_handle);

        // Debug output
        echo("Read current config\n");

        // Seek the beginning and truncate the file, like `Config::save`
        fseek($config_file_handle, 0);
        ftruncate($config_file_handle, 0);

        // Debug output
        echo("Erased current config, sleeping for 30 seconds\n");

        // Sleep for 30 seconds. Pretend the conversion of `Config::__toString` takes a bit longer. If the WebGui is opened during this time `Config::init` is going to restore the latest config (successfully, disregarding the lock)
        sleep(30);

        // Write the config contents back to the file
        fwrite($config_file_handle, $contents);

        // Flush and unlock like `Config::save`
        fflush($config_file_handle);
        flock($config_file_handle, LOCK_UN);

        // Debug output
        echo("Restored old config and released lock\n");
    } else {
        // Print error
        echo("Error: Failed to lock config\n");
    }

    // Close config
    fclose($config_file_handle);

    // Debug output
    echo("Closed file\n");
?>

Furthermore the Lobby > Dashboard interacts with the backend in the background. If any user has an open browser tab of the dashboard the config gets restored within a second after the script truncates it. If multiple instances of the dashboard are already open when the script is run the log usually shows that multiple processes restore the config. This implies that even the restore attempts may find themselves in a race.

As copy does not respect the existing lock of Config::save, Config::restoreBackup should obtain an exclusive lock as well.
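
A minimal sketch of what I mean, writing the backup through a locked handle instead of using copy (simplified; the function name and structure are illustrative, not the actual Config::restoreBackup, which also reloads the config afterwards):

<?php
    // Sketch: restore a backup while holding an exclusive lock on config.xml,
    // so a concurrent Config::save (or a second restore) cannot interleave its
    // writes. copy() bypasses flock(), so we write through the locked handle.
    function restoreBackupLocked($backup, $config_file)
    {
        $contents = file_get_contents($backup);
        if ($contents === false) {
            return false;
        }

        // "c+" opens for read/write without truncating, creating if missing
        $handle = fopen($config_file, "c+");
        if ($handle === false) {
            return false;
        }

        flock($handle, LOCK_EX);   // wait until no other process holds a lock
        ftruncate($handle, 0);
        fseek($handle, 0);
        fwrite($handle, $contents);
        fflush($handle);
        flock($handle, LOCK_UN);
        fclose($handle);

        return true;
    }

    // Example call (hypothetical backup name):
    // restoreBackupLocked("/conf/backup/config-1683888089.3336.xml", "/conf/config.xml");
?>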

The rationale for falling back is quite simple: without a config, nothing will work, which would also prevent an administrator from making any other choices. In most cases the situation before the change (which should be the first record in $this->getBackups()) is the most sensible option.

I understand the idea and I agree that - in most cases - the latest backup would be the best option. However, what the current code does is not just that: if restoring the config fails for some reason it tries the next older one until it succeeds. While "looking for the latest restorable config" may sound good initially, there are possible issues to consider. For instance, if something did indeed corrupt the config and multiple backups, how can you trust that the restored backup isn't corrupted in another way that causes issues but doesn't fail to parse? Or if the config couldn't be written because the disk was full, cycling through the backups until one of them is small enough doesn't necessarily mean an admin can access the system afterwards, but it might mean that someone else - who shouldn't be able to - can. If the restore attempt goes further back than one config backup, the outcome is technically outdated and may pose a security risk. Depending on the deployment, a shutdown of the system may be preferable to "something works" and "internal data is publicly accessible". Knowing of the "try all backups until one works" behavior, we can counteract it with a cron job that deletes all outdated config backups, making this less relevant.

Either way, the system should inform admins about the config restore and require some kind of acknowledgement before any other actions can be performed. The currently silent rollbacks that can occur due to race conditions make it hard to trust the system, because you never know for sure if the system is still running the way you configured it or if the system silently reverted to an old config.

AdSchellevis commented 1 year ago

@Daggolin would it be possible to also offer a listing of the files in the backup (ls -alsh /conf/backup/)? (I'm not interested in the content, just want to check if we're traversing to the correct backup and how large the files are, ergo whether we're looking at two issues or one).

Although usually the truncate() and write actions are close together, it does sound reasonable that this is indeed the reason for your issue. We need to check the failsafe procedure, but we certainly shouldn't replace it. When the configuration file is damaged during boot (ufs is notorious for damaging files), there's just not much most people can do and the last known working copy usually is a safe fallback.

In the long run it might be an option to issue a system status message for cases like these, but that's not very high on my todo list to be honest.

AdSchellevis commented 1 year ago

@Daggolin can you try https://github.com/opnsense/core/commit/bebf3a2a7c2de1eb102ff41aefc401d8e999e52f, it should address the accidental restore issue.

Daggolin commented 1 year ago

@Daggolin would it be possible to also offer a listing of the files in the backup (ls -alsh /conf/backup/)? (I'm not interested in the content, just want to check if we're traversing to the correct backup and how large the files are, ergo whether we're looking at two issues or one).

Yes, I can grep the relevant entries. If you compare the listing for May to the logfile from the original post you can see that it properly started with the later of the two files for 1683888089 and then went backwards in the correct order. The 3 configs after config-1683888089.3336.xml are smaller in size, because the system restored the February config. After that we restored a backup from earlier that day, returning the file size to 576K.

$ ls -alsh /conf/backup/ | grep May | grep -v 2022
 56 drwxr-xr-x  2 root  wheel    55K May 17 12:18 .
  4 drwxr-xr-x  4 root  wheel   512B May 12 13:32 ..
608 -rw-r--r--  1 root  wheel   564K May  4 12:54 config-1683197648.5739.xml
608 -rw-r--r--  1 root  wheel   564K May  4 13:24 config-1683199488.152.xml
608 -rw-r--r--  1 root  wheel   574K May  7 10:09 config-1683446988.5867.xml
608 -rw-r--r--  1 root  wheel   574K May  7 10:10 config-1683447024.2656.xml
608 -rw-r-----  1 root  wheel   574K May  8 20:34 config-1683570880.0428.xml
608 -rw-r-----  1 root  wheel   576K May  8 20:36 config-1683570984.0718.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:36 config-1683571015.0515.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:37 config-1683571040.461.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:38 config-1683571136.665.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:38 config-1683571138.9367.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:39 config-1683571164.9701.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:39 config-1683571187.9101.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:49 config-1683571789.4143.xml
640 -rw-r-----  1 root  wheel   576K May  8 20:49 config-1683571795.8813.xml
640 -rw-r-----  1 root  wheel   576K May  8 21:04 config-1683572643.3956.xml
640 -rw-r-----  1 root  wheel   576K May  8 21:04 config-1683572650.8864.xml
640 -rw-r-----  1 root  wheel   576K May 12 12:17 config-1683886625.1227.xml
640 -rw-r-----  1 root  wheel   576K May 12 12:17 config-1683886629.2737.xml
640 -rw-r-----  1 root  wheel   577K May 12 12:34 config-1683887652.5442.xml
640 -rw-r-----  1 root  wheel   578K May 12 12:34 config-1683887678.3447.xml
640 -rw-r-----  1 root  wheel   578K May 12 12:39 config-1683887941.0836.xml
640 -rw-r-----  1 root  wheel   578K May 12 12:40 config-1683888031.6838.xml
640 -rw-r-----  1 root  wheel   578K May 12 12:41 config-1683888089.22.xml
640 -rw-r-----  1 root  wheel   578K May 12 12:41 config-1683888089.3336.xml
576 -rw-r-----  1 root  wheel   538K May 12 12:41 config-1683888096.1641.xml
576 -rw-r-----  1 root  wheel   538K May 12 12:43 config-1683888183.6452.xml
576 -rw-r-----  1 root  wheel   538K May 12 12:43 config-1683888191.1208.xml
640 -rw-r-----  1 root  wheel   576K May 12 13:31 config-1683891103.9442.xml
640 -rw-r--r--  1 root  wheel   576K May 15 10:46 config-1684140406.7405.xml
640 -rw-r--r--  1 root  wheel   576K May 15 11:16 config-1684142167.9025.xml
640 -rw-r--r--  1 root  wheel   576K May 16 07:46 config-1684215970.6892.xml
640 -rw-r--r--  1 root  wheel   576K May 16 08:14 config-1684217695.6682.xml
640 -rw-r--r--  1 root  wheel   577K May 17 11:59 config-1684317562.2877.xml
640 -rw-r--r--  1 root  wheel   578K May 17 12:00 config-1684317613.7717.xml
640 -rw-r--r--  1 root  wheel   578K May 17 12:04 config-1684317862.9582.xml
640 -rw-r--r--  1 root  wheel   578K May 17 12:04 config-1684317892.546.xml
640 -rw-r--r--  1 root  wheel   578K May 17 12:11 config-1684318299.7091.xml
640 -rw-r--r--  1 root  wheel   579K May 17 12:14 config-1684318475.0025.xml
640 -rw-r--r--  1 root  wheel   579K May 17 12:14 config-1684318490.1859.xml
640 -rw-r--r--  1 root  wheel   580K May 17 12:18 config-1684318707.1976.xml

@Daggolin can you try bebf3a2, it should address the accidental restore issue.

I don't have the toolchain setup to build the complete OPNsense at that revision, but I manually copied the Config.php from it onto our test system and the race condition when reading the config seems to have been resolved. There is however still a race when restoring the config. If you adjust the test script I provided before to write a faulty config file instead of restoring the config before releasing the lock you can still witness multiple processes performing the restore action. An exclusive lock for Config::restoreBackup would help with that.

Could you elaborate on why you are first trying to obtain a shared lock in a non-blocking way and then use an exclusive lock in Config::loadFromStream (https://github.com/opnsense/core/blob/bebf3a2a7c2de1eb102ff41aefc401d8e999e52f/src/opnsense/mvc/app/library/OPNsense/Core/Config.php#L371)? It looks like you first try to obtain a shared lock with a non-blocking call and continue to read no matter whether it succeeded or not, because due to Config::save first truncating the file it's unlikely that you're going to successfully read a partially written file when the lock failed. I think I get the idea, but this first iteration is almost like not having a lock at all, and in the end the outcome still depends on operating system and file system behavior and should be avoided. If reading of the config file fails in the first iteration you try to obtain an exclusive lock with a blocking call instead of waiting for a shared lock with a blocking call. An exclusive lock can only be obtained if there are no exclusive and no shared locks active. A shared lock can only be obtained if there are no exclusive locks active. As you're waiting for Config::save to release its exclusive lock, the condition under which either of the locks can be acquired here should be the same (Config::save releasing its lock), and a shared lock would allow multiple waiting processes to read the config simultaneously. By using an exclusive lock you artificially force them to read the config one after another. Are there quirks in the flock implementation of PHP or on FreeBSD that require this? If not I would recommend just using a single blocking call to flock for a shared lock. The non-blocking (and thus possibly non-locking) call to flock in the first iteration might save a few milliseconds of response time if something calls Config::init in just the right moment during a Config::save in another process, but what you're getting from that is a config that you technically already know is outdated (because the flock call failed), making the risk greater than the gain.
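
To spell out the three variants I am comparing, an illustrative sketch (standalone, not the patched Config.php code):

<?php
    // Illustration of the flock() variants discussed above.
    $handle = fopen("/conf/config.xml", "r");

    // 1) Non-blocking shared lock: returns immediately. If Config::save holds
    //    LOCK_EX in another process, $locked is false and a subsequent read
    //    races with the truncate/write.
    $locked = flock($handle, LOCK_SH | LOCK_NB);
    echo($locked ? "got shared lock\n" : "config is being written right now\n");

    // 2) Blocking shared lock: waits until the writer releases LOCK_EX, then
    //    reads a complete file; any number of readers can hold LOCK_SH at once.
    flock($handle, LOCK_SH);

    // 3) Blocking exclusive lock (the current second attempt in the patch):
    //    also waits for the writer, but forces waiting readers to go one by one.
    flock($handle, LOCK_EX);

    flock($handle, LOCK_UN);
    fclose($handle);
?>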

AdSchellevis commented 1 year ago

I don't have the toolchain setup to build the complete OPNsense at that revision, but I manually copied the Config.php from it onto our test system and the race condition when reading the config seems to have been resolved. There is however still a race when restoring the config. If you adjust the test script I provided before to write a faulty config file instead of restoring the config before releasing the lock you can still witness multiple processes performing the restore action. An exclusive lock for Config::restoreBackup would help with that.

That's true, but unlikely to happen from our codebase; locking the restore operation is a good idea anyway to prevent multiple executions.

Could you elaborate on why you are first trying to obtain a shared lock in a non-blocking way and then use an exclusive lock in Config::loadFromStream ()...

That's quite simple: the file is often locked longer than the write operation takes, and waiting for each lock would prevent parallel tasks from accessing the file while a frontend process locks the config. The truncate and write operations usually happen quite fast on modern systems, which is likely why we never hit this before ourselves.

... If reading of the config file fails in the first iteration you try to obtain an exclusive lock with a blocking call instead of waiting for a shared lock with a blocking call. An exclusive lock can only be obtained if there are no exclusive and no shared locks active. A shared lock can only be obtained if there are no exclusive locks active. .....

I was doubting a bit yesterday so I took the hammer, but I think you're right, the second attempt could likely use a LOCK_SH instead of an exclusive lock.

...The non-blocking (and thus possibly non-locking) call to flock in the first iteration might save a few milliseconds of response time if something calls Config::init in just the right moment during a Config::save in another process, but what you're getting from that is a config that you technically already know is outdated (because the flock call failed), making the risk greater than the gain....

I don't agree there, as locks are allowed to take (much) longer; as an example, take a look at an alias import:

https://github.com/opnsense/core/blob/fc8fb4e289ec140d7368a573d572a892c8e3be05/src/opnsense/mvc/app/controllers/OPNsense/Firewall/Api/AliasController.php#L401-L442

In the meantime other unrelated components can safely read their data, which hasn't been changed by the alias module.

Daggolin commented 1 year ago

I didn't realise the caller may already hold an exclusive lock when saving the file. This of course increases the amount of time saved when loading a page, but at the same time it - again - introduces a form of race. Let's take the AliasController::importAction you linked as an example. The moment the exclusive lock is acquired we know that the current config is no longer up to date, because (unless an error occurs during validation) the imported aliases are going to be written to the config. The config file may still be valid on disk during this preprocessing, but if you permit another process to read it you're permitting it to read outdated data. If that process, in turn, wants to write a change to the config it potentially has the previous state loaded and is going to undo the changes of the first process unless there is some code in place to prevent this. For that reason I think a single blocking call for a shared lock is the safest way to go, we are talking about the system config after all.

AdSchellevis commented 1 year ago

...For that reason I think a single blocking call for a shared lock is the safest way to go, we are talking about the system config after all.

Let's agree to disagree on this. Yes, the data may be tainted, but given the number of components utilizing the same file, the chances a reader can use the current data without issues are quite large. Only another session accessing the exact same object[s] would be at risk, which would also be the case if the user opens a page, waits, and another user changes the same data in between, by the way (last one wins). (MVC components usually acquire a lock before reading to prevent races anyway.)

A single shared lock will cause the gui to be sloppy in quite some scenarios, leading to more unneeded support. Let's not forget the current approach has been in there for almost a decade; concurrency can't be perfect due to the nature of how data is stored, but it's consistent within reasonable limits.

I'm not sure why the previous issue happens very often in your scenario, probably different usage than most installations (a lot of api calls perhaps).

So let's wrap this up: I'm going to replace the exclusive lock with a shared (blocking) lock and have a look at the restoreBackup procedure to see if we can add a lock. Given the constraints available, that is currently the best we can offer. Writing a single (large) file will always have disadvantages.

Daggolin commented 1 year ago

A single shared lock will cause the gui to be sloppy in quite some scenarios, leading to more unneeded support. Let's not forget the current approach has been in there for almost a decade; concurrency can't be perfect due to the nature of how data is stored, but it's consistent within reasonable limits.

If exclusive locks are only obtained when making / writing changes to the file, it shouldn't cause the gui to be sloppy unless someone else is applying changes at that very moment. As I have only read the config related code so far, I may not have the full picture here though.

I'm not sure why the previous issue happens very often in your scenario, probably different usage than most installations (a lot of api calls perhaps).

I don't think our usage is very different from most installations. We have 3 uplinks connected via Ethernet (1x PPPoE; 1x public ip as static interface configuration; 1x private ip as static interface configuration). The internal networks are set up as VLANs that connect through an SFP+ port. There are about half a dozen IPsec tunnels and two OpenVPN clients configured. Logs are sent to a separate syslog server. No API use (other than whatever the system might automatically do on every installation). No cron jobs of our own. No plugins installed, not even a theme.

I think the original issue occurs more frequently in the wild than it might seem. The syslog entries for last year indicate it usually just went back by one or two configs. Considering the latest config backup is usually an exact copy of the current config.xml, a rollback of one backup doesn't actually cause a change and the rollback of two is hardly noticeable.

Something I mentioned in an earlier message was the fact that the dashboard automatically causes config reads in the background. This was again confirmed when I tested the Config.php from bebf3a2. The multiple calls to Config::restoreBackup when releasing the lock in my test script (the modified version that keeps the config empty) were caused by simply having 3 tabs with the dashboard opened on another screen when running the test script. Thinking back to last week's incident I believe I had tried to open tabs for Firewall and VPN > IPsec by pressing the middle mouse button on them. As those buttons are not links to actual pages, but rather the buttons to open a category, pressing the middle mouse button would've opened a copy of the current page at the time (Lobby > Dashboard in my case). Thus I believe I had at least two instances of the dashboard running in the background, and the duplicated triggering of the alias cloning led to the horrible timing that iterated over roughly 70 configs.

@fichtner, if I am not mistaken you handle the releases of OPNsense. As you added this issue to the 23.7 milestone, may I ask if there are any plans to backport the fix to the current 23.1 branch? I realise the devastating effect of this race condition may not have been consciously witnessed and reported by other users in years, but the possibility of the system silently reverting to an old config anytime a user makes a change seems like a critical issue that should be fixed in the next minor release, even if the chance of the silent rollback happening may seem slim. It has happened in the wild at least once now.

fichtner commented 1 year ago

@Daggolin We don't have 23.1.x milestones so the nearest one is 23.7 as tagged. But most fixes are eligible for 23.1.x backports and this is no exception. The standard policy, however, is to give development version some slack skipping a minor so now we are on 23.1.7 and 23.1.9 is the next best release for this change unless there is reason to rush it (it doesn't appear so, but @AdSchellevis can override if he concludes the change is harmless/better to have sooner than later).

AdSchellevis commented 1 year ago

...I think the original issue occurs more frequently in the wild than it might seem. The syslog entries for last year indicate it usually just went back by one or two configs. Considering the latest config backup is usually an exact copy of the current config.xml, a rollback of one backup doesn't actually cause a change and the rollback of two is hardly noticeable...

You might be right, just didn't notice it myself, but that doesn't mean it didn't happen. I certainly do want to thank you for tracking this down.

...The standard policy, however, is to give development version some slack skipping a minor so now we are on 23.1.7 and 23.1.9 is the next best release for this change unless there is reason to rush it...

To be honest, I'm doubting a bit; knowing you could wreck a config by forcing enough reads kind of pleads for 23.1.8, but it also has been in here for quite a while without noticeable issues. @fichtner let's put this on our discussion list to weigh the pros and cons.