linux-system-roles / firewall

Configure firewalld and system-config-firewall
https://linux-system-roles.github.io/firewall/
GNU General Public License v2.0

fix: reload on resetting to defaults #159

Closed BrennanPaciorek closed 1 year ago

BrennanPaciorek commented 1 year ago

Enhancement: Make resetting to defaults reload instead of restart firewalld

Reason: Reloading firewalld is sufficient to complete the configuration reset, while restarting adds downtime. During that downtime, a connection can be opened that persists after firewalld finishes restarting; such a connection can then be used to bypass firewall rules, since firewalld will not block traffic from active connections.

Result: Minimal downtime when using `previous: replaced`

Addresses an issue brought up in #140: because resetting to defaults restarts firewalld, the feature may not be suitable for production environments.
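
For context, the change boils down to reloading firewalld instead of restarting it once its configuration has been reset; a minimal sketch using `ansible.builtin.systemd` (the role's actual task layout may differ):

```yaml
# Minimal sketch - the reset of the configuration files themselves is not shown.
- name: Apply the default configuration without dropping the firewall
  ansible.builtin.systemd:
    name: firewalld
    state: reloaded   # previously: restarted, which leaves the firewall down briefly
```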

codecov[bot] commented 1 year ago

Codecov Report

Patch and project coverage have no change.

Comparison is base (e8e8769) 53.62% compared to head (29a1126) 53.62%.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main     #159    +/-   ##
========================================
  Coverage   53.62%   53.62%
========================================
  Files           2        2
  Lines         800      800
========================================
  Hits          429      429
  Misses        371      371
```

:umbrella: View full report in Codecov by Sentry.

richm commented 1 year ago

how can we test this?

BrennanPaciorek commented 1 year ago

> how can we test this?

If you mean verifying that this will not introduce bugs into the feature, we could use `firewall-cmd` with `ansible.builtin.command`. This would involve:

* Using the current implementation of `previous: replaced` to reset the firewall configuration
* Getting some representation of the active firewall configuration (something like a hash of `nft list ruleset`)
* Restarting the firewalld service instead of reloading
* Getting the representation of the active firewall configuration again
* Comparing the two stored values

If `nft list ruleset` output is less predictable than I think, testing it may be more difficult than this.
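
A hypothetical sketch of those steps as Ansible tasks (the role invocation, register names, and the direct stdout comparison are illustrative, not actual test code from the role):

```yaml
- name: Reset the firewall configuration with the role (reload path)
  ansible.builtin.include_role:
    name: linux-system-roles.firewall
  vars:
    firewall:
      - previous: replaced

- name: Capture the active ruleset after the role's reset
  ansible.builtin.command: nft list ruleset
  register: ruleset_after_reload
  changed_when: false

- name: Restart firewalld instead of reloading
  ansible.builtin.systemd:
    name: firewalld
    state: restarted

- name: Capture the active ruleset after the restart
  ansible.builtin.command: nft list ruleset
  register: ruleset_after_restart
  changed_when: false

- name: Both paths should yield the same active configuration
  ansible.builtin.assert:
    that:
      - ruleset_after_reload.stdout == ruleset_after_restart.stdout
```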

If you mean testing minimal downtime: `systemctl reload firewalld` uses firewalld's reload implementation to reload configuration from files, while `systemctl restart firewalld` turns the firewall off and back on, leaving the firewall off for a few seconds, which is usually a bad idea for a production environment.

BrennanPaciorek commented 1 year ago
> If `nft list ruleset` output is less predictable than I think, testing it may be more difficult than this.

It appears that restart and reload produce functionally identical configurations, but the rules tend to be reordered slightly within the same rule chains, so I'll need to find a more accurate way to compare two configurations.
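
One crude way to make that comparison order-insensitive (continuing the hypothetical sketch above) is to sort the captured lines before comparing; parsing `nft -j list ruleset` output as JSON would be a more precise alternative:

```yaml
- name: Compare the two rulesets ignoring rule order within chains
  ansible.builtin.assert:
    that:
      - ruleset_after_reload.stdout_lines | sort == ruleset_after_restart.stdout_lines | sort
```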

richm commented 1 year ago

> > how can we test this?
>
> If you mean verifying that this will not introduce bugs into the feature, we could use `firewall-cmd` with `ansible.builtin.command`. This would involve:
>
> * Using the current implementation of `previous: replaced` to reset the firewall configuration
> * Getting some representation of the active firewall configuration (something like a hash of `nft list ruleset`)
> * Restarting the firewalld service instead of reloading
> * Getting the representation of the active firewall configuration again
> * Comparing the two stored values
>
> If `nft list ruleset` output is less predictable than I think, testing it may be more difficult than this.
>
> If you mean testing minimal downtime: `systemctl reload firewalld` uses firewalld's reload implementation to reload configuration from files, while `systemctl restart firewalld` turns the firewall off and back on, leaving the firewall off for a few seconds, which is usually a bad idea for a production environment.

I mean the latter - check that the fix ensures minimal downtime. Is there a way we can automate this? If not, how do we tell QE what to test?

BrennanPaciorek commented 1 year ago

> is there a way we can automate this

Testing service downtime (or uptime) wouldn't work, because with reload, the service technically does not stop.

An idea would be to change a zone's target to DROP (or set an ICMP block) and place an interface in this zone, then send a set number of packets at a fixed interval that would be dropped under this configuration. If replies come back, it can be taken to mean the firewall is down; if the packets are lost, you can take it to mean the firewall is up and dropping them. Whichever approach (reload or restart) loses the most packets can be understood as having the least downtime.

Or one could bring up a service like a simple HTTP server, block it with the firewall, and do a similar test.
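
A rough sketch of the ping-based variant as Ansible tasks; the interface name, target address, and probe counts are placeholders, and the same steps would be repeated with `systemctl reload firewalld` (registering, say, `replies_during_reload`) to get the other data point:

```yaml
# Hypothetical sketch: drop incoming traffic on the test interface, then count how
# many echo replies make it back while firewalld restarts.
- name: Put the test interface in the drop zone
  ansible.builtin.command: firewall-cmd --zone=drop --change-interface=eth1

- name: Ping through the firewall while it restarts
  ansible.builtin.shell: |
    ping -i 0.2 -c 50 -W 1 192.0.2.10 > /tmp/ping_during_restart.txt &
    systemctl restart firewalld
    wait

- name: Count replies (each reply marks a moment when the firewall was down)
  ansible.builtin.shell: grep -c 'bytes from' /tmp/ping_during_restart.txt || true
  register: replies_during_restart
  changed_when: false
```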

Does this seem like a valid test?

richm commented 1 year ago

> > is there a way we can automate this
>
> Testing service downtime (or uptime) wouldn't work, because with reload, the service technically does not stop.
>
> An idea would be to change a zone's target to DROP (or set an ICMP block) and place an interface in this zone, then send a set number of packets at a fixed interval that would be dropped under this configuration. If replies come back, it can be taken to mean the firewall is down; if the packets are lost, you can take it to mean the firewall is up and dropping them. Whichever approach (reload or restart) loses the most packets can be understood as having the least downtime.
>
> Or one could bring up a service like a simple HTTP server, block it with the firewall, and do a similar test.
>
> Does this seem like a valid test?

They both seem like valid tests - by "simple http server", maybe `nc -l -p 80 -o file`? Then we could look at the file and see if packets were dropped?

BrennanPaciorek commented 1 year ago

> They both seem like valid tests - by "simple http server", maybe `nc -l -p 80 -o file`? Then we could look at the file and see if packets were dropped?

nc would be better for mimicking production environments, while ping is probably a more convenient tool for getting the necessary metrics (% packet loss, estimated downtime).

One issue so far in implementing this in a tox-friendly manner is that firewalld accepts outbound traffic on the loopback interface before evaluating policies (where outbound traffic is blocked). Disabling this rule takes some time (~100 ms), which may result in more "downtime" than in a production environment (or even a less restricted test environment), where the limitation of needing to use the loopback interface for tests does not exist.

The other issue with these tests is that using the `previous: replaced` feature would undo the firewall rules needed to test the effectiveness of the modified feature. Would comparing downtime between `systemctl reload firewalld.service` and `systemctl restart firewalld.service` suffice for testing this? If not, we can use a modified reset script to re-add any required rules before restarting or reloading firewalld.

richm commented 1 year ago

> > They both seem like valid tests - by "simple http server", maybe `nc -l -p 80 -o file`? Then we could look at the file and see if packets were dropped?
>
> nc would be better for mimicking production environments, while ping is probably a more convenient tool for getting the necessary metrics (% packet loss, estimated downtime).
>
> One issue so far in implementing this in a tox-friendly manner is that firewalld accepts outbound traffic on the loopback interface before evaluating policies (where outbound traffic is blocked). Disabling this rule takes some time (~100 ms), which may result in more "downtime" than in a production environment (or even a less restricted test environment), where the limitation of needing to use the loopback interface for tests does not exist.
>
> The other issue with these tests is that using the `previous: replaced` feature would undo the firewall rules needed to test the effectiveness of the modified feature. Would comparing downtime between `systemctl reload firewalld.service` and `systemctl restart firewalld.service` suffice for testing this?

Yes, I think that would be fine.

> If not, we can use a modified reset script to re-add any required rules before restarting or reloading firewalld.
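
Continuing the hypothetical ping sketch above, the comparison itself could then be a single assertion that fewer probes leaked through during the reload than during the restart:

```yaml
- name: Reload should let fewer (ideally zero) probes through than restart
  ansible.builtin.assert:
    that:
      - (replies_during_reload.stdout | int) <= (replies_during_restart.stdout | int)
```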

BrennanPaciorek commented 1 year ago

Okay, I've implemented the described test in a way that is compatible with the integration tests. I can move this to another file, but the test takes some time and probably shouldn't be run on every PR, since it's not testing any code prone to change, just highlighting the security difference made by the one-line change.

richm commented 1 year ago

[citest]

BrennanPaciorek commented 1 year ago

Locally, I can't seem to reproduce the same error that is showing on these CI tests. Maybe it has to do with the order in which the tests are being run?

richm commented 1 year ago

> Locally, I can't seem to reproduce the same error that is showing on these CI tests. Maybe it has to do with the order in which the tests are being run?

Yes, that must be what it is

richm commented 1 year ago

[citest]

BrennanPaciorek commented 1 year ago

Modified the test one more time, as some platforms running the test incorrectly reported packet loss because the timeout I had set was too short.

richm commented 1 year ago

[citest]