sonic-net / sonic-utilities

Command line utilities for the SONiC project
Other
154 stars 650 forks source link

[GCU or caclmgrd] GCU CACL update might broke caclmgrd #2049

Closed wen587 closed 2 years ago

wen587 commented 2 years ago

Description

Case 1(described below): During continuous cacl test, caclmgrd may be broken and iptabels won't update. Case 2(during kvm test): There is a possibility that iptables is not immediately updated cuz caclmgrd UPDATE_DELAY_SECS=0.5. Refer test here: https://github.com/Azure/sonic-mgmt/blob/master/tests/generic_config_updater/test_cacl.py#L467 It will have a chance to fail which is already found in kvm nightly test. It check immediately after apply-patch success, while caclmgrd might upadte iptables in 0.5 seconds. Example syslog:

vlab-01 INFO caclmgrd[2235692]: ACL config not stable for namespace '': 2 changes detected in the past 0.5 seconds. Skipping update ...
///////@If we check iptables between these two move, the iptables check will fail as it have not been updated.@
vlab-01 INFO caclmgrd[2235692]: ACL config for namespace '' has not changed for 0.5 seconds. Applying updates ...

Two thoughts I am thinking:

  1. GCU only make sure configDB's table is updated. Caclmgrd subscribe configDB tables. Maybe we can make it to wait(sleep) 1 sec inside GCU to make sure caclmgrd does take effect and update iptables. Or we will need some communication between GCU and caclmgrd. I don't know if that is feasible.
  2. Make GCU apply the whole cacl change in one step. Just like config load loads a whole json file at one time. This way, caclmgrd may not have the update delay. In this way, Case 1 may also be fixed?

Steps to reproduce the issue

  1. Provide add and remove cacl config json file
    
    admin@vlab-01:~/cacl/test$ cat test11.json
    [
        {
            "op": "add",
            "path": "/ACL_RULE",
            "value": {
                "SSH_ONLY|TEST_DROP": {
                 "IP_TYPE": "IP",
                 "PRIORITY": "9998",
                 "PACKET_ACTION": "DROP",
                 "SRC_IP": "9.9.9.9/32"
                }
            }
        }
    ]
    admin@vlab-01:~/cacl/test$ cat test22.json
    [
        {
            "op": "remove",
            "path": "/ACL_RULE"
        }
    ]
2. Continuous run python script

admin@vlab-01:~/cacl/test$ cat bug_repro.py

import subprocess import time import datetime def run_script():

add cacl config

cmd = "sudo config apply-patch -i /FEATURE -i /QUEUE -i /SCHEDULER test11.json"
result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
_, _= result.communicate()

cmd = "sudo iptables -S"
result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
text, err = result.communicate()

if "9.9.9.9" not in text:
    x = datetime.datetime.now()
    print(x + " : fail to add")
    print(text)
    return "fail"

# remove cacl config
cmd = "sudo config apply-patch -i /FEATURE -i /QUEUE -i /SCHEDULER test22.json"
result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
_, _= result.communicate()

cmd = "sudo iptables -S"
result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
text, err = result.communicate()

if "9.9.9.9" in text:
    x = datetime.datetime.now()
    print(x + " : fail to remove")
    print(text)
    return "fail"

for i in range(1, 100): print("iteration"+str(i)) ret= run_script() if ret== "fail": break

3. caclmgrd being broke and iptables won't be updated

admin@vlab-01:~/cacl/test$ python verify.py iteration1 iteration2 ... iteration25 2022-01-29 03:58:09.802610 : fail to remove <====== -P INPUT ACCEPT -P FORWARD ACCEPT -P OUTPUT ACCEPT -A INPUT -s 127.0.0.1/32 -i lo -j ACCEPT -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 0 -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT -A INPUT -p icmp -m icmp --icmp-type 11 -j ACCEPT -A INPUT -p udp -m udp --dport 67:68 -j ACCEPT -A INPUT -p udp -m udp --dport 546:547 -j ACCEPT -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT -A INPUT -p tcp -m tcp --sport 179 -j ACCEPT -A INPUT -s 9.9.9.9/32 -p tcp -m tcp --dport 22 -j DROP <===== -A INPUT -d 10.1.0.32/32 -j DROP -A INPUT -d 10.250.0.0/32 -j DROP -A INPUT -d 192.168.0.1/32 -j DROP -A INPUT -d 10.0.0.56/32 -j DROP -A INPUT -d 10.0.0.58/32 -j DROP -A INPUT -d 10.0.0.60/32 -j DROP -A INPUT -d 10.0.0.62/32 -j DROP -A INPUT -m ttl --ttl-lt 2 -j ACCEPT -A INPUT -j DROP

#### Describe the results you received
Check syslog. And iptables won't be updated due to caclmgrd's failure

admin@vlab-01:~/cacl/test$ sudo cat /var/log/syslog | less Jan 29 03:58:04.375320 vlab-01 NOTICE config: Patch Applier: The patch was sorted into 5 changes: Jan 29 03:58:04.376123 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION"}] Jan 29 03:58:04.376229 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY"}] Jan 29 03:58:04.376325 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP"}] Jan 29 03:58:04.376399 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/IP_TYPE"}] Jan 29 03:58:04.376473 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE"}] Jan 29 03:58:04.376601 vlab-01 NOTICE config: Patch Applier: Applying 5 changes in order: Jan 29 03:58:04.376695 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION"}] Jan 29 03:58:04.705395 vlab-01 INFO caclmgrd[2235692]: ACL change detected for namespace '' Jan 29 03:58:04.705636 vlab-01 INFO caclmgrd[2235692]: Spawning ACL update thread for namepsace '' ... Jan 29 03:58:05.056248 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY"}] Jan 29 03:58:05.206276 vlab-01 INFO caclmgrd[2235692]: ACL config not stable for namespace '': 1 changes detected in the past 0.5 seconds. Skipping update ... Jan 29 03:58:05.707468 vlab-01 INFO caclmgrd[2235692]: ACL config not stable for namespace '': 2 changes detected in the past 0.5 seconds. Skipping update ... Jan 29 03:58:05.847169 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP"}] Jan 29 03:58:06.207341 vlab-01 INFO caclmgrd[2235692]: ACL config for namespace '' has not changed for 0.5 seconds. Applying updates ... Jan 29 03:58:06.217784 vlab-01 INFO caclmgrd[2235692]: Translating ACL rules for control plane ACL 'NTP_ACL' (service: 'NTP') Jan 29 03:58:06.217864 vlab-01 WARNING caclmgrd[2235692]: Unable to determine if ACL table 'NTP_ACL' contains IPv4 or IPv6 rules. Skipping table... Jan 29 03:58:06.217925 vlab-01 INFO caclmgrd[2235692]: Translating ACL rules for control plane ACL 'SNMP_ACL' (service: 'SNMP') Jan 29 03:58:06.218007 vlab-01 WARNING caclmgrd[2235692]: Unable to determine if ACL table 'SNMP_ACL' contains IPv4 or IPv6 rules. Skipping table... Jan 29 03:58:06.218104 vlab-01 INFO caclmgrd[2235692]: Translating ACL rules for control plane ACL 'SSH_ONLY' (service: 'SSH') Jan 29 03:58:06.218187 vlab-01 ERR caclmgrd[2235692]: rule_props for rule_id TEST_DROP does not have key 'PRIORITY'! Jan 29 03:58:06.218295 vlab-01 WARNING caclmgrd[2235692]: Unable to determine if ACL table 'SSH_ONLY' contains IPv4 or IPv6 rules. Skipping table... Jan 29 03:58:06.225719 vlab-01 INFO caclmgrd[2235692]: Traceback (most recent call last): Jan 29 03:58:06.225812 vlab-01 INFO caclmgrd[2235692]: File "/usr/local/bin/caclmgrd", line 866, in Jan 29 03:58:06.226357 vlab-01 INFO caclmgrd[2235692]: main() <========================= Jan 29 03:58:06.226406 vlab-01 INFO caclmgrd[2235692]: File "/usr/local/bin/caclmgrd", line 862, in main Jan 29 03:58:06.226692 vlab-01 INFO caclmgrd[2235692]: caclmgr.run() <========================= Jan 29 03:58:06.226734 vlab-01 INFO caclmgrd[2235692]: File "/usr/local/bin/caclmgrd", line 832, in run Jan 29 03:58:06.226985 vlab-01 INFO caclmgrd[2235692]: if self.config_db_map[namespace].get_table(self.ACL_TABLE)[acl_table]["type"] == self.ACL_TABLE_TYPE_CTRLPLANE: <========================= Jan 29 03:58:06.227040 vlab-01 INFO caclmgrd[2235692]: KeyError: 'SSH_ONLY' Jan 29 03:58:06.613009 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/IP_TYPE"}] Jan 29 03:58:07.304973 vlab-01 NOTICE config: Patch Applier: [{"op": "remove", "path": "/ACL_RULE"}] Jan 29 03:58:08.016759 vlab-01 NOTICE config: Patch Applier: Verifying patch updates are reflected on ConfigDB. Jan 29 03:58:08.909793 vlab-01 NOTICE config: Patch Applier: Patch application completed.

#### Describe the results you expected
Should not broke caclmgrd.

#### Additional information you deem important (e.g. issue happens only occasionally)

#### Output of `show version`

(paste your output here)



<!--
     Also attach debug file produced by `sudo generate_dump`
-->
ghooo commented 2 years ago

ran the script in the description, found the error happened on the first patch step:

Mar  8 19:05:55.610196 vlab-01 NOTICE config: Patch Applier: The patch was sorted into 4 changes:
Mar  8 19:05:55.610673 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE", "value": {"SSH_ONLY|TEST_DROP": {"IP_TYPE": "IP"}}}]
Mar  8 19:05:55.610760 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY", "value": "9998"}]
Mar  8 19:05:55.611657 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION", "value": "DROP"}]
Mar  8 19:05:55.611757 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP", "value": "9.9.9.9/32"}]
Mar  8 19:05:55.611865 vlab-01 NOTICE config: Patch Applier: Applying 4 changes in order:
Mar  8 19:05:55.611949 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE", "value": {"SSH_ONLY|TEST_DROP": {"IP_TYPE": "IP"}}}]
Mar  8 19:05:55.975080 vlab-01 INFO caclmgrd[130303]: **** ACL TABLE has 0 entries. 
Mar  8 19:05:55.977561 vlab-01 INFO caclmgrd[130303]: **** SSH_ONLY not found. Try again.
Mar  8 19:05:55.984697 vlab-01 INFO caclmgrd[130303]: Traceback (most recent call last):
Mar  8 19:05:55.988131 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 836, in run
Mar  8 19:05:55.988991 vlab-01 INFO caclmgrd[130303]:     if acl_table_content[acl_table]["type"] == self.ACL_TABLE_TYPE_CTRLPLANE:
Mar  8 19:05:55.989041 vlab-01 INFO caclmgrd[130303]: KeyError: 'SSH_ONLY'
Mar  8 19:05:55.989085 vlab-01 INFO caclmgrd[130303]: During handling of the above exception, another exception occurred:
Mar  8 19:05:55.989109 vlab-01 INFO caclmgrd[130303]: Traceback (most recent call last):
Mar  8 19:05:55.989149 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 876, in <module>
Mar  8 19:05:55.992773 vlab-01 INFO caclmgrd[130303]:     main()
Mar  8 19:05:55.993474 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 872, in main
Mar  8 19:05:55.994981 vlab-01 INFO caclmgrd[130303]:     caclmgr.run()
Mar  8 19:05:55.995057 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 840, in run
Mar  8 19:05:55.995105 vlab-01 INFO caclmgrd[130303]:     if self.config_db_map[namespace].get_table(self.ACL_TABLE)[acl_table]["type"] == self.ACL_TABLE_TYPE_CTRLPLANE:
Mar  8 19:05:55.995155 vlab-01 INFO caclmgrd[130303]: KeyError: 'SSH_ONLY'
Mar  8 19:05:56.318380 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY", "value": "9998"}]
Mar  8 19:05:57.136246 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION", "value": "DROP"}]
Mar  8 19:05:57.790636 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP", "value": "9.9.9.9/32"}]
ghooo commented 2 years ago

From the logs

Jan 29 03:58:06.218187 vlab-01 ERR caclmgrd[2235692]: rule_props for rule_id TEST_DROP does not have key 'PRIORITY'!

It seems PRIORITY is required.

ghooo commented 2 years ago

from caclmgr:

Why caclmgr crashes?

It crashes because during the update was divided into smaller steps, the first step was adding the IP_TYPE without the PRIORITY nor the PACKET_ACTION.

Why caclmgr only crashes after a lot of tries, and does not crash immediately?

Because caclmgr waits for 0.5s after each update for other updates. During this time we usually manage to also push PRIORITY and PACKET_ACTION. We have to try multiple times to get the failure, because we need a case where the time between pushing IP_TYPE and the other properties is > 0.5s.