projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Upgrade from 3.27 to 3.28, IPPool Issue #9100

Open kkbruce opened 3 months ago

kkbruce commented 3 months ago

We followed the upgrade docs (operator-based installation) to upgrade Calico from 3.27 to 3.28.

$ calicoctl version
Client Version:    v3.28.1
Git commit:        601856343
Cluster Version:   v3.28.1
Cluster Type:      typha,kdd,k8s,operator,bgp,kubeadm,win

Expected Behavior

We should be able to set the pool back to disabled: true or delete the old default-ipv4-ippool configuration.

Current Behavior

On 3.27, we created a new IPPool following the documentation and set disabled: true on the old default-ipv4-ippool, which worked fine. After upgrading to 3.28, however, disabled: true had been reset to false, and we can neither set it back to true nor delete the old default-ipv4-ippool configuration, as shown under "Steps to Reproduce" below.

Possible Solution

Is it possible to provide a downgrade/restore file or documented steps, so that when an upgrade goes wrong the cluster can be quickly returned to a working version or state?

Steps to Reproduce (for bugs)

$ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR
default-ipv4-ippool   192.168.0.0/16   true   Never      CrossSubnet   false      false              all()
new-pool              10.244.0.0/16    true   Never      CrossSubnet   false      false              all()

### add disabled: true
$ calicoctl get ippool -o yaml > pools.yaml
$ vim pools.yaml
$ calicoctl apply -f pools.yaml
Successfully applied 2 'IPPool' resource(s)
$ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR
default-ipv4-ippool   192.168.0.0/16   true   Never      CrossSubnet   false      false              all()
new-pool              10.244.0.0/16    true   Never      CrossSubnet   false      false              all()

### delete default-ipv4-ippool
$ calicoctl delete pool default-ipv4-ippool
Successfully deleted 1 'IPPool' resource(s)
$ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR
default-ipv4-ippool   192.168.0.0/16   true   Never      CrossSubnet   false      false              all()
new-pool              10.244.0.0/16    true   Never      CrossSubnet   false      false              all()

Context

The default 192.168.x.x range conflicted with other internal network segments, so Pod containers could not reliably reach 192.168.x.x services on the internal network. We therefore changed the pool to 10.244.x.x, and after setting disabled: true on default-ipv4-ippool, network access returned to normal.

Your Environment

kkbruce commented 3 months ago

Using the patch command gives the same result.

$ calicoctl patch ippool default-ipv4-ippool -p '{"spec": {"disabled": true}}'

$ calicoctl get ippool -o wide
NAME                  CIDR             NAT    IPIPMODE   VXLANMODE     DISABLED   DISABLEBGPEXPORT   SELECTOR
default-ipv4-ippool   192.168.0.0/16   true   Never      CrossSubnet   false      false              all()
new-pool              10.244.0.0/16    true   Never      CrossSubnet   false      false              all()
kkbruce commented 3 months ago

Do I need to delete all Pods?

We can confirm that no Pods are currently using a 192.168.x.x IP address.

$ calicoctl ipam show --show-blocks
+----------+------------------+-----------+------------+--------------+
| GROUPING |       CIDR       | IPS TOTAL | IPS IN USE |   IPS FREE   |
+----------+------------------+-----------+------------+--------------+
| IP Pool  | 192.168.0.0/16   |     65536 | 0 (0%)     | 65536 (100%) |
| IP Pool  | 10.244.0.0/16    |     65536 | 48 (0%)    | 65488 (100%) |
| Block    | 10.244.167.0/26  |        64 | 4 (6%)     | 60 (94%)     |
| Block    | 10.244.28.192/26 |        64 | 42 (66%)   | 22 (34%)     |
| Block    | 10.244.58.192/26 |        64 | 2 (3%)     | 62 (97%)     |
+----------+------------------+-----------+------------+--------------+
kkbruce commented 3 months ago

Files from sudo calicoctl node diags:

diags-20240806_213443.tar_20240806214137.gz

caseydavenport commented 3 months ago

@kkbruce in Calico v3.28, the operator has been updated to reconcile changes to IP pools.

If your IP pool is defined within your Installation, the operator will attempt to make sure that the actual IP pool in the cluster matches the one in your Installation. I suspect that is what is happening here.

If you don't want to use the 192.168.0.0 IP pool, you should just be able to delete it (from the Installation) - unless you want it for other reasons like NAT?
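
One way to check whether a pool is being reconciled by the operator is to compare its CIDR with the pools declared in the Installation. The sketch below assumes the Installation resource is named default (as in the kubectl edit installation default command mentioned later in this thread) and that kubectl has access to the cluster; the helper function names are hypothetical, not part of any Calico tooling.

```shell
#!/usr/bin/env bash
# Sketch: find out whether an IP pool CIDR is declared in the operator's
# Installation resource. Assumes the Installation is named "default".

# Print one CIDR per line from spec.calicoNetwork.ipPools of the Installation.
installation_pool_cidrs() {
  kubectl get installation default \
    -o jsonpath='{range .spec.calicoNetwork.ipPools[*]}{.cidr}{"\n"}{end}'
}

# Print "yes" if the CIDR given as $1 is declared in the Installation,
# otherwise print "no". The match is on the exact, whole line.
is_operator_managed() {
  if installation_pool_cidrs | grep -Fxq -- "$1"; then
    echo "yes"
  else
    echo "no"
  fi
}
```

For example, is_operator_managed 192.168.0.0/16 printing yes would suggest that changes made to that pool with calicoctl are being reverted by the operator, and that the pool should instead be changed or removed in the Installation.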

kkbruce commented 3 months ago

Because we needed to restore the Calico CNI network to a working state quickly, we downgraded to version 3.27. We currently have no test environment available to gather more information about version 3.28.

From another perspective, we followed the migrate-pools document. Up to version 3.27, that document did not mention any operator steps (such as kubectl edit installation default). With version 3.28, therefore, we need to become more familiar with the YAML configuration of the Installation itself and take care to keep its settings the same as the IP pools. Given the information above, the manifest-based procedure in the 3.28 migrate-pools document could also be improved.