litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a Cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Network-loss doesn't work as expected in a multi-container pod #4750

Open jan-machacek-kosik opened 1 month ago

jan-machacek-kosik commented 1 month ago

network-loss.zip

What happened: I have a pod with one main container and three sidecars. When network loss is applied to the custom sidecar container and the destination host, all network connectivity in the pod is lost.

What you expected to happen: I expect the traffic loss to affect only connections from the targeted sidecar container to my otel-collector K8s service.

How to reproduce it (as minimally and precisely as possible): these are the env values for this experiment (the DESTINATION_HOSTS lookup is sketched just after the list):

  - name: TARGET_CONTAINER
    value: otel-agent
  - name: LIB_IMAGE
  - name: NETWORK_PACKET_CORRUPTION_PERCENTAGE
    value: "100"
  - name: TOTAL_CHAOS_DURATION
    value: "600"
  - name: CONTAINER_RUNTIME
    value: containerd
  - name: DESTINATION_HOSTS
    value: dev-collector.otel-collector.svc.cluster.local
  - name: DEFAULT_HEALTH_CHECK
    value: "false"
  - name: SEQUENCE
    value: parallel
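
For reference, DESTINATION_HOSTS is given as a DNS name here; the experiment resolves it to an IP before building the tc rules, which is where the 10.0.30.140 address shown further down comes from. A quick way to confirm which address will be targeted, assuming the service is dev-collector in the otel-collector namespace (the pod/namespace names in the second command are placeholders, and getent may not exist in a minimal image):

# ClusterIP of the service named in DESTINATION_HOSTS
kubectl -n otel-collector get svc dev-collector -o jsonpath='{.spec.clusterIP}'
# resolution as seen from inside the application pod
kubectl -n <app-namespace> exec <app-pod> -c otel-agent -- getent hosts dev-collector.otel-collector.svc.cluster.local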

So I expected that the connection from the otel-agent container to dev-collector.otel-collector.svc.cluster.local would be disabled, while all other connections from the other containers to any endpoint would keep working. However, while this experiment is running, every connection from all containers in the pod is disabled, causing the readiness probe to fail.

When I investigated how this experiment works, I realized that these commands are applied:

sudo nsenter -t 561580 -n tc qdisc replace dev eth0 root handle 1: prio
sudo nsenter -t 561580 -n tc qdisc replace dev eth0 parent 1:3 netem loss 100   
sudo nsenter -t 561580 -n tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 10.0.30.140 flowid 1:3
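
For anyone reproducing this, a minimal way to inspect what those rules actually do inside the same network namespace (561580 is the PID from the commands above); drops counted in class 1:3 while unrelated traffic flows would mean non-matching packets are also landing in the netem band:

# show the root prio qdisc and the netem child
sudo nsenter -t 561580 -n tc qdisc show dev eth0
# per-band packet/drop counters for classes 1:1, 1:2 and 1:3
sudo nsenter -t 561580 -n tc -s class show dev eth0
# the u32 filter that is supposed to steer only 10.0.30.140 into band 1:3
sudo nsenter -t 561580 -n tc filter show dev eth0 parent 1:0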

It looks like only the connection to 10.0.30.140 should be affected, which would be correct. But in the real experiment every connection outside of the pod is disabled. For example, the sidecar with the proxysql container is not allowed to connect to the database.
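
To narrow it down per container while the chaos is running, something like the following should show which containers actually lose connectivity (pod/namespace, DB host and ports are placeholders, and nc may not be present in every image):

# expected to fail: targeted container to the targeted host (4317 assumed as the OTLP gRPC port)
kubectl -n <app-namespace> exec <app-pod> -c otel-agent -- nc -vz -w 2 dev-collector.otel-collector.svc.cluster.local 4317
# expected to keep working, but reportedly fails: proxysql sidecar to the database
kubectl -n <app-namespace> exec <app-pod> -c proxysql -- nc -vz -w 2 <db-host> 3306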

Anything else we need to know?: I run this experiment on an AKS cluster. Kubernetes version: 1.29.2. Litmus Helm targetRevision: 3.8.0. The manifest of the experiment is attached.

jan-machacek-kosik commented 1 month ago

Further investigation revealed that the problem affects only the sidecar with the proxysql container; other outgoing connections work.
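
One thing that might explain why only the proxysql sidecar is hit: the rules above use a plain prio qdisc, and with the default priomap, packets carrying certain TOS/DSCP markings (e.g. 0x02 or 0x08) are classified into band 1:3 even when they don't match the u32 filter, so any container whose traffic is TOS-marked would go through the netem loss. A way to check from the node, plus a rough mitigation sketch (the 3306 port and the priomap override are assumptions on my side, not current Litmus behaviour; tcpdump must be installed on the node):

# look at the tos field of proxysql's upstream packets
sudo nsenter -t 561580 -n tcpdump -v -n -c 20 'tcp port 3306'
# if the root qdisc were created like this instead of the first command above,
# all unfiltered traffic would stay in band 1:1 and only the u32-matched traffic would reach the netem band
sudo nsenter -t 561580 -n tc qdisc replace dev eth0 root handle 1: prio bands 3 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0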