litmuschaos / chaos-charts

Repository to hold chaos experiments resource YAML bundles
Apache License 2.0

Pod DNS error and Pod DNS spoof litmus tests validations and TOTAL_CHAOS_DURATION issue #564

Open pawanphalak opened 2 years ago

pawanphalak commented 2 years ago

For the Pod DNS error litmus experiment, we followed the steps (https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-dns-error/#ramp-time) to generate chaos for the target hostname (nginx), and we also ran a shell script to validate whether the chaos was injected. However, we could not observe any chaos injection in the application pods, since DNS resolution for the hostname kept working for the entire duration of the chaos.

We also wanted to debug this further by increasing TOTAL_CHAOS_DURATION to a higher value (e.g. 300 seconds), but even then the chaos experiment completes within 30-40 seconds. Can you please confirm whether there is any other configuration we can use to increase the chaos duration, or how we can validate the chaos experiment? We noticed the same behavior for the Pod DNS spoof experiment.

gdsoumya commented 2 years ago

Some applications cache DNS results; if the results are cached before the chaos is injected, you will not see the experiment's effects. How did you validate whether the DNS error was injected successfully? It would be good if you could share the shell script you used and also your ChaosEngine spec for the experiment.

pawanphalak commented 2 years ago

Thanks @gdsoumya for the response. I checked for DNS caching; it is a simple nginx deployment with a ClusterIP service named nginx. I tested it with the following script:

#!/bin/bash
while :
do
    curl nginx >> /tmp/outputtrace.log && sleep 0.01;
    curl -LI nginx -o /dev/null -w '%{http_code}\n' -s >> /tmp/outputstatus.log && sleep 0.01;
done

Following is the chaos engine spec for the experiment:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-error
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-dns-error-sa
  jobCleanUpPolicy: retain
  experiments:
  - name: pod-dns-error
    spec:
      components:
        env:
        - name: CONTAINER_RUNTIME
          value: containerd
        ## comma separated list of host names
        ## if not provided, all hostnames/domains will be targeted
        - name: TARGET_HOSTNAMES
          value: '["nginx"]'
        - name: TOTAL_CHAOS_DURATION
          value: '500'
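
For reference, a minimal sketch of how I apply the engine and check the result (the file name is arbitrary, and the ChaosResult name is assumed to follow the <engine>-<experiment> convention):

# apply the ChaosEngine above (file name is just what I used locally)
kubectl apply -f dns-error-engine.yaml

# inspect the verdict; ChaosResult name assumed to be <engine-name>-<experiment-name>
kubectl get chaosresult dns-error-pod-dns-error -n default -o yaml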
gdsoumya commented 2 years ago

Are you running the DNS chaos on the nginx pod itself or on some other pod that is accessing nginx? Also, maybe try using the fully qualified hostname for the service, like <svc-name>.<namespace>.svc.cluster.local.
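
For example, assuming the service is named nginx in the default namespace (adjust to your cluster), the env entry would look something like:

        - name: TARGET_HOSTNAMES
          value: '["nginx", "nginx.default.svc.cluster.local"]'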

pawanphalak commented 2 years ago

Running the DNS chaos on the same nginx pod. I tried using the complete hostname as well, but didn't get any DNS error. I also validated with an external hostname like google.com as the target hostname, but got all 200 responses.

gdsoumya commented 2 years ago

If you run the experiment on the nginx pod itself, it might not show the effect properly. You need to run it on the pod where you want the DNS requests to fail; for example, start a new pod, run the chaos on that, and use dig/curl from inside that pod to access nginx (see the sketch below). Any other domain/host will not be affected because you set the target to just nginx, so google.com will not fail.
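
A minimal sketch of what that could look like (the image and names here are only placeholders, not from the docs):

# throwaway client deployment; any image that ships dig/curl works, netshoot is one option
kubectl create deployment dns-client --image=nicolaka/netshoot -- sleep 3600

# point the ChaosEngine appinfo at this deployment instead of nginx:
#   applabel: "app=dns-client"
#   appkind: "deployment"

# while the chaos is running, lookups and requests for nginx from this pod should fail
kubectl exec deploy/dns-client -- dig nginx.default.svc.cluster.local
kubectl exec deploy/dns-client -- curl -sS --max-time 5 http://nginx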

Also, just confirming: did you update the ChaosEngine with the full service hostname?

pawanphalak commented 2 years ago

We specify the app on which the chaos should run as follows in the ChaosEngine spec, right?

appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"

So in this case, the expected behavior is that we should see DNS errors for the hostnames specified in TARGET_HOSTNAMES when we run curl from the target pods themselves?

> Also, just confirming: did you update the ChaosEngine with the full service hostname?

Yes. I also tried removing the TARGET_HOSTNAMES variable completely (which should target all hostnames), but I was still not able to validate the chaos.

gdsoumya commented 2 years ago

Which container runtime are you using? Is it containerd?

pawanphalak commented 2 years ago

Yes, containerd.

gdsoumya commented 2 years ago

> Yes, containerd.

containerd has had some issues with DNS chaos. Can you check the logs of the helper pod and confirm whether any errors are being reported there?
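
Something along these lines should surface the helper logs (the helper pod name pattern is an assumption from what I've seen; adjust to whatever actually shows up):

# list pods in the chaos namespace while the experiment runs; the helper usually has "helper" in its name
kubectl get pods -n default -w

# grab its logs quickly (add --previous if the container has already exited or restarted)
kubectl logs -n default <pod-dns-error-helper-pod-name>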

pawanphalak commented 2 years ago

The helper pod is getting deleted immediately. I tried this experiment multiple times; sometimes the helper pods also went into an error state, but the final chaos result still showed as passed. Is there any configuration to keep the helper pods around?

gdsoumya commented 2 years ago

Is it getting deleted as soon as the experiment starts?

pawanphalak commented 2 years ago

yes

pawanphalak commented 2 years ago

@gdsoumya, I just created a new cluster with the Docker runtime and the DNS chaos worked as expected, so it looks like there are some issues with containerd. Also, just wanted to confirm: does the chaos get injected into only one of the pods if the target application has multiple replicas? My observation was that I saw DNS errors in only one of the 2 pods present.

gdsoumya commented 2 years ago

It should affect all pods as far as I know. Can you set the pods-affected percentage to 100% and check? Tagging @ispeakc0de for further support on the pods-affected behavior.
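
If it helps, a sketch of forcing all replicas to be targeted, assuming the standard PODS_AFFECTED_PERC env is honored by pod-dns-error (add it alongside the other env entries in the experiment spec):

        - name: PODS_AFFECTED_PERC
          value: '100'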