Open pawanphalak opened 2 years ago
Some applications cache DNS results; if the results are cached before the chaos is injected, you will not see the experiment's effects. How did you validate whether the DNS error was injected successfully? It would be good if you could share the shell script you used and also your chaos engine spec for the experiment.
Thanks @gdsoumya for the response. I checked for DNS caching; the target is a simple nginx deployment with a ClusterIP service named nginx. I tested it with the following script:
#!/bin/bash
while :
do
    curl nginx >> /tmp/outputtrace.log && sleep 0.01;
    curl -LI nginx -o /dev/null -w '%{http_code}\n' -s >> /tmp/outputstatus.log && sleep 0.01;
done
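A minimal sketch of how the status log from this loop could be summarized afterwards. It relies on curl writing 000 for %{http_code} when it receives no HTTP response at all (for example when name resolution fails), so a run with working DNS chaos should show 000 entries next to the 200s. The helper name `summarize_status_log` is hypothetical and not part of Litmus.

```shell
#!/bin/bash
# Count how often each status code appears in the log, one "code count"
# pair per line. A code of 000 means curl got no HTTP response.
summarize_status_log() {
  local log_file="$1"
  sort "$log_file" | uniq -c | awk '{ print $2, $1 }'
}
```

Calling `summarize_status_log /tmp/outputstatus.log` after the experiment gives a quick view of whether any request failed during the chaos window.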
Following is the chaos engine spec for the experiment:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-error
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-dns-error-sa
  jobCleanUpPolicy: retain
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: CONTAINER_RUNTIME
              value: containerd
            ## comma separated list of host names
            ## if not provided, all hostnames/domains will be targeted
            - name: TARGET_HOSTNAMES
              value: '["nginx"]'
            - name: TOTAL_CHAOS_DURATION
              value: '500'
Are you running the DNS chaos on the nginx pod itself or on some other pod that is accessing nginx? Also, maybe try using the fully qualified hostname for the service, like <svc-name>.<namespace>.svc.cluster.local.
I am running the DNS chaos on the nginx pod itself. I tried the complete hostname as well, but didn't get any DNS errors. I also validated with an external hostname, google.com, as the target hostname, but got all 200 responses.
If you run the experiment on the nginx pod itself, it might not show the effect properly. You need to run it on the pod whose DNS requests you want to fail. For example, start a new pod, run the chaos on that pod, and use dig/curl inside it to access nginx. No other domain/host will be affected, because you set the target to just nginx, so google.com will not be affected.
Also just confirming that you updated the chaos engine with the full service hostname?
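As a rough sketch of the check suggested here, the functions below could be run inside the client pod that the chaos targets. They assume the default cluster domain `cluster.local` and that `getent` is available in the pod's image; `svc_fqdn`, `probe_dns` and `probe_loop` are hypothetical helper names, not Litmus tooling.

```shell
#!/bin/bash
# Build the fully qualified in-cluster name for a Service.
# Assumes the default cluster domain "cluster.local".
svc_fqdn() {
  local svc="$1" ns="${2:-default}"
  printf '%s.%s.svc.cluster.local\n' "$svc" "$ns"
}

# Resolve a name through the system resolver and print a timestamped result.
probe_dns() {
  local host="$1"
  if getent hosts "$host" > /dev/null 2>&1; then
    echo "$(date +%T) $host resolved"
  else
    echo "$(date +%T) $host FAILED"
  fi
}

# Probe the same name a fixed number of times, one second apart.
probe_loop() {
  local host="$1" count="${2:-10}" i=0
  while [ "$i" -lt "$count" ]; do
    probe_dns "$host"
    sleep 1
    i=$((i + 1))
  done
}
```

For example, `probe_loop "$(svc_fqdn nginx default)" 60` would print one line per second for a minute. Probing the resolver directly, rather than going through HTTP, keeps application-level caching out of the picture.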
We specify the app on which the chaos should run as follows in the chaos engine spec, right?
appinfo:
  appns: "default"
  applabel: "app=nginx"
  appkind: "deployment"
So in this case, the expected behavior is that we see DNS errors for the hostnames specified in TARGET_HOSTNAMES when we run curl from the pods themselves?
"Also just confirming that you updated the chaos engine with the full service hostname?" Yes. I also tried removing the TARGET_HOSTNAMES variable completely (which should target all hostnames), but I was still not able to validate the chaos.
Which container runtime are you using? Is it containerd?
Yes, containerd.
containerd has had some issues with DNS chaos. Can you check the logs of the helper pod and confirm whether any errors are reported there?
The helper pod is getting deleted immediately. I tried this experiment multiple times; sometimes the helper pod also went into the Error state, but the final chaos result still showed as passed. Is there any configuration to keep the helper pods running?
Is it getting deleted immediately as soon as the experiment starts?
yes
@gdsoumya, I just created a new cluster with the docker runtime and the DNS chaos worked as expected. It looks like there are some issues with containerd. Also, just to confirm: does the chaos get injected into only one of the pods if the target application has multiple replicas? I saw DNS errors in only one of the 2 pods present.
It should affect all pods as far as I know. Can you set the pods-affected percentage to 100% and see? Tagging @ispeakc0de for further support on pods affected.
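To see which replicas are actually affected, a sketch along the following lines could run one lookup inside every pod that matches the target label. It assumes `kubectl` access to the cluster and that `getent` exists in the container image; `check_pods_dns` is a hypothetical helper name.

```shell
#!/bin/bash
# Run one DNS lookup inside every pod matching a label selector and
# report the outcome per pod.
check_pods_dns() {
  local ns="$1" label="$2" host="$3" pod
  # "-o name" prints entries like "pod/nginx-abc", which "kubectl exec" accepts.
  for pod in $(kubectl get pods -n "$ns" -l "$label" -o name); do
    if kubectl exec -n "$ns" "$pod" -- getent hosts "$host" > /dev/null 2>&1; then
      echo "$pod: resolved"
    else
      echo "$pod: FAILED"
    fi
  done
}
```

Running `check_pods_dns default app=nginx nginx` while the experiment is active should show FAILED for each replica the chaos was injected into.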
For the Pod DNS Error Litmus experiment, we followed the steps (https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-dns-error/#ramp-time) to generate chaos for the target hostname (nginx), and we ran a shell script to validate whether the chaos was injected. But we could not observe the chaos injection in the application pods: the hostname still resolved during the entire chaos duration.
We also wanted to debug this further by increasing TOTAL_CHAOS_DURATION to a higher value (like 300 seconds), but even after increasing it, the chaos experiment completes within 30-40 seconds. Can you please confirm whether there is any other configuration we can use to increase the chaos duration, or a way to validate the chaos experiment? We also noticed similar behavior with the Pod DNS Spoof experiment.
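One way to compare the configured duration with what the pod actually experiences is to log timestamped lookups and measure the span of failures afterwards. The sketch below assumes `getent` is available and that `date +%s` is supported (GNU, BSD and BusyBox date all support it); `log_probe` and `failure_window` are hypothetical helper names, and the log format is an assumption of this example.

```shell
#!/bin/bash
# Emit one "epoch-seconds status" line for a single lookup.
log_probe() {
  local host="$1" status="ok"
  getent hosts "$host" > /dev/null 2>&1 || status="FAILED"
  echo "$(date +%s) $status"
}

# Print the number of seconds between the first and last failed probe
# in a log produced by log_probe (0 if nothing failed).
failure_window() {
  awk '$2 == "FAILED" { if (first == "") first = $1; last = $1 }
       END { if (first == "") print 0; else print last - first }' "$1"
}
```

With `log_probe nginx >> /tmp/dnsprobe.log` called once per second during the experiment, `failure_window /tmp/dnsprobe.log` gives an observed chaos window that can be compared with TOTAL_CHAOS_DURATION.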