Failing to inject Network Fault

gitRam18 commented 4 years ago

Describe the issue: I am trying out Mangle deployed in an OpenShift environment. I am trying to inject a network fault (packet delay or packet duplication) on a remote Linux machine within my organizational network, but the networkFault script fails with the error below:

com.vmware.mangle.utils.exceptions.MangleException: Precheck Failed with pre-requisites : tc is required, ip is required

The networkFault script was transferred to the machine successfully, and the SSH connection to execute the command also succeeded. The command execution itself is failing.

Steps to reproduce:

Open mangle via browser, add a remote linux machine as End Point with its uid/pwd as credentials.
Go to Fault Injection --> Network Faults --> Packet Delay
Select remote linux machine as EndPoint, eth0 as Nic name, 1s packet delay, 3s time out
Run Fault.

Logs:

2020-04-24 16:25:07.640 [SystemResourceFaultTaskHelper-1587745499241] DEBUG com.vmware.mangle.task.framework.helpers.CommandInfoExecutionHelper.getAbsoluteCommand (195) - Absolute Command is /home/infra1/temp/networkFault.sh --operation=inject --faultOperation=NETWORK_DELAY_MILLISECONDS --latency=1000 --percentage=0 --nicName=eth0 --timeout=5000
2020-04-24 16:25:07.641 [SystemResourceFaultTaskHelper-1587745499241] INFO com.vmware.mangle.utils.clients.ssh.SSHUtils.runCommandReturningResult (156) - Running Command ...
2020-04-24 16:25:10.471 [SystemResourceFaultTaskHelper-1587745499241] DEBUG com.vmware.mangle.utils.clients.ssh.SSHUtils.runCommandReturningResult (165) - SSH Connected Successfully
2020-04-24 16:25:11.473 [SystemResourceFaultTaskHelper-1587745499241] DEBUG com.vmware.mangle.utils.clients.ssh.SSHUtils.runCommandReturningResult (169) - Command-output: Precheck Failed with pre-requisites : tc is required, ip is required
2020-04-24 16:25:11.474 [SystemResourceFaultTaskHelper-1587745499241] DEBUG com.vmware.mangle.utils.clients.ssh.SSHUtils.runCommandReturningResult (170) - exit-status: 127
2020-04-24 16:25:11.477 [SystemResourceFaultTaskHelper-1587745499241] ERROR com.vmware.mangle.services.tasks.executor.TaskExecutor.runTask (231) - Task Execution Failed. Reason: ErrorCode : FI0015, ErrorMessage : Execution of Command: /home/infra1/temp/networkFault.sh --operation=inject --faultOperation=NETWORK_DELAY_MILLISECONDS --latency=1000 --percentage=0 --nicName=eth0 --timeout=5000 failed. errorCode: 127 output: Precheck Failed with pre-requisites : tc is required, ip is required .
com.vmware.mangle.utils.exceptions.MangleException: Precheck Failed with pre-requisites : tc is required, ip is required

Expected Behavior: Packet delay should have been injected on device eth0 of the machine.

Additional Info: I tried a similar script (networkFault.sh) directly on the machine and was able to inject the fault. I used the same user credentials as entered in Mangle. I also verified that the tc and ip utilities are available on the machine and that I can run them manually.
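
Since exit status 127 is what a shell conventionally returns for a command that was not found, one way to narrow this down is to list which required tools the current PATH cannot resolve. A minimal sketch, assuming a POSIX shell; `missing_tools` is a hypothetical helper and not the actual precheck from networkFault.sh:

```shell
# Print the name of every given command that cannot be resolved
# on the current PATH (one name per line, nothing if all are found).
# Hypothetical helper for illustration, not taken from networkFault.sh.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: report whether the two tools named in the error are visible.
# missing_tools tc ip
```

Running the same check through a non-interactive session (for example `ssh user@host 'echo "$PATH"; command -v tc; command -v ip'`, where `user@host` is a placeholder) and comparing it with an interactive login would show whether the two environments see different PATH values.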

jayasankarr1990 commented 4 years ago

This looks strange. If the script ran successfully on the machine, ideally it should also run when triggered from Mangle. Can you share the input command and the output when you run the script on the machine?

gitRam18 commented 4 years ago

Here is the output of the commands I ran on the same machine, as the same user, and from the same folder I specified in Mangle for copying the script:

infra1@uklvadsb0257[DEV][~] $ cd temp

infra1@uklvadsb0257[DEV][temp] $ whoami
infra1

infra1@uklvadsb0257[DEV][temp] $ pwd
/home/infra1/temp

infra1@uklvadsb0257[DEV][temp] $ tc qdisc show
qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc pfifo_fast 0: dev eth0 parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc noqueue 0: dev docker0 root refcnt 2
qdisc noqueue 0: dev veth608d37b root refcnt 2
qdisc noqueue 0: dev vethdc27dc7 root refcnt 2

infra1@uklvadsb0257[DEV][temp] $ tc qdisc add dev eth0 root netem delay 500ms
RTNETLINK answers: Operation not permitted

infra1@uklvadsb0257[DEV][temp] $ sudo tc qdisc add dev eth0 root netem delay 500ms

infra1@uklvadsb0257[DEV][temp] $ tc qdisc show
qdisc noqueue 0: dev lo root refcnt 2
qdisc netem 8018: dev eth0 root refcnt 5 limit 1000 delay 500.0ms
qdisc noqueue 0: dev docker0 root refcnt 2
qdisc noqueue 0: dev veth608d37b root refcnt 2
qdisc noqueue 0: dev vethdc27dc7 root refcnt 2

infra1@uklvadsb0257[DEV][temp] $

djb4ai commented 4 years ago

@jayasankarr1990 I tried the same thing. I think the tc command is not found because /sbin is not in PATH when the command is run on the fly over ssh. When I ran $ ssh username@ip "export PATH='/sbin:$PATH';tc" it worked fine.
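
The workaround above can be sketched as a small helper that adds the sbin directories to a PATH value only when they are missing. This is an illustration, assuming a POSIX shell on the remote machine and assuming tc and ip live in /sbin or /usr/sbin; `with_sbin` is a hypothetical name, not part of Mangle:

```shell
# Print the given PATH value with /usr/sbin and /sbin prepended
# when they are not already present. Hypothetical helper for illustration.
with_sbin() {
    new_path=$1
    for dir in /sbin /usr/sbin; do
        case ":$new_path:" in
            *":$dir:"*) ;;                                   # already present
            *) new_path="$dir${new_path:+:$new_path}" ;;     # prepend it
        esac
    done
    printf '%s\n' "$new_path"
}

# Example: apply it to the current environment before calling tc or ip.
# PATH=$(with_sbin "$PATH")
```

When passing a command to ssh, single quotes around the remote command (for example `ssh username@ip 'PATH=/sbin:$PATH; tc qdisc show'`) let `$PATH` expand on the remote machine; inside double quotes the local shell expands it before ssh runs.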

aswathy-ramabhadran commented 4 years ago

@lladhibhutall thanks for the feedback. We will fix this. @gitRam18 could you confirm if the workaround suggested by @lladhibhutall works for you as well?

gitRam18 commented 4 years ago

@lladhibhutall and I tried this together. Please check whether the path should be /usr/sbin or /sbin, whichever is more appropriate from a security perspective, and let us know once it is fixed.

One more request I have is:

Please do not delete the log file; keep it for verification after remediation.

A couple of questions I have are:

I have not tried this, but what happens if I give a latency (e.g. 10 sec) larger than the timeout period (5 sec)?
Is it possible to defer the auto-remediation and leave it to the user to trigger it via the UI? Maybe with some kind of alert on the UI after the timeout.

aswathy-ramabhadran commented 4 years ago

@gitRam18 there is a reason for making auto-remediation mandatory for network faults. If an unusually high value is set for, say, latency, then remote access (using PuTTY and other tools) to the system under test would start failing. You would have to know where the machine is deployed and remediate manually. In such cases it is better to have auto-remediation in place. Let me know if you disagree with this reasoning.

gitRam18 commented 4 years ago

I understand, but my challenge is how to verify whether the fault has been injected or not. Can you suggest any tools to capture results before they get auto-remediated? As I mentioned, the logs are also deleted at the end.

Could you please also respond to my first question regarding latency > timeout, and how the latency is effective in that case if you remediate automatically?

One more thing: do you have a new package with the path corrections for sbin?

Please advise.

jayasankarr1990 commented 4 years ago

We will allow logs to be available after remediation and fix the sbin issue in the next release. We will update the bug once it is fixed. Regarding latency > timeout, the latency will be effective until the timeout happens.
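
Until logs are retained, one way to confirm the fault while it is active is to inspect the qdisc on the target before the timeout expires. A minimal sketch, assuming the delay is applied as a netem qdisc (as in the manual `tc qdisc add ... netem delay` test earlier in this thread); `has_netem` is a hypothetical helper name:

```shell
# Succeed if the `tc qdisc show` output read from stdin lists a netem
# qdisc attached to the given device. Hypothetical helper for illustration.
# The device name is used inside a regular expression, so names containing
# regex metacharacters would need escaping.
has_netem() {
    grep -Eq "^qdisc netem [^ ]+ dev $1 "
}

# Example, run on the target machine during the fault window:
# tc qdisc show | has_netem eth0 && echo "netem active on eth0"
```

Listing qdiscs did not need sudo in the session shown earlier, so this check can run as the same unprivileged user.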

vhkilari commented 3 years ago

The issue is resolved with the 3.0.0 release. Please upgrade to 3.0.0 and verify.

vhkilari commented 3 years ago

The documentation for upgrading Mangle is available at https://vmware-1.gitbook.io/mangle/mangle-administration/supported-deployment-models#upgrading-existing-instances-of-mangle

aswathy-ramabhadran commented 3 years ago

Closing the issue as fixed in 3.0.

vmware / mangle

Failing to inject Network Fault #33