google / testrun

A tool to automate verification of network-based device behavior
Apache License 2.0
28 stars 9 forks source link

security.services.http Sporadic Missing HTTP Requests #885

Open duncangreene opened 4 weeks ago

duncangreene commented 4 weeks ago

Describe the bug Sporadically (haven't been able to deduce how often), some HTTP Requests (I think just HTTP 1.1) are seen missing from the services.pcap.

Image

This causes a false "No HTTP server found" to be recorded in the report, despite the DUT running a HTTP server.

Image

To Reproduce Steps to reproduce the behavior:

  1. Run services test.

Expected behavior All HTTP Requests should be present.

Image

Error logs allhttprequests.zip missinghttprequests.zip

Environment (please provide the following information about your setup):

Additional context DUT is https://www.ismacontrolli.com/en/isma-b-mix18-ip.html

jhughesbiot commented 4 weeks ago

The presence of HTTP traffic does not indicate an HTTP server running on the device. For that first example you provided, the traffic detected is to port 502, which is modbus traffic. I confirmed modbus traffic is present on the report in the protocol tests.

In the second example, confirming with the report, it does detect the HTTP server being open. This could be a timing issue with the device in question and the testing steps. i.e., example one tested services module before the device was in a steady state where the web server was ready vs the second where it was.

Any additional steps for testing to produce this would be helpful in determining the root cause.

duncangreene commented 4 weeks ago

Aha, I hadn't registered the port number in the first 3 telegrams.

The issue appears to be the absence of the outgoing requests from Testrun (10.10.10.9) to port 80 in the first example. What would cause Testrun to not send these outgoing requests?

jhughesbiot commented 4 weeks ago

The services module uses an nmap scan of the ports of interest so there should be no difference between one run and the next. I suspect the device is not responding to that process when you are seeing this which would indicate a different testing procedure sequence between the tests (stable state vs non-stable state) or inconsistent behavior of the device. Without having the full logs and testing procedure for each outcome, it's difficult to say exactly what the reason for this result is.

duncangreene commented 4 weeks ago

The original post contains the two test .zip outputs, if the log files you're after are included in there? Let me know if there are other logs etc. to include.

I can see from the services.pcap in missinghttprequests.zip that the DUT doesn't SYN ACK to the port 80 SYN, however this device does indeed run a bona fide web server as part of its 'normal' operation (and as evidenced in the respective services.pcap in allhttprequests.zip).

My first (late night) thought, with an eye on manufacturers potentially implementing minimum viable resources in embedded devices such as these to service 'normal' communications, is that maybe it's struggling to service this quantity or type of (potentially not-normal) communication at once? Are there any configurable rate-limits to testrun available? Akin to the timing templates offered by nmap? My first thought is to slow the outgoing Testrun communications down a bit, to see whether that delivers the expected result 100% of the time?

I've tested this same device a handful of times in Testrun v1.4, I will double check tomorrow whether I saw same false negatives in those reports too.

duncangreene commented 4 weeks ago

Alas, max_device_reports = 5 has rolled over all but one of my v1.4 reports, and that report was for an NTP-only test. I did find another v1.4 report for this device in my archives, and that showed a true positive as expected.

Image

Just for added info, the other 2 reports I've done in v2.0 showed a false negative and true positive (so a 50% hit rate across 4 tests done in v2.0 so far). I appreciate Testrun can only report what it sees (and if a web server doesn't respond first time then it might be okay to assume it's not running one), but I do wonder whether it's worth investigating a slower scan and/or a retry mechanism maybe?

I believe nmap implements some sort of dynamic timing model based on RTT's (probably more to cater for network latency which should be negligible in the case of a Testrun environment), but whether that's fully leveraged in Testrun? I'm unsure if nmap is used in isolation in the different tests (i.e. the dynamic timing 'learnt' by nmap running within the Protocol module is unknown to nmap running within the Services module). Spitballing here a bit! 😅

The other option might be to implement a retry mechanism, maybe with exponential backoff, but I imagine this might clash with or delay other tests.

jhughesbiot commented 3 weeks ago

You can change the max_device_reports to zero to prevent rollover and disable the report limit per device.

nmap is run within the services module and the results of that scan are what is used to determine the results.

Can you elaborate on your testing procedure? i.e., are you fully powering your device down between each test or are you just re-testing multiple times with the device powered on the whole time? I would like to eliminate difference in device state and testing procedure as a possible issue.

duncangreene commented 3 weeks ago

I can confirm that the device is/was fully powered down between each test. Device is only powered on once 'Waiting for Device' is displayed.

There are no other external influences to the device at all during the test either (e.g. there's no HMI that could be tinkered with or temp sensors connected to the inputs, etc.). DUT is directly connected to onboard Ethernet adapter of Testrun machine (no switch etc. in between).

duncangreene commented 3 weeks ago

A potential development...

As mentioned above, in line with the below I was using the onboard Ethernet adapter on my Testrun machine to connect to the DUT in the aforementioned tests.

Image

I swapped the interfaces round this evening so that a USB Ethernet adapter was connected to the DUT, and so far 5 out of 5 tests have all returned true positives. This would echo the results I saw in v1.4, where I'm quite sure I only ever used a USB Ethernet adapter to connect to the DUT.

I'll run a few more tests using USB Ethernet over the next few days, and I can swap back to using onboard Ethernet to see if that causes the problem to come back again, but so far the signs seem to be pointing to using the onboard Ethernet adapter to connect to the DUT as causing the problem. What's strange is that only the HTTP test seemed to exhibit unreliable results, however this device doesn't have any other 'active' services (that could trigger false negatives) so unsure whether there's any weight to this particularly.

Presumably there is a reason for allowing use of only 1 onboard Ethernet adapter, and not 2, what's the background there? Why the need for USB Ethernet adapter(s) specifically? I'd be interested to know the technicals on this if poss!

jboddey commented 3 weeks ago

To increase the accuracy of the http service test (and potentially other port tests) we should utilize nmap's rate limiting arguments to avoid port scan detection in devices and devices with low processing capacity.

duncangreene commented 3 weeks ago

Presumably there is a reason for allowing use of only 1 onboard Ethernet adapter, and not 2, what's the background there? Why the need for USB Ethernet adapter(s) specifically? I'd be interested to know the technicals on this if poss!

@jboddey @jhughesbiot Did you have any background on the USB adapter vs. onboard adapter above?

jboddey commented 3 weeks ago

Presumably there is a reason for allowing use of only 1 onboard Ethernet adapter, and not 2, what's the background there? Why the need for USB Ethernet adapter(s) specifically? I'd be interested to know the technicals on this if poss!

@jboddey @jhughesbiot Did you have any background on the USB adapter vs. onboard adapter above?

I'm not yet sure on the reason for differing results. However, the usb adapter is required for the device connection as it allows us to control the power to the adapter to simulate a port toggle event for testing.

duncangreene commented 2 weeks ago

@jboddey Aha, so the USB Ethernet adapter specifically has to be used to connect to the DUT, and the onboard Ethernet adapter (if used) must only be used for the Internet connection? If this is the case this may warrant an update to the documentation, as in the past I'd been using the two interchangeably.

Am I right in thinking the Ethernet adapter power cycle is done as part of the connection.dhcp_disconnect test? If so, I would add that when using onboard Ethernet to connect to the DUT, no Errors appeared in the report at least from a connection.dhcp_disconnect PoV. I can raise this as a separate issue if this behaviour isn't expected?

jboddey commented 2 weeks ago

@duncangreene I will definitely update the docs to state that the USB ethernet should be used for the DUT. Could you send over the ZIP file for an attempt where you have used the onboard adapter for the DUT and received no errors?

duncangreene commented 2 weeks ago

@jboddey the two .zips in the OP were both carried out with onboard Ethernet adapter connected to the DUT, and neither reported an Error result for connection.dhcp_disconnect in the PDF report at least.

jboddey commented 2 weeks ago

@jhughesbiot Can you confirm if the dhcp_disconnect test should work with onboard ethernet? My understanding was that we were toggling the power to the USB adapter.

duncangreene commented 2 weeks ago

@jhughesbiot Can you confirm if the dhcp_disconnect test should work with onboard ethernet? My understanding was that we were toggling the power to the USB adapter.

@jboddey I can confirm that I do not see the USB Ethernet adapter lose power during the connection.dhcp.disconnect test (the link lights on both USB Ethernet adapter and DUT remain on throughout). I have a video recording of this that I can supply if needed. This may warrant spinning out into its own separate issue if it is not the expected behaviour.

Oct 30 11:17:52 test_connection INFO    Running connection.dhcp.disconnect
Oct 30 11:17:52 test_connection INFO    Successfully resolved iface status
Oct 30 11:17:52 test_connection INFO    Resolving current lease with max wait time of 60 seconds
Oct 30 11:17:52 test_connection INFO    DHCP lease resolved from primary server
Oct 30 11:17:52 test_connection INFO    DHCP Lease resolved:
Oct 30 11:17:52 test_connection INFO    Current device lease resolved
Oct 30 11:17:52 test_connection INFO    Lease IP Resolved: 10.10.10.15
Oct 30 11:17:52 test_connection INFO    Attempting to ping device...
Oct 30 11:17:52 test_connection INFO    Ping Success: True
Oct 30 11:17:52 test_connection INFO    Current lease confirmed active in device
Oct 30 11:17:52 test_connection INFO    Device interface set to down state
Oct 30 11:17:52 test_connection INFO    Time until lease expiration: 15.662215
Oct 30 11:17:52 test_connection INFO    Waiting for current lease to expire: 2024-10-30 11:18:03+00:00
WARNING: Socket <scapy.arch.linux.L2ListenSocket object at 0x7fa964b85f00> failed with '[Errno 100] Network is down'. It was closed.
Oct 30 11:18:08 test_connection INFO    Current lease expired.
Oct 30 11:18:08 test_connection INFO    Waiting 10 seconds before bringing iface back up
Oct 30 11:18:18 test_connection INFO    Device interface set to up state
Oct 30 11:18:18 test_connection INFO    Resolving current lease with max wait time of 60 seconds
Oct 30 11:18:33 test_connection INFO    DHCP lease resolved from primary server
Oct 30 11:18:33 test_connection INFO    DHCP Lease resolved:
Oct 30 11:18:33 test_connection INFO    Lease IP Resolved: 10.10.10.15
Oct 30 11:18:33 test_connection INFO    Attempting to ping device...
Oct 30 11:18:33 test_connection INFO    Ping Success: True
Oct 30 11:18:33 test_connection INFO    Current lease confirmed active in device
Oct 30 11:18:33 test_connection INFO    Running connection.dhcp.disconnect_ip_change
duncangreene commented 2 weeks ago

I'll run a few more tests using USB Ethernet over the next few days...

I can confirm that I have now seen the false negative security.services.http test occur whilst using USB Ethernet adapter attached to the DUT too. Corresponding report.zip available below.

report (12).zip

jhughesbiot commented 2 weeks ago

A USB adapter should not be necessary for the DHCP disconnect tests. The power is not shut off to the adapter/port, it is only set to a down state via the command:

ip link set dev <iface_name> down

The documentation on this topic likely needs to be updated as there should be no hard requirement for built in vs USB adapters as the system does not do anything different during testing no matter what is being used.

Reviewing the latest report, I do see the lack of an HTTP port 80 being detected and inspecting the traffic, I can also see Testrun via nmap attempt to communicate to the device on port 80 and not getting expected port open responses.

It also appears the device was not powered down before the test was started as it was detected in less than 1 second of the network being ready. I'm not familiar with this specific device so just some thoughts on what could be the cause here we could look at:

Is there any chance this HTTP server was not available or disabled for some reason?
Is there some sort of steady state it needs to get to before it becomes available?
Does it detect it on a second or third test attempt without power cycles in between tests? Does keeping the device powered up longer than the current monitor period from a powered down state allow this stable state to be reached?

duncangreene commented 1 week ago

It also appears the device was not powered down before the test was started as it was detected in less than 1 second of the network being ready.

An excerpt from the video mentioned above, this is the powering up of the DUT around 10 seconds after the "Waiting for Device - Please connect and power on your device" message appears. The DUT can be powered via USB and you'll hear my other laptop detect and power it around 19 seconds into the clip.

https://github.com/user-attachments/assets/d6e7b825-a8cf-4ae5-9f20-a961323c2260

With an eye on your note about the device being detected in less than 1 second of the network being ready, I would add, it does appear to be a good while after the user is prompted to "Please connect and power on your device" before the "Network is ready" log message is produced (around 40 seconds on this machine), is this expected?

https://github.com/user-attachments/assets/1d757288-00b7-4767-a55a-c3d1862334ab

I will see if I can investigate answers to the questions, although I wonder whether leaving this on a recursive extended Testrun soak test (maybe specifically against security.services.http) might be worthwhile. @jboddey is this possible (and/or do one of the team maybe have the DUT and a suitable environment available at FR) to conduct if possible? Or whether we potentially seek to leverage nmap's rate limiting for a start, to see if that alleviates the false negatives?

jboddey commented 1 week ago

Initially I would like to try rate limiting nmap. It is important that test attempts do not take a long amount of time where possible.