google / testrun

A tool to automate verification of network-based device behavior
Apache License 2.0

Consider delaying the start of testing by 30 seconds #933

Open duncangreene opened 2 days ago

duncangreene commented 2 days ago

What is the problem your feature is trying to solve?

Access layer switchports configured without a feature such as PortFast will take around 30 seconds to come up whilst the switchport goes through the STP listening and learning states. During this time, a poorly implemented end device could give up if the DHCP server is unreachable at power-on, and potentially revert to APIPA or similar. Whether the end device continues to send DHCP Discover messages after reverting to APIPA or similar could be device-specific.

Describe the solution you think would solve the problem

Testing, other than that proposed in #932, should be delayed until the emulated STP listening and learning states have passed. This would more accurately reflect the environment of a managed network.

Additional context

There is a slight added bonus here: this would also capture any devices that give up sending DHCP Discover messages and/or revert to APIPA after X amount of time. DHCP server unavailability at end device power-on is a very real possibility, particularly with black building starts, ISTs, etc.

jhughesbiot commented 2 days ago

There is a default 60 seconds wait period after device detection for the device to resolve an IP address. After that, the default monitor period is 5 minutes (300 seconds) before any testing starts. Both of these values are configurable so this scenario should already be accounted for.
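To make the timing concrete, here is a rough sketch of the arithmetic. The constant and function names are illustrative only and are not Testrun's actual configuration keys. It also assumes that the monitor period starts once the wait period ends, and that any initial outage (such as the roughly 30 seconds described in the issue) begins at device detection.

```python
# Illustrative values taken from the defaults quoted above.
# These names are hypothetical, not Testrun configuration keys.
STARTUP_WAIT_S = 60      # wait after device detection for an IP address
MONITOR_PERIOD_S = 300   # monitor period before any testing starts


def remaining_dhcp_window(startup_wait_s: int, blackout_s: int) -> int:
    """Seconds left in the wait period once an initial outage has ended."""
    return max(0, startup_wait_s - blackout_s)


def latest_test_start(startup_wait_s: int, monitor_period_s: int) -> int:
    """Latest test start after detection, assuming the periods run back to back."""
    return startup_wait_s + monitor_period_s


if __name__ == "__main__":
    print(remaining_dhcp_window(STARTUP_WAIT_S, 30))
    print(latest_test_start(STARTUP_WAIT_S, MONITOR_PERIOD_S))
```

Under these assumptions, a 30-second outage would leave 30 seconds of the default wait period for the device to obtain an address.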

The issues you mention seem more aligned with testing network behavior than with testing the device. That is one reason specific switch configurations are out of scope for testing: there are too many factors in a network that we do not intend to account for when it comes to specific device behaviors.

duncangreene commented 1 day ago

> There is a default 60 seconds wait period after device detection for the device to resolve an IP address.

In this proposal the idea is to emulate the 15 second listening and 15 second learning STP states that could be expected on a managed network. During the listening and learning states (i.e. before the switchport reaches the forwarding state), all frames from the end device (e.g. DHCP Discover messages) are dropped, with the exception of BPDUs. If we were to emulate this, no services (DHCP in particular) should be available to the DUT until we reach the emulated STP forwarding state.
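As a minimal sketch of what "emulated" could mean here, the port state can be modelled purely as a function of time since link-up. This is not Testrun code; `emulated_port_state` and `services_available` are hypothetical names, and the 15 second value is the per-state duration described above.

```python
from enum import Enum

# Duration of each of the listening and learning states, as described above.
FORWARD_DELAY_S = 15.0


class PortState(Enum):
    LISTENING = "listening"
    LEARNING = "learning"
    FORWARDING = "forwarding"


def emulated_port_state(elapsed_s: float,
                        forward_delay_s: float = FORWARD_DELAY_S) -> PortState:
    """Return the emulated STP state for a given time since link-up."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    if elapsed_s < forward_delay_s:
        return PortState.LISTENING
    if elapsed_s < 2 * forward_delay_s:
        return PortState.LEARNING
    return PortState.FORWARDING


def services_available(elapsed_s: float,
                       forward_delay_s: float = FORWARD_DELAY_S) -> bool:
    """Services (e.g. DHCP) are reachable only in the forwarding state."""
    return emulated_port_state(elapsed_s, forward_delay_s) is PortState.FORWARDING


if __name__ == "__main__":
    for t in (0, 14.9, 15, 29.9, 30):
        print(t, emulated_port_state(t).value, services_available(t))
```

A hook like `services_available` could gate when the DHCP server starts answering, but how that would be wired into the tool is left open.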

jhughesbiot commented 1 day ago

With an already 60 second default delay (configurable) from power up to timeout for DHCP, I don't believe there's any additional work necessary to do what you are asking.

Concerning the network timing, this tool is designed around validating and implementing devices on a Google network, so the requirements are for that specific configuration and implementation. It is not intended to account for all network configurations that exist. If there are use cases beyond this scope that you feel are necessary for your testing, you can submit a PR for review, and we can work with you to confirm the best way of adding these types of features.

duncangreene commented 1 day ago

> With an already 60 second default delay (configurable) from power up to timeout for DHCP

The proposal here is that the Testrun DHCP server (and other services) should not be available for at least 30 seconds after DUT power up/network interface up.

This may be better simply envisaged as a "DHCP server unavailable" test, to see how the DUT reacts when a DHCP server is unreachable for X amount of time. This could very likely happen in the following scenarios.

The device could react by reverting to APIPA or similar (in itself potentially not a problem), but crucially it could give up sending DHCP Discover messages due to poor implementation, likely requiring a power cycle to resolve. The implications of this on a device are large, as it is not always feasible to power down/power up OT kit ad hoc (critical loads, lighting devices, inrush currents, etc.).
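One way such a "DHCP server unavailable" check could be evaluated is sketched below. It works only on timestamps of DHCP Discover messages observed from the DUT (seconds since link-up) and, optionally, the address the DUT ended up using. All function names and result labels are hypothetical, and capturing the packets is out of scope for the sketch.

```python
import ipaddress
from typing import Optional, Sequence

# IPv4 link-local range used by APIPA.
APIPA_NET = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address: str) -> bool:
    """True if the address lies in the IPv4 link-local (APIPA) range."""
    return ipaddress.ip_address(address) in APIPA_NET


def classify_outage_response(discover_times_s: Sequence[float],
                             outage_s: float,
                             observe_until_s: float,
                             dut_address: Optional[str] = None) -> str:
    """Classify DUT behaviour after a DHCP outage of outage_s seconds.

    The DUT counts as still retrying if at least one Discover is seen
    between the end of the outage and the end of the observation window.
    """
    retried = any(outage_s <= t <= observe_until_s for t in discover_times_s)
    fell_back = dut_address is not None and is_apipa(dut_address)
    if retried:
        return "kept-retrying-apipa" if fell_back else "kept-retrying"
    return "gave-up-apipa" if fell_back else "gave-up"


if __name__ == "__main__":
    # Example timestamps are made up for illustration.
    print(classify_outage_response([1, 5, 13, 29, 61], 30, 120))
    print(classify_outage_response([1, 5, 13], 30, 120, "169.254.3.4"))
```

The "gave-up" outcomes are the ones of concern here, since they suggest the device would need a power cycle to recover.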