ovn-org / ovn-heater

Mega script to deploy/configure/run OVN scale tests.
Apache License 2.0

ovn-tester: Most timers are not useful #85

Open putnopvut opened 2 years ago

putnopvut commented 2 years ago

I recently did an audit of all places where the @ovn_stats.timeit decorator is used. What I found was that the only truly useful places are in:

* `WorkerNode.wait()`: After provisioning a worker node, this waits for the corresponding `Chassis` record to appear in the SBDB.

* `WorkerNode.ping_port()`: This determines how long it takes before pings are successful [1].

The rest of the timed operations essentially call ovn-nbctl a bunch of times. Since we use the ovn-nbctl daemon on the central nodes, any ovn-nbctl command is likely to complete near-instantly, especially write-only operations. Therefore, what we're really timing here is the roundtrip time for SSH command execution, not OVN. As an example, look at the Namespace.add_ports graph in the comment below this one. Aside from an oddity on iteration 17, the iterations take around a tenth of a second. This is because all that's being measured here are some ovn-nbctl calls that add addresses to an address set and add ports to a port group.
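
To make the point concrete, here is a minimal sketch of what such a timing decorator ends up measuring when the decorated function shells out over SSH. This is a simplified stand-in, not the actual ovn_stats.timeit implementation, and the helper and node names are illustrative; everything inside the wrapper, including the SSH round trip, lands in the reported duration:

```python
import functools
import subprocess
import time


def timeit(func):
    """Report how long 'func' takes, in the spirit of @ovn_stats.timeit."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f'{func.__name__}: {time.perf_counter() - start:.3f}s')
        return result
    return wrapper


@timeit
def add_port_to_port_group(central_node, pg, *ports):
    # The measured time is dominated by the ssh round trip; the write
    # itself is handled near-instantly by the ovn-nbctl daemon.
    subprocess.run(['ssh', central_node, 'ovn-nbctl', 'pg-set-ports', pg,
                    *ports], check=True)
```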

The oddity at iteration 17 is interesting, but it runs afoul of a second issue with ovn-tester's timers: the timers do a poor job of illustrating the bottleneck(s). If you look at that graph, can you determine why iteration 17 took 25 seconds instead of a tenth of a second? That could be due to a network error that caused one or more SSH commands to take multiple attempts. Or it may be that the ovn-nbctl daemon got disconnected from the NBDB temporarily and had to reconnect, pulling down thousands of records and delaying the execution of a queued command. Or it could be something else entirely.

This problem also extends to the "useful" timers I mentioned above. When we time WorkerNode.ping_port(), the measurement includes the SSH connection overhead, plus Python client code execution (such as multiple datetime.now() calls). Therefore, if we see an oddity in a graph, it's difficult to pin the blame directly on OVN. We could have just lost network connectivity between ovn-tester and the worker node, for instance.
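
For illustration, a simplified wait-for-ping loop might look like the sketch below (run_ssh and its parameters are hypothetical, not actual ovn-tester code). Note how the SSH execution and the Python-side bookkeeping both land inside the measured interval:

```python
import subprocess
import time
from datetime import datetime


def run_ssh(node, cmd):
    # Hypothetical helper: run 'cmd' on 'node' over SSH, return its exit code.
    return subprocess.run(['ssh', node, cmd]).returncode


def ping_port(node, dest_ip, timeout=60.0):
    # Everything between the two datetime.now() reads counts toward the
    # reported time: SSH setup, the ping itself, and the sleep/poll loop.
    start = datetime.now()
    while (datetime.now() - start).total_seconds() < timeout:
        if run_ssh(node, f'ping -c 1 -W 1 {dest_ip}') == 0:
            return (datetime.now() - start).total_seconds()
        time.sleep(1)
    raise TimeoutError(f'{dest_ip} never became reachable from {node}')
```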

How can we fix this? There are a few steps to take:

  1. Determine what measurements we actually care about. There are several things we are measuring right now that likely can just be removed.
  2. Ensure we have the needed tools to measure what we want. As an example, let's say we want to determine the time it takes from when a logical switch port is created to when it is installed on a node. We would probably want to take the following measurements (see the sketch after this list):
    • Timestamp when ovn-tester issued the command to add the logical switch port
    • Timestamp when the logical switch port was added to the NBDB
    • Timestamp when the corresponding Port Binding was added to the SBDB
    • Timestamp when the corresponding Interface in OVS has external_ids:ovn-installed set.
    • (Possibly) Timestamp when certain expected flows are installed. Right now, there is nothing that can monitor the databases in real time and note the necessary timestamps.
  3. Change the ovn-tester code to not have waits in the middle of tests. If we have the tools to measure times appropriately, then ovn-tester can operate by hammering OVN with commands. After the test is complete, we can then gather the statistics and log them.
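
As a rough illustration of what we could do with the measurements listed in step 2, here is a sketch that turns the four timestamps into per-component latencies instead of a single opaque number. The field names are hypothetical, not existing ovn-tester data:

```python
from dataclasses import dataclass


@dataclass
class PortTimestamps:
    """Hypothetical per-port timestamps collected during a test run."""
    nbctl_issued: float    # ovn-tester issued the lsp-add command
    nbdb_committed: float  # Logical_Switch_Port appeared in the NBDB
    sbdb_bound: float      # Port_Binding appeared in the SBDB
    ovn_installed: float   # external_ids:ovn-installed set on the OVS Interface


def stage_latencies(ts: PortTimestamps) -> dict:
    # Attribute the end-to-end time to the individual OVN components.
    return {
        'tester -> NBDB': ts.nbdb_committed - ts.nbctl_issued,
        'NBDB -> SBDB (ovn-northd)': ts.sbdb_bound - ts.nbdb_committed,
        'SBDB -> OVS (ovn-controller)': ts.ovn_installed - ts.sbdb_bound,
        'end-to-end': ts.ovn_installed - ts.nbctl_issued,
    }
```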

[1] The usefulness of timing pings may disappear when asyncio code is merged. The asyncio PR removes all pings that were used to determine when a port comes up. Instead, it uses ovn-installed-ts. The remaining use case for pings is for testing ACLs, and in that case, the fact the pings succeed is what's important, not how long it takes for the pings to start succeeding.

putnopvut commented 2 years ago

I removed a link from the issue report because it is not publicly accessible. Instead, I'll embed the graph I was referring to here: [graph: add_ports]

dceara commented 2 years ago

I recently did an audit of all places where the @ovn_stats.timeit decorator is used. What I found was that the only truly useful places are in:

* `WorkerNode.wait()`: After provisioning a worker node, this waits for the corresponding `Chassis` record to appear in the SBDB.

* `WorkerNode.ping_port()`: This determines how long it takes before pings are successful [1] .

The rest of the timed operations essentially call ovn-nbctl a bunch of times. Since we use the ovn-nbctl daemon on the central nodes, any ovn-nbctl command is likely to complete near-instantly, especially write-only operations. Therefore, what we're really timing here is the roundtrip time for SSH command execution, not OVN. As an example, look at the Namespace.add_ports graph in the comment below this one. Aside from an oddity on iteration 17, the iterations take around a tenth of a second. This is because all that's being measured here are some ovn-nbctl calls that add addresses to an address set and add ports to a port group.

I don't really agree with this statement. ovn-nbctl commands (on the client side) wait until the nbctl daemon (server side) has processed the unixctl command (unixctl_client_transact()). So we're not only measuring the roundtrip time for the SSH command execution; we may also detect if the nbctl daemon is busy at that point.

The oddity at iteration 17 is interesting, but it runs afoul of a second issue with ovn-tester's timers: the timers do a poor job of illustrating the bottleneck(s). If you look at that graph, can you determine why iteration 17 took 25 seconds instead of a tenth of a second? That could be due to a network error that caused one or more SSH commands to take multiple attempts. Or it may be that the ovn-nbctl daemon got disconnected from the NBDB temporarily and had to reconnect, pulling down thousands of records and delaying the execution of a queued command. Or it could be something else entirely.

IMO it's still useful to know that something happened around the time we were doing that nbctl command. Compare that to just getting a measurement for the complete iteration, in which case we'd really have no clue where the problem is.

This problem also extends to the "useful" timers I mentioned above. When we time WorkerNode.ping_port(), the measurement includes the SSH connection overhead, plus Python client code execution (such as multiple datetime.now() calls). Therefore, if we see an oddity in a graph, it's difficult to pin the blame directly on OVN. We could have just lost network connectivity between ovn-tester and the worker node, for instance.

Again, I'm not sure I understand why it's not useful to know that "something" happened at this point.

How can we fix this? There are a few steps to take:

1. Determine what measurements we actually care about. There are several things we are measuring right now that likely can just be removed.

I agree, an audit is needed to remove some of the measurements (and maybe add others).

2. Ensure we have the needed tools to measure what we want. As an example, let's say we want to determine the time it takes from when a logical switch port is created to when it is installed on a node. We would probably want to take the following measurements:

We have most of these already, and for some of the others there's work in progress:

   * Timestamp when ovn-tester issued the command to add the logical switch port

This gets logged in the nbctl daemon logs. If we move to IDL, however, we'd need a way to log it there.

   * Timestamp when the logical switch port was added to the NBDB

https://patchwork.ozlabs.org/project/openvswitch/list/?series=262657&state=*

   * Timestamp when the corresponding Port Binding was added to the SBDB

https://patchwork.ozlabs.org/project/openvswitch/list/?series=262657&state=*

   * Timestamp when the corresponding Interface in OVS has `external_ids:ovn-installed` set.

https://github.com/ovn-org/ovn/commit/06a50caa39ccb7cdefb492b7e97ab2c2f4769df5

   * (Possibly) Timestamp when certain expected flows are installed.

How is this different from the above? Or do you mean in general, e.g., for ACL related flows?

     Right now, there is nothing that can monitor the databases in real time and note the necessary timestamps.

Indeed, I'm afraid the only option right now is to scrape logs from all components. However, if we make sure we have logs for all of the points above, another option would be to configure logs to be sent to a centralized syslog server (e.g., on the ovn-tester node). This would make data collection/analysis easier.

3. Change the ovn-tester code to not have waits in the middle of tests. If we have the tools to measure times appropriately, then ovn-tester can operate by hammering OVN with commands. After the test is complete, we can then gather the statistics and log them.

Agreed.

[1] The usefulness of timing pings may disappear when asyncio code is merged. The asyncio PR removes all pings that were used to determine when a port comes up. Instead, it uses ovn-installed-ts. The remaining use case for pings is for testing ACLs, and in that case, the fact the pings succeed is what's important, not how long it takes for the pings to start succeeding.

I don't agree with the second part of this statement. We actually are interested in "how long it takes" to enforce a network policy (i.e., the time it takes for traffic to be allowed by an ACL).

putnopvut commented 2 years ago

Thanks for the reply Dumitru. I've got comments in-line below.

I recently did an audit of all places where the @ovn_stats.timeit decorator is used. What I found was that the only truly useful places are in:

* `WorkerNode.wait()`: After provisioning a worker node, this waits for the corresponding `Chassis` record to appear in the SBDB.

* `WorkerNode.ping_port()`: This determines how long it takes before pings are successful [1] .

The rest of the timed operations essentially call ovn-nbctl a bunch of times. Since we use the ovn-nbctl daemon on the central nodes, any ovn-nbctl command is likely to complete near-instantly, especially write-only operations. Therefore, what we're really timing here is the roundtrip time for SSH command execution, not OVN. As an example, look at the Namespace.add_ports graph in the comment below this one. Aside from an oddity on iteration 17, the iterations take around a tenth of a second. This is because all that's being measured here are some ovn-nbctl calls that add addresses to an address set and add ports to a port group.

I don't really agree with this statement. ovn-nbctl commands (on the client side) wait until the nbctl daemon (server side) has processed the unixctl command (unixctl_client_transact()). So we're not only measuring the roundtrip time for the SSH command execution; we may also detect if the nbctl daemon is busy at that point.

Fair point. I guess it's not correct to say that we're only measuring the SSH roundtrip time, since it's possible that there is some issue with the nbctl daemon.

The oddity at iteration 17 is interesting, but it runs afoul of a second issue with ovn-tester's timers: the timers do a poor job of illustrating the bottleneck(s). If you look at that graph, can you determine why iteration 17 took 25 seconds instead of a tenth of a second? That could be due to a network error that caused one or more SSH commands to take multiple attempts. Or it may be that the ovn-nbctl daemon got disconnected from the NBDB temporarily and had to reconnect, pulling down thousands of records and delaying the execution of a queued command. Or it could be something else entirely.

IMO it's still useful to know that something happened around the time we were doing that nbctl command. Compare that to just getting a measurement for the complete iteration, in which case we'd really have no clue where the problem is.

I think I need to clarify what I mean when I say timers are not "useful", since in retrospect I was vague (and also partially incorrect). Think about it from the perspective of us trying to quantify how long it takes for OVN to perform operations. Using Namespace.add_ports as the example again, the time reported doesn't give any insight beyond the handoff from the ovn-nbctl daemon to the northbound database. Any time spent by ovn-northd, the southbound database, and ovn-controller is not reflected in this time. Wouldn't it also be useful to know when those Address_Set and Port_Group records showed up in the southbound database? Wouldn't it also be useful to know when those records caused a change in OVS?

In retrospect, I shouldn't have implied that timing the ovn-nbctl commands is useless. These readings can be good for diagnostic purposes, especially if we had a hunch there was some issue with the nbctl daemon. But when reporting how efficient OVN is at performing operations, this sort of timing only explains things at the surface level.

This problem also extends to the "useful" timers I mentioned above. When we time WorkerNode.ping_port(), the measurement includes the SSH connection overhead, plus Python client code execution (such as multiple datetime.now() calls). Therefore, if we see an oddity in a graph, it's difficult to pin the blame directly on OVN. We could have just lost network connectivity between ovn-tester and the worker node, for instance.

Again, I'm not sure I understand why it's not useful to know that "something" happened at this point.

How can we fix this? There are a few steps to take:

1. Determine what measurements we actually care about. There are several things we are measuring right now that likely can just be removed.

I agree, an audit is needed to remove some of the measurements (and maybe add others).

I would advise that this audit be carried out from a fresh perspective. In other words, if ovn-tester didn't exist yet and we wanted to start performance testing OVN, what measurements would we want? We can then compare that list of measurements to what we have and see how we can improve.

2. Ensure we have the needed tools to measure what we want. As an example, let's say we want to determine the time it takes from when a logical switch port is created to when it is installed on a node. We would probably want to take the following measurements:

We have most of these already, and for some of the others there's work in progress:

   * Timestamp when ovn-tester issued the command to add the logical switch port

This gets logged in the nbctl daemon logs. If we move to IDL, however, we'd need a way to log it there.

   * Timestamp when the logical switch port was added to the NBDB

https://patchwork.ozlabs.org/project/openvswitch/list/?series=262657&state=*

   * Timestamp when the corresponding Port Binding was added to the SBDB

https://patchwork.ozlabs.org/project/openvswitch/list/?series=262657&state=*

   * Timestamp when the corresponding Interface in OVS has `external_ids:ovn-installed` set.

ovn-org/ovn@06a50ca

Thanks for providing links to the relevant commits/patches. This goes to show that we at least partially have the necessary tools to measure times in a more fine-grained manner. The problem is that before we can know that we have all the tools we need, we need to do that first step of ensuring we know everything that we want to measure :) .

   * (Possibly) Timestamp when certain expected flows are installed.

How is this different from the above? Or do you mean in general, e.g., for ACL related flows?

Hopefully there is no difference. I don't think that for the case of adding a logical switch port this should actually be necessary, which is why I put the "(Possibly)" there. The reason this went through my head was just that I was trying to think of every timestamp we might possibly want to capture while adding the logical switch port. And if there is some discrepancy between when ovn-installed is set and when flows are installed, this reporting would tell us.

But generally, you're also correct that other operations may (currently) only be able to use the presence or absence of flows in the flow table as a way to know that the command has been processed. As you pointed out, ACLs are one of those situations. We don't have anything we can monitor in OVS except for flow installation.
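
For example, a flow-based check could be a simple poll like the sketch below. The chassis name, bridge, and flow pattern are placeholders; whatever the ACL actually translates into would determine the real pattern:

```python
import re
import subprocess
import time


def wait_for_acl_flow(chassis, pattern, timeout=60.0):
    """Poll the OpenFlow tables on 'chassis' until a flow matching
    'pattern' shows up, and return how long that took."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        flows = subprocess.run(
            ['ssh', chassis, 'ovs-ofctl', 'dump-flows', 'br-int'],
            capture_output=True, text=True).stdout
        if re.search(pattern, flows):
            return time.monotonic() - start
        time.sleep(1)
    raise TimeoutError(f'no flow matching {pattern!r} on {chassis}')
```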

     Right now, there is nothing that can monitor the databases in real time and note the necessary timestamps.

Indeed, I'm afraid the only option right now is to scrape logs from all components. However, if we make sure we have logs for all of the points above, another option would be to configure logs to be sent to a centralized syslog server (e.g., on the ovn-tester node). This would make data collection/analysis easier.

Yes, that is a potential option. Another option I thought of is to install agent programs on all nodes to monitor for the creation of records and then report timestamps for those events back to ovn-tester. I'm honestly not sure which of those ideas is actually going to be easier to implement.
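
As a rough sketch of what such an agent could look like, one option is to sit on `ovsdb-client monitor` and record a timestamp whenever a new Port_Binding row appears. The SBDB remote and the output parsing are schematic, and the reporting transport back to ovn-tester is left out:

```python
import subprocess
import time


def monitor_port_bindings(sb_remote, report):
    """Record a wall-clock timestamp whenever a Port_Binding row is
    inserted in the SBDB; 'report' would ship (port, timestamp) pairs
    back to ovn-tester (transport not shown).
    'sb_remote' is e.g. 'unix:/var/run/ovn/ovnsb_db.sock'."""
    proc = subprocess.Popen(
        ['ovsdb-client', 'monitor', sb_remote,
         'OVN_Southbound', 'Port_Binding', 'logical_port'],
        stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        # Output parsing is schematic: with the default table format,
        # insert updates carry the action "insert" followed by the
        # logical_port value.
        if ' insert ' in line:
            report(line.split()[-1].strip('"'), time.time())
```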

3. Change the ovn-tester code to not have waits in the middle of tests. If we have the tools to measure times appropriately, then ovn-tester can operate by hammering OVN with commands. After the test is complete, we can then gather the statistics and log them.

Agreed.

[1] The usefulness of timing pings may disappear when asyncio code is merged. The asyncio PR removes all pings that were used to determine when a port comes up. Instead, it uses ovn-installed-ts. The remaining use case for pings is for testing ACLs, and in that case, the fact the pings succeed is what's important, not how long it takes for the pings to start succeeding.

I don't agree with the second part of this statement. We actually are interested in "how long it takes" to enforce a network policy (i.e., the time it takes for traffic to be allowed by an ACL).

If we're interested in how long it takes to enforce a network policy, then that's fine. But I still think that pings are not the best tool for it. We should be able to tell that an ACL is "activated" through other means, such as checking for flows to be installed (as you hinted at previously), scraping logs, or some new mechanism. The ping is a good follow-up to ensure that the ACL is actually being enforced.