Closed italovalcy closed 1 year ago
@italovalcy, great catch, much appreciated you looking into this.
> Maybe we can improve OF_Core in two ways:
> - Reset the list of pending multipart requests upon switch reconnect (I'm not sure the multipart requests will be answered after switch reconnection)
This upfront reset would be very beneficial I believe, minimizing the chances of discarding a subsequent iteration. Maybe we could reset the pending multipart requests whenever a connection is lost, subscribing to '.*.connection.lost' like topology does?
> - Add a handler for connection error and remove the XID from the list of multipart requests waiting to be answered (which is used to check for overlapping requests)
I think we'll also need to subscribe to kytos/core.openflow.connection.error since (OSError, SocketError) can be raised as a result. Having this as well would help to cover cases where a socket crashed mid transmission but hasn't closed. It seems that asyncio.Protocol connection lost will catch most things though.
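For illustration, the two enhancements could look roughly like the sketch below. It is not of_core code: PendingMultipartRequests and its method names are hypothetical, and it only assumes that pending requests are tracked per switch by XID. reset() would be called on connection lost or reconnect, and discard() on a connection error that carries the XID.

```python
from collections import defaultdict


class PendingMultipartRequests:
    """Hypothetical tracker of multipart request XIDs awaiting a reply."""

    def __init__(self):
        self._pending = defaultdict(set)  # dpid -> set of pending XIDs

    def add(self, dpid, xid):
        """Record a request that was handed over for sending."""
        self._pending[dpid].add(xid)

    def has_pending(self, dpid):
        """Overlap check: True while an earlier request is still unanswered."""
        return bool(self._pending.get(dpid))

    def discard(self, dpid, xid):
        """Connection error: forget only the request that failed to be sent."""
        self._pending[dpid].discard(xid)

    def reset(self, dpid):
        """Connection lost or reconnect: forget every request of the switch."""
        self._pending.pop(dpid, None)


tracker = PendingMultipartRequests()
tracker.add("sw01", 0x10)
tracker.discard("sw01", 0x10)  # the error handler removes the failed XID
tracker.add("sw03", 0x11)
tracker.reset("sw03")          # the connection-lost handler clears the switch
print(tracker.has_pending("sw01"), tracker.has_pending("sw03"))
```

With either handler in place, the overlap check stops reporting a stale request, so the next stats iteration is not skipped.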
Thank you very much for the feedback, @viniarck. Comments inline:
> This upfront reset would be very beneficial I believe, minimizing the chances of discarding a subsequent iteration. Maybe we could reset the pending multipart requests whenever a connection is lost, subscribing to '.*.connection.lost' like topology does?
Yes, sounds like a good enhancement!
> I think we'll also need to subscribe to kytos/core.openflow.connection.error since (OSError, SocketError) can be raised as a result. Having this as well would help to cover cases where a socket crashed mid transmission but hasn't closed. It seems that asyncio.Protocol connection lost will catch most things though.
Agreed! Listening to connection.error would make of_core much more robust in corner cases like the one we saw in the e2e tests. Then we can keep evolving the actions that handle errors as we learn throughout the process (for now, resetting the pending multipart requests seems to be one action to take; maybe we can come up with other actions later).
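A minimal sketch of the subscription idea, assuming a regex-based dispatcher. EventDispatcher and its listen_to are hypothetical stand-ins written for this example, not the Kytos API, and the exact matching rule (full match) is an assumption. Only kytos/core.openflow.connection.error and the '.*.connection.lost' pattern come from the discussion; the concrete "lost" event name in the demo is assumed.

```python
import re


class EventDispatcher:
    """Hypothetical stand-in for an event bus; not the Kytos API."""

    def __init__(self):
        self._listeners = []  # list of (compiled pattern, callback)

    def listen_to(self, pattern):
        """Register the decorated function for event names matching pattern."""
        def decorator(func):
            self._listeners.append((re.compile(pattern), func))
            return func
        return decorator

    def dispatch(self, name, content):
        """Call every matching listener; return the names of those called."""
        called = []
        for pattern, callback in self._listeners:
            if pattern.fullmatch(name):
                callback(content)
                called.append(callback.__name__)
        return called


dispatcher = EventDispatcher()
pending = {"sw01": {10, 11}}  # dpid -> XIDs awaiting a multipart reply


@dispatcher.listen_to(".*.connection.lost")
def on_connection_lost(content):
    # Connection gone: replies are unlikely to arrive, so drop the switch entry.
    pending.pop(content["dpid"], None)


@dispatcher.listen_to("kytos/core.openflow.connection.error")
def on_connection_error(content):
    # Send failed: drop only the XID that could not be delivered.
    pending.get(content["dpid"], set()).discard(content["xid"])


dispatcher.dispatch("kytos/core.openflow.connection.error",
                    {"dpid": "sw01", "xid": 10})
# Assumed event name for the demo; it matches the '.*.connection.lost' pattern.
dispatcher.dispatch("kytos/core.openflow.connection.lost", {"dpid": "sw01"})
```

Keeping each reaction in its own handler makes it easy to add further actions later without touching the existing ones.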
Right. Let's add these two enhancements then.
@italovalcy, I'll go ahead and assign this one to myself. I have capacity, and I think it's doable to ship it in this release. It's also aligned with the enhancements/stabilization work we have in this version for things that can impact e2e, so let's go for it.
Hi,
I was investigating the issue on the end-to-end tests, as shown below:
And I realized that the error above happened because the consistency check routine didn't execute as expected to remove the additional flow on table=2:
As we can see from the output above, the consistency check was executed for switches 01 and 03 only at 06:21:01. It should have run again around 06:21:13, but it didn't.
If we look for the time when switches reconnected:
The switches reconnected at 06:21:13, but only switch 02 could perform the flow stats request successfully.
Checking for overlapping multipart requests, we can clearly see that sw01 and sw03 got stuck in a pending request:
Looking for the transaction ID, we can confirm that of_core was waiting on this request, but the request was never sent to the switch, probably because the reconnection happened at the same time:
It is interesting to note that flow_manager captured the connection.error with the XID, but since it was from a FlowStats Req (not FlowMod), obviously flow_manager discarded the error.
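The failure sequence described above can be reproduced in miniature. This is only a sketch under assumptions, not the actual of_core logic: FlowStatsPoller, flaky_send and the short switch names are hypothetical, and the failed send is simulated with an OSError.

```python
import itertools


class FlowStatsPoller:
    """Hypothetical periodic requester with an overlap check per switch."""

    def __init__(self, send):
        self._send = send                 # callable(dpid, xid); may raise OSError
        self._pending = {}                # dpid -> XID awaiting a reply
        self._xids = itertools.count(1)   # simple XID generator

    def poll(self, dpid):
        """Return the XID requested, or None when the iteration is skipped."""
        if dpid in self._pending:
            return None  # overlapping request: skip this iteration
        xid = next(self._xids)
        self._pending[dpid] = xid
        try:
            self._send(dpid, xid)
        except OSError:
            # Illustrates the gap: the send failed, yet the XID stays pending.
            pass
        return xid

    def on_reply(self, dpid, xid):
        """Clear the pending entry once the matching reply arrives."""
        if self._pending.get(dpid) == xid:
            del self._pending[dpid]


def flaky_send(dpid, xid):
    # Assumed scenario: sw01's socket breaks while the switch is reconnecting.
    if dpid == "sw01":
        raise OSError("connection reset mid transmission")


poller = FlowStatsPoller(flaky_send)
first = {dpid: poller.poll(dpid) for dpid in ("sw01", "sw02")}
poller.on_reply("sw02", first["sw02"])  # sw02 answers normally
second = {dpid: poller.poll(dpid) for dpid in ("sw01", "sw02")}
print(second)  # {'sw01': None, 'sw02': 3}
```

Here sw01 is skipped on every later iteration because nothing ever removes its XID, while sw02 keeps working, which mirrors what the logs showed.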
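Maybe we can improve OF_Core in two ways:
- Reset the list of pending multipart requests upon switch reconnect (I'm not sure the multipart requests will be answered after switch reconnection)
- Add a handler for connection error and remove the XID from the list of multipart requests waiting to be answered (which is used to check for overlapping requests)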