ipfs / go-graphsync

Initial Implementation Of GraphSync Wire Protocol
Other
100 stars 38 forks source link

Racy responsemanager tests (diagnosis, needing a soluton) #273

Closed rvagg closed 2 years ago

rvagg commented 2 years ago

Two racy tests showed up in https://github.com/ipfs/go-graphsync/pull/244:

The thing these have in common is this pattern:

  1. ProcessRequests
  2. .. some "completion" condition involving response checking
  3. ProcessRequests
  4. expect certain outcome

In the case of the first test above, it processes 3 blocks and during the last block submits a "paused" update. The "completion" condition is receiving those 3 blocks. In the second test above, it runs an empty request that fails and the "completion" condition is a certain response error code. The both then go on to register hooks and run ProcessRequests again and assert that the new hook did a thing.

But the important bit is the completion condition doesn't quite get to actual completion, because they both involve looking at the responses, which are not quite the end of a response execution.

In the first test, we can watch for 3 blocks being sent, but then queryexecutor goes on to push through a FinishTask with the ErrPaused error that switches response.state = paused which is the required condition for the rest of the test to pass. But occasionally, "receivedNBlocks" returns and the test continues quick enough that we get to the state quick enough to have a proper paused condition. I haven't looked into exactly why the timing is a problem for the second test but my bet is that it's a similar situation where the time between response received and continuing the test allows an occasional race where the setup doesn't quite make it.

I'm not sure what to do about this, but it seems to me that we need a better "completion" condition than watching responses, we want to get through to the complete end of a queryexecutor execution as well.

rvagg commented 2 years ago

I thought https://github.com/ipfs/go-graphsync/pull/284 would fix this but it doesn't, I've tried inserting it before the failing assertions get involved to no avail. Still rare, but they seem to be easier to repro in CI than locally, probably to do with hardware differences.

What I've now noticed, is that these tests are failing @ ~10 seconds, and the context involved here has a timeout of 10 seconds. So this is a failure to execute, and perhaps does indicate problems beyond the scope of just testing, maybe we have a condition whereby a request can get stuck in the queue?

rvagg commented 2 years ago

closed by #287