Issue1 (rebased on Issue2)

kgiusti commented 7 years ago

This needs more work in the case of cast.

A race condition exists, specifically:

1) clients send casts, then make an RPC call to the controller to indicate all casts sent
2) controller waits for all clients to report in
3) then controller casts to servers to have them report their results

Since casts are sent async, the client can complete sending before all cast messages have been consumed by servers (e.g. some are still in flight).

When step 3 occurs there may still be casts in flight. Should the servers receive the report message from the controller before all casts have arrived, the results returned will not take into account any casts that are still in flight and arrive after the results have been reported.
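To make the window concrete, here is a minimal, deterministic sketch of the race. It is a toy model, not ombt code: `Server`, `consume`, `report`, and `simulate_race` are hypothetical names, and a plain `deque` stands in for the message bus.

```python
from collections import deque


class Server:
    """Toy stand-in for a server that counts the casts it consumes."""

    def __init__(self):
        self.received = 0

    def consume(self, queue, limit):
        """Consume up to `limit` messages that are currently in flight."""
        for _ in range(min(limit, len(queue))):
            queue.popleft()
            self.received += 1

    def report(self):
        """Return the count as of now, as the controller's request would see it."""
        return self.received


def simulate_race(total_casts=10, delivered_before_report=6):
    # The client has finished *sending*; nothing says the casts were consumed.
    in_flight = deque(range(total_casts))
    server = Server()
    server.consume(in_flight, delivered_before_report)
    reported = server.report()                 # step 3 reaches the server here
    server.consume(in_flight, len(in_flight))  # stragglers arrive afterwards
    return reported, server.received


reported, actual = simulate_race()
print(reported, actual)  # reported count is lower than what finally arrived
```

The reported figure misses every cast that was still in flight when the report request arrived, which is exactly the undercount described above.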

The difficulty is that servers are unable to definitively determine when there is no more pending cast traffic. Given multiple servers subscribed to the same topic it's possible for a server to get all, some, or no casts depending on consumer scheduling, flow variations and other non-deterministic factors.

Need to think a bit more on this....

msimonin commented 7 years ago

I see your point :) As far as I understand the code, the controller first builds a view of the system by counting the minions (the clients and servers in play). If this view is accurate, I have the feeling that the controller can know the exact number of messages that all the servers should receive (clients * count, or clients * count * servers for fanout RPC). If the server stats include the number of received messages, the controller could retry query_servers(RPC_SERVER) until the total number of messages received matches the expected total, or until a timeout. In the latter case, the missing messages could be declared lost :(.
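A rough sketch of that idea, under stated assumptions: `expected_total` and `wait_for_expected` are hypothetical helpers, and the `query_servers` argument is a stand-in callable that returns one received-message count per server. It is not the actual signature of ombt's `query_servers`.

```python
import time


def expected_total(clients, count, servers, fanout=False):
    """Messages all servers should receive in total, per the formula above."""
    per_run = clients * count
    return per_run * servers if fanout else per_run


def wait_for_expected(query_servers, expected, timeout=5.0, interval=0.1):
    """Poll until the servers have received `expected` messages or time runs out.

    `query_servers` is an assumed callable returning a list of per-server
    received counts. Returns (received, lost).
    """
    deadline = time.monotonic() + timeout
    while True:
        received = sum(query_servers())
        if received >= expected:
            return received, 0
        if time.monotonic() >= deadline:
            # Whatever has not arrived by now is declared lost.
            return received, expected - received
        time.sleep(interval)


# Usage with canned replies in place of real servers:
replies = iter([[5, 5], [10, 10], [15, 15]])
print(wait_for_expected(lambda: next(replies), expected_total(3, 10, 2), interval=0))
```

This only works if the controller's view of the minions is accurate, since the expected total is derived from it.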

kgiusti commented 7 years ago

I've submitted a fix for this as part of this refactoring: https://github.com/kgiusti/ombt/commit/3d74b8d28bef55b09035bf0676ad119e8726dd9a

In this approach the controller periodically polls the servers after the clients have completed the test run. The poll repeats until all servers stop reporting new message arrivals - at that point it's pretty safe to consider the test completed.
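For readers who don't want to dig through the commit, the polling idea looks roughly like the sketch below. This is an illustration only, not the code from the commit: `wait_until_quiet`, its parameters, and the `query_servers` callable (assumed to return one received-message count per server) are all hypothetical.

```python
import time


def wait_until_quiet(query_servers, interval=1.0, quiet_polls=1, max_polls=100):
    """Poll the servers until their counts stop changing.

    Returns the final per-server counts once `quiet_polls` consecutive polls
    show no new arrivals. Raises TimeoutError after `max_polls` polls.
    """
    last = None
    stable = 0
    for _ in range(max_polls):
        counts = list(query_servers())
        if counts == last:
            stable += 1
            if stable >= quiet_polls:
                return counts
        else:
            # New messages arrived since the previous poll; start over.
            stable = 0
            last = counts
        time.sleep(interval)
    raise TimeoutError("servers were still receiving messages")


# Usage with canned replies in place of real servers:
replies = iter([[3, 1], [5, 4], [6, 4], [6, 4]])
print(wait_until_quiet(lambda: next(replies), interval=0))
```

A quiet interval is a heuristic rather than a guarantee: a cast delayed longer than the polling interval would still be missed, so `interval` and `quiet_polls` trade test duration against that risk.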

There's also more error information that is passed back from the servers and clients which will make it easier to report error conditions.

This was a major refactor and further testing is needed, so let me know if (when) you hit an issue.

msimonin commented 7 years ago

Great refactor @kgiusti !
