open-telemetry / opamp-go

OpAMP protocol implementation in Go
Apache License 2.0
141 stars 69 forks source link

Fix flaky TestAgentIdentification/ws #216

Closed haoqixu closed 2 months ago

haoqixu commented 9 months ago

Resolve #191

TestAgentIdentification executes the following steps to test the client:

  1. prepareClient(): setup settings and call SetAgentDescription of client which schedules an message sending.
  2. client.Start(): send the first message (a) launch an goroutine to handle scheduled message sending. (b) start the receiver loop to handle messages from server.
  3. run the first eventually assert.
  4. client.SetAgentDescription() to schedule an message sending.
  5. run the second eventually assert.

scheduled message sending in (2.a) and messages handling (2.b) are running concurrently.

Case 1: the timeline is as follows:

Case 2: the timeline is as follows:

The solution here is replacing atomic.Value with a buffered channel to avoid case 1 and waiting for the first response from server to avoid case 2.

codecov[bot] commented 9 months ago

Codecov Report

Attention: 129 lines in your changes are missing coverage. Please review.

Comparison is base (efddaa2) 76.11% compared to head (71b4d40) 72.43%. Report is 31 commits behind head on main.

Files Patch % Lines
internal/certs.go 0.00% 118 Missing :warning:
client/internal/httpsender.go 88.00% 2 Missing and 1 partial :warning:
client/internal/mockserver.go 85.71% 2 Missing and 1 partial :warning:
server/httpconnection.go 0.00% 2 Missing :warning:
client/internal/clientcommon.go 90.90% 1 Missing :warning:
client/internal/packagessyncer.go 0.00% 1 Missing :warning:
server/serverimpl.go 95.00% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #216 +/- ## ========================================== - Coverage 76.11% 72.43% -3.69% ========================================== Files 24 25 +1 Lines 1834 2017 +183 ========================================== + Hits 1396 1461 +65 - Misses 326 446 +120 + Partials 112 110 -2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

srikanthccv commented 9 months ago

I am trying to follow the case 1 timeline you mentioned.

Case 1: the timeline is as follows:

  • (1) calls c.SetAgentDescription to schedule an message sending.
  • (2) send the first message to the server.
  • (2.b) response from server is handled by the receiver loop, which update the instance ID of scheduled message.
  • (2.a) handle scheduled message and send it to the server, server store the new instance ID to rcvAgentInstanceUid.
  • (3) the first eventually assert fails to get the original instance ID

2.b is triggered by the client sending the scheduled message (from 2.a step) so why does the first assertion fail?

haoqixu commented 9 months ago

I try to draw a sequence diagram of case 1:

sequenceDiagram
participant Client
participant Client.Receiver
participant Client.Sender
participant Server
    Client->>Client.Sender: (1) c.SetAgentDescription in `prepareClient` call `sender.ScheduleSend()`
    Client->>Server: (2) AgentToServer with original UID (via s.sendNextMessage in sender.Start)
alt
    Server->>Server: store the original instance ID to rcvAgentInstanceUid
end
    Client-->>Client.Sender: start the loop
    Client-->>Client.Receiver: start the loop
    Server->>Client.Receiver: (2.b)ServerToAgent with new UID
    Client.Receiver->>Client.Sender: Update the pendding Message with new UID
    Client.Sender->>Server: (2.a) send the pending scheduled message with new UID
alt Oops!
    Server->>Server: store the new instance ID to rcvAgentInstanceUid
end
   Server ->> Client.Receiver: reply ...
   Client->>Client: run the first eventually assertion and fail

2.b will update the message on processing the ServerToAgent message. If this happends before sender sending the pending message, rcvAgentInstanceUid will be updated with the new UID.

srikanthccv commented 9 months ago

There are two times we can see when the ScheduleSend is invoked which sends the message to the Server. The first one is from the SetAgentDescription in prepareClient and then from the SetAgentDescription of Send a dummy message. The second dummy message is only sent after the first assertion is completed. So my question is why does the first assertion fail because, for rcvAgentInstanceUid to have the new UID, the dummy message should have been sent which can't be true because the dummy message is only sent after the the first asseetion passes. Or am I missing something?

haoqixu commented 9 months ago

the dummy message is only sent after the the first asseetion passes

assert.Eventually() does not terminate the test even the assertion has failed. So we can always see the dummy message sent even the first assertion has failed.

srikanthccv commented 9 months ago

I am not sure I follow. What does this PR attempt to fix then? The first assertion is expected to always pass. If you look at the last failed runs in the issue, they point out Condition never satisfied at the first assertion. I think any fix should involve figuring out why it failed.

haoqixu commented 9 months ago

The first assertion is used to check whether the server has received the original uid. In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.

This PR replaces the atomic.Value with a buffered channel to record uids received at server side to avoid this case.

srikanthccv commented 9 months ago

In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.

How did you arrive at this conclusion? It could be because no value was set to rcvAgentInstanceUid and the assertion failed. I don't see how an override is the reason for the first assertion failure unless I am missing something.

haoqixu commented 9 months ago

It could be because no value was set to rcvAgentInstanceUid and the assertion failed.

If no value was set to rcvAgentInstanceUid, both the eventually assertions will failed. In the failed runs in the issue, only the first assertion failed.

haoqixu commented 9 months ago

https://github.com/haoqixu/opamp-go/tree/debug-pr-216 I have submitted some test log to this branch.

haoqixu commented 9 months ago
2023/12/05 23:33:17 test begin
2023/12/05 23:33:17 client: SetAgentDescription and ScheduleSend
2023/12/05 23:33:17 old instance uid is 01HGX9SCX1EQPHGGJ9X6Y1JSTY, starting client
2023/12/05 23:33:17 client: send message at Start
2023/12/05 23:33:17 server: store uid 01HGX9SCX1EQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:33:17 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 client: sending pending message
2023/12/05 23:33:17 server: store uid 01HGX9SCX106AFVGQT5ZYC0GEK to rcvAgentInstanceUid  <------- this line
2023/12/05 23:33:17 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
(... first assert message ...)
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 send dummy message
2023/12/05 23:33:22 client: SetAgentDescription and ScheduleSend
2023/12/05 23:33:22 client: sending pending message
2023/12/05 23:33:22 server: store uid 01HGX9SCX106AFVGQT5ZYC0GEK to rcvAgentInstanceUid
2023/12/05 23:33:22 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 second assert: want 01HGX9SCX106AFVGQT5ZYC0GEK, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 test end
--- FAIL: TestAgentIdentification (5.06s)
    --- FAIL: TestAgentIdentification/ws (5.04s)
        clientimpl_test.go:94: 
                Error Trace:    clientimpl_test.go:94
                                            clientimpl_test.go:653
                                            clientimpl_test.go:77
                Error:          Condition never satisfied
                Test:           TestAgentIdentification/ws

This is the Case 1.

haoqixu commented 9 months ago
2023/12/05 23:35:01 test begin
2023/12/05 23:35:01 client: SetAgentDescription and ScheduleSend
2023/12/05 23:35:01 old instance uid is 01HGX9WJPKEQPHGGJ9X6Y1JSTY, starting client
2023/12/05 23:35:01 client: send message at Start
2023/12/05 23:35:01 client: sending pending message
2023/12/05 23:35:01 server: store uid 01HGX9WJPKEQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:35:01 first assert: want 01HGX9WJPKEQPHGGJ9X6Y1JSTY, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 send dummy message
2023/12/05 23:35:01 client: SetAgentDescription and ScheduleSend
2023/12/05 23:35:01 client: sending pending message
2023/12/05 23:35:01 client: set instance uid of pending message 01HGX9WJPK06AFVGQT5ZYC0GEK   <------- this line
2023/12/05 23:35:01 server: store uid 01HGX9WJPKEQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:35:01 client: set instance uid of pending message 01HGX9WJPK06AFVGQT5ZYC0GEK
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
(... second assert ...)
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 test end
--- FAIL: TestAgentIdentification (5.02s)
    --- FAIL: TestAgentIdentification/ws (5.02s)
        clientimpl_test.go:94: 
                Error Trace:    clientimpl_test.go:94
                                            clientimpl_test.go:672
                                            clientimpl_test.go:77
                Error:          Condition never satisfied
                Test:           TestAgentIdentification/ws

This is the Case 2.

srikanthccv commented 9 months ago

Were you able to easily reproduce the original issue? Can you share how did you reproduce?

haoqixu commented 9 months ago

I can reproduce it by running go test -run TestAgentIdentification/ws -race -count 10000 ./... and I have submitted the output here: https://github.com/haoqixu/opamp-go/tree/debug-pr-216/client/testlog

srikanthccv commented 9 months ago

Thanks, I will take a look.

srikanthccv commented 9 months ago

Sorry for the lack of response here for some time. I have been sick and swamped with a lot of work at $job. I will find time to get back to review PRs soon.

srikanthccv commented 9 months ago

The first assertion is used to check whether the server has received the original uid. In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.

Coming back to this point, this says the first assertion failed because id was overwritten by second AgentToServer. However, the second AgentToServer is the dummy message which is sent after the eventually which is blocking so my reasoning and question is how does the second message cause the first assertion failure because they are sequential instructions. If we look at the code we expect the second assertion to override but not simultaneously with the first assertion.

haoqixu commented 9 months ago

the second AgentToServer is the dummy message

The second AgentToServer is not the dummy message. It is the pending message scheduled by SetAgentDescription.

图片
srikanthccv commented 9 months ago

Thanks, I went through the ws sender code again. I think the additional sendNextMessage invocation shouldn't be there in ws sender https://github.com/open-telemetry/opamp-go/blob/ca0e1e68754fdf9b131419e0629ed2d99bdae713/client/internal/wssender.go#L34-L42 All the client messages that should be sent to server would already have called the ScheduleSend and we expect the SetAgentDescription must be called before Start which schedules the message. I believe this is why we don't see the same flaky behaviour in the http client.

haoqixu commented 9 months ago

I believe this is why we don't see the same flaky behaviour in the http client.

Yes. But I don't think test code should treat the above cases as a failure. Even with this extra message, the client uses the original ID first and then sends subsequent messages with the new ID after receiving the reply as expected. I think we should update the test code so that the above two cases pass. I would like to submit anohter PR to avoid the additional message.

srikanthccv commented 9 months ago

The test is written with certain assumptions and it failed because of the additional message, if we fix that test would pass and we don't need to fix any tests.

haoqixu commented 9 months ago

Even with this extra message, the client uses the original ID first and then sends subsequent messages with the new ID after receiving the reply as expected

I don't think the extra message is incorrect (although it could definitely be improved). The observable behavior of the client is as expected, but our detection method cannot handle it and fail to produce the same outcome with each individual test run.

haoqixu commented 9 months ago

if we fix that test would pass and we don't need to fix any tests.

Case 2 is not caused by the additional message and we still need to need the test.

haoqixu commented 9 months ago

This test is flaky, it usually succeeds but occasionally fails. If we want to avoid extra messages, it is not enough to prove it, we still need a separate test. So I prefer that this flaky test code should be fixed first to make its results consistent.

tigrannajaryan commented 2 months ago

It has been a while and the test has been changed since it failed and does not appear to be failing any more, see here.

Closing this PR, but we can reopen if we see failures again.