Closed haoqixu closed 2 months ago
Attention: 129 lines
in your changes are missing coverage. Please review.
Comparison is base (
efddaa2
) 76.11% compared to head (71b4d40
) 72.43%. Report is 31 commits behind head on main.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I am trying to follow the case 1 timeline you mentioned.
Case 1: the timeline is as follows:
- (1) calls
c.SetAgentDescription
to schedule an message sending.- (2) send the first message to the server.
- (2.b) response from server is handled by the receiver loop, which update the instance ID of scheduled message.
- (2.a) handle scheduled message and send it to the server, server store the new instance ID to rcvAgentInstanceUid.
- (3) the first
eventually
assert fails to get the original instance ID
2.b is triggered by the client sending the scheduled message (from 2.a step) so why does the first assertion fail?
I try to draw a sequence diagram of case 1:
sequenceDiagram
participant Client
participant Client.Receiver
participant Client.Sender
participant Server
Client->>Client.Sender: (1) c.SetAgentDescription in `prepareClient` call `sender.ScheduleSend()`
Client->>Server: (2) AgentToServer with original UID (via s.sendNextMessage in sender.Start)
alt
Server->>Server: store the original instance ID to rcvAgentInstanceUid
end
Client-->>Client.Sender: start the loop
Client-->>Client.Receiver: start the loop
Server->>Client.Receiver: (2.b)ServerToAgent with new UID
Client.Receiver->>Client.Sender: Update the pendding Message with new UID
Client.Sender->>Server: (2.a) send the pending scheduled message with new UID
alt Oops!
Server->>Server: store the new instance ID to rcvAgentInstanceUid
end
Server ->> Client.Receiver: reply ...
Client->>Client: run the first eventually assertion and fail
2.b will update the message on processing the ServerToAgent message. If this happends before sender sending the pending message, rcvAgentInstanceUid will be updated with the new UID.
There are two times we can see when the ScheduleSend is invoked which sends the message to the Server. The first one is from the SetAgentDescription
in prepareClient and then from the SetAgentDescription
of Send a dummy message. The second dummy message is only sent after the first assertion is completed. So my question is why does the first assertion fail because, for rcvAgentInstanceUid to have the new UID, the dummy message should have been sent which can't be true because the dummy message is only sent after the the first asseetion passes. Or am I missing something?
the dummy message is only sent after the the first asseetion passes
assert.Eventually()
does not terminate the test even the assertion has failed. So we can always see the dummy message sent even the first assertion has failed.
I am not sure I follow. What does this PR attempt to fix then? The first assertion is expected to always pass. If you look at the last failed runs in the issue, they point out Condition never satisfied at the first assertion. I think any fix should involve figuring out why it failed.
The first assertion is used to check whether the server has received the original uid. In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid
(atomic.Value) was overwritten by the second AgentToServer message.
This PR replaces the atomic.Value with a buffered channel to record uids received at server side to avoid this case.
In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.
How did you arrive at this conclusion? It could be because no value was set to rcvAgentInstanceUid
and the assertion failed. I don't see how an override is the reason for the first assertion failure unless I am missing something.
It could be because no value was set to
rcvAgentInstanceUid
and the assertion failed.
If no value was set to rcvAgentInstanceUid
, both the eventually
assertions will failed. In the failed runs in the issue, only the first assertion failed.
https://github.com/haoqixu/opamp-go/tree/debug-pr-216 I have submitted some test log to this branch.
2023/12/05 23:33:17 test begin
2023/12/05 23:33:17 client: SetAgentDescription and ScheduleSend
2023/12/05 23:33:17 old instance uid is 01HGX9SCX1EQPHGGJ9X6Y1JSTY, starting client
2023/12/05 23:33:17 client: send message at Start
2023/12/05 23:33:17 server: store uid 01HGX9SCX1EQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:33:17 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 client: sending pending message
2023/12/05 23:33:17 server: store uid 01HGX9SCX106AFVGQT5ZYC0GEK to rcvAgentInstanceUid <------- this line
2023/12/05 23:33:17 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
(... first assert message ...)
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 send dummy message
2023/12/05 23:33:22 client: SetAgentDescription and ScheduleSend
2023/12/05 23:33:22 client: sending pending message
2023/12/05 23:33:22 server: store uid 01HGX9SCX106AFVGQT5ZYC0GEK to rcvAgentInstanceUid
2023/12/05 23:33:22 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 second assert: want 01HGX9SCX106AFVGQT5ZYC0GEK, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 test end
--- FAIL: TestAgentIdentification (5.06s)
--- FAIL: TestAgentIdentification/ws (5.04s)
clientimpl_test.go:94:
Error Trace: clientimpl_test.go:94
clientimpl_test.go:653
clientimpl_test.go:77
Error: Condition never satisfied
Test: TestAgentIdentification/ws
This is the Case 1.
2023/12/05 23:35:01 test begin
2023/12/05 23:35:01 client: SetAgentDescription and ScheduleSend
2023/12/05 23:35:01 old instance uid is 01HGX9WJPKEQPHGGJ9X6Y1JSTY, starting client
2023/12/05 23:35:01 client: send message at Start
2023/12/05 23:35:01 client: sending pending message
2023/12/05 23:35:01 server: store uid 01HGX9WJPKEQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:35:01 first assert: want 01HGX9WJPKEQPHGGJ9X6Y1JSTY, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 send dummy message
2023/12/05 23:35:01 client: SetAgentDescription and ScheduleSend
2023/12/05 23:35:01 client: sending pending message
2023/12/05 23:35:01 client: set instance uid of pending message 01HGX9WJPK06AFVGQT5ZYC0GEK <------- this line
2023/12/05 23:35:01 server: store uid 01HGX9WJPKEQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:35:01 client: set instance uid of pending message 01HGX9WJPK06AFVGQT5ZYC0GEK
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
(... second assert ...)
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 test end
--- FAIL: TestAgentIdentification (5.02s)
--- FAIL: TestAgentIdentification/ws (5.02s)
clientimpl_test.go:94:
Error Trace: clientimpl_test.go:94
clientimpl_test.go:672
clientimpl_test.go:77
Error: Condition never satisfied
Test: TestAgentIdentification/ws
This is the Case 2.
Were you able to easily reproduce the original issue? Can you share how did you reproduce?
I can reproduce it by running go test -run TestAgentIdentification/ws -race -count 10000 ./...
and I have submitted the output here: https://github.com/haoqixu/opamp-go/tree/debug-pr-216/client/testlog
Thanks, I will take a look.
Sorry for the lack of response here for some time. I have been sick and swamped with a lot of work at $job. I will find time to get back to review PRs soon.
The first assertion is used to check whether the server has received the original uid. In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.
Coming back to this point, this says the first assertion failed because id was overwritten by second AgentToServer
. However, the second AgentToServer
is the dummy message which is sent after the eventually
which is blocking so my reasoning and question is how does the second message cause the first assertion failure because they are sequential instructions. If we look at the code we expect the second assertion to override but not simultaneously with the first assertion.
the second
AgentToServer
is the dummy message
The second AgentToServer
is not the dummy message. It is the pending message scheduled by SetAgentDescription
.
Thanks, I went through the ws sender code again. I think the additional sendNextMessage
invocation shouldn't be there in ws sender https://github.com/open-telemetry/opamp-go/blob/ca0e1e68754fdf9b131419e0629ed2d99bdae713/client/internal/wssender.go#L34-L42 All the client messages that should be sent to server would already have called the ScheduleSend
and we expect the SetAgentDescription
must be called before Start
which schedules the message. I believe this is why we don't see the same flaky behaviour in the http client.
I believe this is why we don't see the same flaky behaviour in the http client.
Yes. But I don't think test code should treat the above cases as a failure. Even with this extra message, the client uses the original ID first and then sends subsequent messages with the new ID after receiving the reply as expected. I think we should update the test code so that the above two cases pass. I would like to submit anohter PR to avoid the additional message.
The test is written with certain assumptions and it failed because of the additional message, if we fix that test would pass and we don't need to fix any tests.
Even with this extra message, the client uses the original ID first and then sends subsequent messages with the new ID after receiving the reply as expected
I don't think the extra message is incorrect (although it could definitely be improved). The observable behavior of the client is as expected, but our detection method cannot handle it and fail to produce the same outcome with each individual test run.
if we fix that test would pass and we don't need to fix any tests.
Case 2 is not caused by the additional message and we still need to need the test.
This test is flaky, it usually succeeds but occasionally fails. If we want to avoid extra messages, it is not enough to prove it, we still need a separate test. So I prefer that this flaky test code should be fixed first to make its results consistent.
It has been a while and the test has been changed since it failed and does not appear to be failing any more, see here.
Closing this PR, but we can reopen if we see failures again.
Resolve #191
TestAgentIdentification executes the following steps to test the client:
prepareClient()
: setup settings and callSetAgentDescription
of client which schedules an message sending.client.Start()
: send the first message (a) launch an goroutine to handle scheduled message sending. (b) start the receiver loop to handle messages from server.eventually
assert.client.SetAgentDescription()
to schedule an message sending.eventually
assert.scheduled message sending in (2.a) and messages handling (2.b) are running concurrently.
Case 1: the timeline is as follows:
c.SetAgentDescription
to schedule an message sending.eventually
assert fails to get the original instance IDCase 2: the timeline is as follows:
c.SetAgentDescription
to schedule an message sending.eventually
assert trys to get the original instance ID and will return successfully after server has stored the instance ID in the first message to rcvAgentInstanceUid.client.SetAgentDescription()
to schedule an message sendingeventually
assert failed to get the new instance uid.The solution here is replacing atomic.Value with a buffered channel to avoid case 1 and waiting for the first response from server to avoid case 2.