Closed: maxtruxa closed this pull request 3 years ago
For some reason the test case fails in CI when it doesn't on my machine... I won't get around to investigating this further until sometime this weekend.
> The previous behavior left a potential "CloseSend" frame pending on the manager's read queue. This delayed detection of a potential connection loss until the next RPC is started on the connection, which discards the old frame because of the stream ID mismatch.
I don't understand. The manager will continue to forward messages to any stream that is not yet terminated, so even if a CloseSend is buffered to be delivered, it is fine to be dropped, because the only thing a CloseSend can do is terminate a stream, and the stream must have already been terminated.
If you run any drpc code with the build tag -tags=debug, it outputs a log of messages that were delivered. It looks something like this:
```
STR[0xc0000a91e0][1]: CloseSend()
MAN[0xc00011c000]: <s:1 m:1 kind:Invoke data:24>
MAN[0xc00011c000]: <s:1 m:2 kind:Message data:2>
STR[0xc000140000][1]: <s:1 m:2 kind:Message data:2>
MAN[0xc0000fe6c0]: <s:1 m:1 kind:Message data:2>
STR[0xc0000a91e0][1]: <s:1 m:1 kind:Message data:2>
STR[0xc000140000][1]: CloseSend()
MAN[0xc00011c000]: <s:1 m:3 kind:CloseSend data:0>
STR[0xc000140000][1]: <s:1 m:3 kind:CloseSend data:0>
MAN[0xc0000fe6c0]: <s:1 m:2 kind:CloseSend data:0>
STR[0xc0000a91e0][1]: <s:1 m:2 kind:CloseSend data:0>
STR[0xc000140000][1]: Close()
STR[0xc0000a91e0][1]: Close()
```
Maybe that can be helpful in figuring out what's going on?
> For some reason the test case fails in CI when it doesn't on my machine... I won't get around to investigating this further until sometime this weekend.
This is because the test is racy. Try running it locally with go test ./drpcconn -count=100 -failfast to get it to fail. The race is this: the server portion can send a CloseSend; the client can receive it and, upon receiving it, transition into the Finished state (because there is no active read or write on the stream and both sides have signaled they will no longer send, so the stream is terminated); meanwhile the server test code has not yet called remoteClose, so the connection itself is not closed.
I don't think there's a way to fix this. Fundamentally, the underlying transport being closed is asynchronous with respect to any messages being sent or received.
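To make the timing concrete, here is a minimal standalone sketch of the same shape of race (none of these identifiers are drpc APIs): the client side sees the stream finish as soon as the final CloseSend is delivered, while the transport close happens later from another goroutine, so checking "is the transport closed?" right after the stream finishes can go either way.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var transportClosed atomic.Bool
	streamFinished := make(chan struct{})

	// "Server" side of the test: the stream terminates as soon as the final
	// CloseSend is delivered, but the transport close happens a bit later.
	go func() {
		close(streamFinished)
		time.Sleep(time.Millisecond) // the remote close has not happened yet
		transportClosed.Store(true)
	}()

	<-streamFinished
	// Racy assertion: depending on scheduling this prints true or false.
	fmt.Println("transport closed?", transportClosed.Load())
}
```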
The test would not be racy if I could find a way to assert what I really want to test.
I'll try to explain the issue with func (c *Conn) doInvoke a bit better.
When invoking an RPC, the conn sends a sequence of these frames: [InvokeMetadata], Invoke, Message, CloseSend. It then waits for the server to respond with a Message and exits. But the server doesn't send just a Message; it sends a CloseSend as well. If the conn doesn't wait for the stream to really finish (as indicated by CloseSend), the last frame is stuck in the manager's queue, because there is no active stream anymore. This prevents the manager from detecting that the transport was closed. The manager is only unblocked when a new stream is created.
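As a rough illustration of the difference, here's a standalone toy (not drpc's actual types or queueing): one variant returns as soon as the response Message arrives, the other also drains the trailing CloseSend so nothing is left queued for a stream that no longer exists.

```go
package main

import "fmt"

type frameKind int

const (
	kindMessage frameKind = iota
	kindCloseSend
)

type frame struct{ kind frameKind }

// returnEarly reads only the response Message; the server's CloseSend stays
// queued, which is what delayed detection of the transport closing.
func returnEarly(queue chan frame) {
	<-queue // Message
}

// drainUntilCloseSend also waits for the stream to really finish, consuming
// the CloseSend so nothing is left behind.
func drainUntilCloseSend(queue chan frame) {
	for f := range queue {
		if f.kind == kindCloseSend {
			return
		}
	}
}

func main() {
	q1 := make(chan frame, 2)
	q1 <- frame{kindMessage}
	q1 <- frame{kindCloseSend}
	returnEarly(q1)
	fmt.Println("frames left behind:", len(q1)) // 1: the stuck CloseSend

	q2 := make(chan frame, 2)
	q2 <- frame{kindMessage}
	q2 <- frame{kindCloseSend}
	drainUntilCloseSend(q2)
	fmt.Println("frames left behind:", len(q2)) // 0
}
```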
I rebased on top of #17 and used <-conn.Closed() to wait for the connection to close. This removes the race condition.
Without the change to doInvoke, the test would hang until go test decides to kill it. To prevent long hangs in case of a broken test, the channel read is aborted after one second. On my machine 10000 iterations take around 400 ms, so that timeout should be large enough to prevent false negatives.
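The wait looks roughly like the following. This is a hedged sketch, not the exact test from this PR (the real test also performs an RPC first to exercise the doInvoke change); it assumes drpcconn.New accepts a plain io.ReadWriteCloser as the transport and that Closed() returns a channel, as it does after #17.

```go
package example

import (
	"net"
	"testing"
	"time"

	"storj.io/drpc/drpcconn"
)

func TestDetectsTransportClose(t *testing.T) {
	client, server := net.Pipe()
	conn := drpcconn.New(client)
	defer conn.Close()

	// Simulate the remote side going away.
	_ = server.Close()

	// Block on the Closed channel, but abort after one second so a broken
	// build fails quickly instead of hanging until go test kills it.
	select {
	case <-conn.Closed():
	case <-time.After(time.Second):
		t.Fatal("connection close was not detected within one second")
	}
}
```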
> When invoking an RPC, the conn sends a sequence of these frames: [InvokeMetadata], Invoke, Message, CloseSend. It then waits for the server to respond with a Message and exits. But the server doesn't send just a Message; it sends a CloseSend as well. If the conn doesn't wait for the stream to really finish (as indicated by CloseSend), the last frame is stuck in the manager's queue, because there is no active stream anymore. This prevents the manager from detecting that the transport was closed.
Ah, I was thinking of the server manager and not the client manager. That makes sense. I think I'm going to look into doing a more general fix at the manager level where it will either continue to deliver or discard stream messages even after the stream is terminated. There are some subtleties with that, and it may not be easy to do, so this is a great thing to merge in the meantime.
I'd love to help out with getting this passing CI by fixing some of the linting failures and stuff, but I don't understand github. Normally, I do all the code reviews and changes through storj's gerrit instance (https://review.dev.storj.io). It's totally fine to use either one (we have bi-directional mirroring set up) and I'm happy to review on either. How do people normally do this sort of thing on github, though? lol
No worries. I fixed the linter errors.
You can review PRs on the "files changed" tab. To comment on a particular line, hover over the line and click on the blue plus sign that appears next to the line. You can approve the PR or request changes from that page as well ("review changes" in the top right). I always create PRs with the "allow edits from maintainers" flag set, so - since you have write access to this repo - you should have write access to the PR branch on my fork. Feel free to push any desired changes.
Regarding changes to the manager: IMO the root cause of this issue is automatically removed once #14 is resolved. Since the manager (or some sort of stream multiplexer) would have to know about which streams are currently active to dispatch frames to them, implementing logic like "this stream doesn't exist; discard the frame" would be appropriate and prevent the manager from ever blocking like this.
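The discard logic being described might look something like this standalone sketch (hypothetical stand-in types, not drpc's manager): frames addressed to a stream the dispatcher no longer knows about are dropped instead of queued, so the read loop never gets stuck behind a frame nobody will consume.

```go
package example

// streamID identifies a stream on a connection.
type streamID uint64

// frame is a stand-in for a wire frame addressed to one stream.
type frame struct {
	stream streamID
	data   []byte
}

// dispatcher tracks which streams are still able to receive frames.
type dispatcher struct {
	active map[streamID]chan frame
}

// dispatch delivers the frame to its stream if that stream still exists and
// silently discards it otherwise, so the caller (the read loop) is never
// left holding a frame for a terminated stream. Delivery and backpressure
// details are out of scope for this sketch.
func (d *dispatcher) dispatch(f frame) {
	ch, ok := d.active[f.stream]
	if !ok {
		return // stream already terminated: drop the frame
	}
	ch <- f
}
```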
> I always create PRs with the "allow edits from maintainers" flag set
Ah, this was what I was missing. Thanks for letting me know!
> Since the manager (or some sort of stream multiplexer) would have to know about which streams are currently active to dispatch frames to them, implementing logic like "this stream doesn't exist; discard the frame" would be appropriate and prevent the manager from ever blocking like this.
Agreed if the multiplexing was internal, but I don't think I want to go that direction. That's a good way of thinking about it, though. It's like that but it'll only have to worry about a single stream existing. :smile: A weird thing to consider is that currently a server could theoretically presend responses to the next RPC a client might make. There's no way that is a useful thing to keep.
So at Storj we have a "two reviewers required" policy. I'm going to pester some folks to get this merged.
The previous behavior left a potential "CloseSend" frame pending on the manager's read queue. This delayed detection of a potential connection loss until the next RPC is started on the connection, which discards the old frame because of the stream ID mismatch.
I'm sure the test can be improved, if only by simplifying testTransport. This change is a requirement for potential solutions to #11.
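One possible simplification, assuming the transport only needs to satisfy an io.ReadWriteCloser-shaped interface (an assumption about drpc.Transport, not something stated in this PR), is to build the test transport on net.Pipe:

```go
package example

import (
	"io"
	"net"
)

// newTestTransport returns two connected in-memory endpoints. Closing either
// end causes reads and writes on the other end to return errors, which is
// exactly the behavior the test needs to provoke.
func newTestTransport() (client, server io.ReadWriteCloser) {
	return net.Pipe()
}
```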