storj / drpc

drpc is a lightweight, drop-in replacement for gRPC

drpcconn: fully read unary RPCs #16

Closed · maxtruxa closed this 3 years ago

maxtruxa commented 3 years ago

The previous behavior left a potential "CloseSend" frame pending on the manager's read queue. This delayed detection of a potential connection loss until the next RPC is started on the connection, which discards the old frame because of the stream ID mismatch.

I'm sure the test can be improved; if only by simplifying testTransport.

This change is a requirement for potential solutions to #11.

maxtruxa commented 3 years ago

For some reason the test case fails in CI when it doesn't on my machine... I won't get around to investigating this further until sometime this weekend.

zeebo commented 3 years ago

> The previous behavior left a potential "CloseSend" frame pending on the manager's read queue. This delayed detection of a potential connection loss until the next RPC is started on the connection, which discards the old frame because of the stream ID mismatch.

I don't understand. The manager will continue to forward messages to any stream that is not yet terminated, so even if a CloseSend is buffered to be delivered, it is fine to be dropped because the only thing a CloseSend can do is terminate a stream, and the stream must have already been terminated.

If you run any drpc code built with the `debug` build tag (`-tags=debug`), it outputs a log of the messages that were delivered. It looks something like this:

    STR[0xc0000a91e0][1]: CloseSend()
    MAN[0xc00011c000]: <s:1 m:1 kind:Invoke data:24>
    MAN[0xc00011c000]: <s:1 m:2 kind:Message data:2>
    STR[0xc000140000][1]: <s:1 m:2 kind:Message data:2>
    MAN[0xc0000fe6c0]: <s:1 m:1 kind:Message data:2>
    STR[0xc0000a91e0][1]: <s:1 m:1 kind:Message data:2>
    STR[0xc000140000][1]: CloseSend()
    MAN[0xc00011c000]: <s:1 m:3 kind:CloseSend data:0>
    STR[0xc000140000][1]: <s:1 m:3 kind:CloseSend data:0>
    MAN[0xc0000fe6c0]: <s:1 m:2 kind:CloseSend data:0>
    STR[0xc0000a91e0][1]: <s:1 m:2 kind:CloseSend data:0>
    STR[0xc000140000][1]: Close()
    STR[0xc0000a91e0][1]: Close()

Maybe that can be helpful in figuring out what's going on?

> For some reason the test case fails in CI when it doesn't on my machine... I won't get around to investigating this further until sometime this weekend.

This is because the test is racy. Try running it locally with `go test ./drpcconn -count=100 -failfast` to get it to fail. The race is that the server portion can send a CloseSend, and the client can receive it and, upon receiving it, transition into the Finished state (there is no active read or write on the stream and both sides have signaled they will no longer send, so the stream is terminated), all before the server test code has called remoteClose, so the connection itself is not yet closed.

I don't think there's a way to fix this. Fundamentally, the underlying transport being closed is asynchronous with respect to any messages being sent or received.

maxtruxa commented 3 years ago

The test would not be racy if I could find a way to assert what I really want to test.

I'll try to explain the issue with `func (c *Conn) doInvoke` a bit better.

When invoking an RPC, the conn sends a sequence of these frames: `[InvokeMetadata] Invoke Message CloseSend`. It then waits for the server to respond with a Message and exits. But the server doesn't send just a Message; it sends a CloseSend as well. If the conn doesn't wait for the stream to really finish (as indicated by the CloseSend), that last frame is stuck in the manager's queue, because there is no active stream anymore. This prevents the manager from detecting that the transport was closed. The manager is only unblocked when a new stream is created.
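For illustration, here is a minimal sketch of the flow and fix being described. The `unaryStream` interface, frame type, and method names are hypothetical stand-ins, not drpc's actual internals; the point is only that the client keeps reading until the server's CloseSend has been consumed instead of stopping at the first Message.

    package sketch

    import "errors"

    type frameKind int

    const (
        kindMessage frameKind = iota
        kindCloseSend
    )

    type frame struct {
        kind frameKind
        data []byte
    }

    // unaryStream is a hypothetical stand-in for the wire-level helpers the
    // conn uses; it is not drpc's real API.
    type unaryStream interface {
        SendInvoke(rpc string) error
        SendMessage(data []byte) error
        SendCloseSend() error
        ReadFrame() (frame, error)
    }

    // invokeUnary mirrors the flow described above: send the client side of
    // the RPC (the optional InvokeMetadata frame is omitted here), then keep
    // reading until the server's CloseSend arrives.
    func invokeUnary(s unaryStream, rpc string, in []byte) ([]byte, error) {
        if err := s.SendInvoke(rpc); err != nil {
            return nil, err
        }
        if err := s.SendMessage(in); err != nil {
            return nil, err
        }
        if err := s.SendCloseSend(); err != nil {
            return nil, err
        }

        var out []byte
        gotMessage := false
        for {
            fr, err := s.ReadFrame()
            if err != nil {
                return nil, err
            }
            switch fr.kind {
            case kindMessage:
                out, gotMessage = fr.data, true
            case kindCloseSend:
                // Only now is the stream fully drained; returning earlier
                // would leave this frame on the manager's read queue and mask
                // a closed transport until the next RPC.
                if !gotMessage {
                    return nil, errors.New("stream closed without a response")
                }
                return out, nil
            }
        }
    }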

maxtruxa commented 3 years ago

I rebased on top of #17 and used `<-conn.Closed()` to wait for the connection to close. This removes the race condition. Without the change to `doInvoke`, the test would hang until `go test` decides to kill it. To prevent long hangs in case of a broken test, the channel read is aborted after one second. On my machine 10,000 iterations take around 400 ms, so that timeout should be large enough to prevent false negatives.
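The wait itself is roughly this shape. The helper name and test wiring are assumptions for illustration; only the `<-conn.Closed()` read and the one-second cap come from the description above.

    package drpcconn_test

    import (
        "testing"
        "time"

        "storj.io/drpc/drpcconn"
    )

    // waitForClose is a test helper sketch (not part of the PR itself): block
    // on <-conn.Closed(), but cap the wait at one second so a broken doInvoke
    // fails fast instead of hanging until go test kills the run.
    func waitForClose(t *testing.T, conn *drpcconn.Conn) {
        t.Helper()

        select {
        case <-conn.Closed():
            // the manager noticed that the transport was closed
        case <-time.After(time.Second):
            t.Fatal("connection close was not detected within 1s")
        }
    }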

zeebo commented 3 years ago

> When invoking an RPC, the conn sends a sequence of these frames: `[InvokeMetadata] Invoke Message CloseSend`. It then waits for the server to respond with a Message and exits. But the server doesn't send just a Message; it sends a CloseSend as well. If the conn doesn't wait for the stream to really finish (as indicated by the CloseSend), that last frame is stuck in the manager's queue, because there is no active stream anymore. This prevents the manager from detecting that the transport was closed.

Ah, I was thinking of the server manager and not the client manager. That makes sense. I think I'm going to look into doing a more general fix at the manager level where it will either continue to deliver or discard stream messages even after the stream is terminated. There are some subtleties with that, and it may not be easy to do, so this is a great thing to merge in the meantime.

I'd love to help out with getting this passing CI by fixing some of the linting failures and stuff, but I don't understand GitHub. Normally I do all the code reviews and changes through Storj's Gerrit instance (https://review.dev.storj.io). It's totally fine to use either one (we have bi-directional mirroring set up) and I'm happy to review on either. How do people normally do this sort of thing on GitHub, though? lol

maxtruxa commented 3 years ago

No worries. I fixed the linter errors.

You can review PRs on the "files changed" tab. To comment on a particular line, hover over it and click the blue plus sign that appears. You can approve the PR or request changes from that page as well ("review changes" in the top right). I always create PRs with the "allow edits from maintainers" flag set, so, since you have write access to this repo, you should also have write access to the PR branch on my fork. Feel free to push any desired changes.

Regarding changes to the manager: IMO the root cause of this issue goes away automatically once #14 is resolved. Since the manager (or some sort of stream multiplexer) would have to know which streams are currently active in order to dispatch frames to them, implementing logic like "this stream doesn't exist; discard the frame" would be appropriate and would prevent the manager from ever blocking like this.
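A rough sketch of that dispatch rule, using a hypothetical multiplexer type that is not drpc's actual manager, might look like this:

    package sketch

    // muxFrame and mux are illustrative assumptions, not drpc types.
    type muxFrame struct {
        streamID uint64
        payload  []byte
    }

    type mux struct {
        active map[uint64]chan muxFrame // currently open streams, keyed by ID
    }

    func (m *mux) dispatch(fr muxFrame) {
        ch, ok := m.active[fr.streamID]
        if !ok {
            // "This stream doesn't exist; discard the frame": dropping it
            // here keeps the read loop moving, so a closed transport is
            // noticed right away instead of behind a stale frame on the queue.
            return
        }
        ch <- fr
    }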

zeebo commented 3 years ago

> I always create PRs with the "allow edits from maintainers" flag set

Ah, this was what I was missing. Thanks for letting me know!

> Since the manager (or some sort of stream multiplexer) would have to know which streams are currently active in order to dispatch frames to them, implementing logic like "this stream doesn't exist; discard the frame" would be appropriate and would prevent the manager from ever blocking like this.

Agreed, if the multiplexing were internal, but I don't think I want to go in that direction. That's a good way of thinking about it, though. It's like that, but it will only have to worry about a single stream existing. :smile: A weird thing to consider is that currently a server could theoretically pre-send responses to the next RPC a client might make. There's no way that is a useful thing to keep.

So at Storj we have a "two reviewers required" policy. I'm going to pester some folks to get this merged.