line / armeria

Your go-to microservice framework for any situation, from the creator of Netty et al. You can build any type of microservice leveraging your favorite technologies, including gRPC, Thrift, Kotlin, Retrofit, Reactive Streams, Spring Boot and Dropwizard.
https://armeria.dev
Apache License 2.0

Expose metrics about HTTP/2 flow control #2738

Open anuraaga opened 4 years ago

anuraaga commented 4 years ago

When debugging issues related to the bandwidth of high-throughput streams, it would be useful to see metrics about HTTP/2 flow control, for example the percentage of time a stream isn't writable, and perhaps debug logging of window change events.

/cc @davidk81

amitvc commented 4 years ago

@anuraaga If you are not already working on it, can I take this issue? Would you be able to give me a few pointers on where to start looking? I am relatively new to the project.

anuraaga commented 4 years ago

Thanks! I haven't explored this deeply yet, but I'm helping troubleshoot low bandwidth in high-latency situations, which is often caused by flow control.

The class of interest is this one

https://netty.io/4.1/api/io/netty/handler/codec/http2/Http2RemoteFlowController.html

Exposing the window size and hooking writability into a gauge both seem useful.

One location that accesses flow control on the server is here

https://github.com/line/armeria/blob/master/core/src/main/java/com/linecorp/armeria/server/HttpServerHandler.java#L291
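A very rough sketch of what hooking those into Micrometer could look like is below; the meter names are made up and the registration point is hypothetical, since the real change would need to be wired into Armeria's HTTP/2 handlers where the Netty Http2Connection is available:

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.netty.handler.codec.http2.Http2Connection;
    import io.netty.handler.codec.http2.Http2RemoteFlowController;

    final class Http2FlowControlMetrics {

        /** Registers flow-control meters for one HTTP/2 connection (hypothetical meter names). */
        static void register(MeterRegistry registry, Http2Connection connection) {
            final Http2RemoteFlowController flowController = connection.remote().flowController();

            // Connection-level send window; it shrinks toward zero while we are flow-control limited.
            Gauge.builder("armeria.http2.connection.outbound.window.bytes", connection,
                          c -> c.remote().flowController().windowSize(c.connectionStream()))
                 .register(registry);

            // 1 while the connection-level stream is writable, 0 while blocked by flow control.
            Gauge.builder("armeria.http2.connection.writable", connection,
                          c -> c.remote().flowController().isWritable(c.connectionStream()) ? 1 : 0)
                 .register(registry);

            // Counts writability flips; a high rate suggests streams are frequently blocked by flow control.
            final Counter writabilityChanges =
                    Counter.builder("armeria.http2.stream.writability.changes").register(registry);
            flowController.listener(stream -> writabilityChanges.increment());
        }
    }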

Hope this helps to get started!

davidk81 commented 4 years ago

Not directly related to exposing HTTP/2 flow control metrics, but I was able to do some troubleshooting with Wireshark and gain some insight into latency-throughput issues when using gRPC.

In my streaming tests, I was able to increase throughput from 110 Mbps to 300 Mbps using just the regular grpc-java client/server library (not yet tested with Armeria/gRPC). My RTT between client and server is around 220 ms.

These were the settings I played around with; the values are most likely overkill, and I have not yet tested which ones are unnecessary or can be lowered without sacrificing throughput.

echo 'net.core.wmem_max=64000000' >> /etc/sysctl.conf
echo 'net.core.rmem_max=64000000' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem= 4000000 8000000 64000000' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem= 4000000 8000000 64000000' >> /etc/sysctl.conf
echo 'net.core.netdev_max_backlog = 20000' >> /etc/sysctl.conf

(settings for grpc-java, both client and server config changed where applicable)

        // All buffers and windows bumped to 64 MB; almost certainly overkill (see note above).
        public static final int BUFFER_SIZE = 64000000;

        ManagedChannel channel = NettyChannelBuilder
                .forAddress(host, port)
                .usePlaintext()
                .maxInboundMessageSize(GrpcExampleServer.BUFFER_SIZE)
                .initialFlowControlWindow(GrpcExampleServer.BUFFER_SIZE)
                .flowControlWindow(GrpcExampleServer.BUFFER_SIZE)
                .retryBufferSize(GrpcExampleServer.BUFFER_SIZE)
                .perRpcBufferLimit(GrpcExampleServer.BUFFER_SIZE)
                .build();
        ExampleServiceGrpc.ExampleServiceStub service = ExampleServiceGrpc
                .newStub(channel)
                .withMaxOutboundMessageSize(GrpcExampleServer.BUFFER_SIZE)
                .withMaxInboundMessageSize(GrpcExampleServer.BUFFER_SIZE);

trustin commented 4 years ago

Thanks a lot for sharing your findings, @davidk81! I'm curious if we can replicate the same result when using Armeria.

davidk81 commented 4 years ago

@trustin @anuraaga It seems Armeria's built-in flow control already scales pretty well. I just had to update the sysctl values on both the client and the server.

I set up a test environment using Armeria/gRPC and was able to achieve over 400 Mbps (up from ~130 Mbps) by increasing the sysctl configs per my last post. The values I set are arbitrary, roughly 10x the default values.

My Armeria/gRPC settings were basically left at the default values (which I believe already include automatic BDP-based window tuning), so I don't think it is necessary to configure the Netty flow control / buffer size settings.

davidk81 commented 4 years ago

Some more testing updates; again, not directly relevant to the OP, but hopefully this helps someone out.

When troubleshooting latency-throughput issues with gRPC, I was able to get over a 3x speedup (110 Mbps -> 360 Mbps) by applying these settings to the host OS on the client and server. The RTT between them is ~230 ms. The test was done with a 1 GB file transfer, and a throughput ramp-up was observed, probably due to TCP and gRPC BDP window updates.

# on agent sending data
echo 'net.ipv4.tcp_wmem= 4096 16384 25165824' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max=1048576' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# on agent receiving data
echo 'net.ipv4.tcp_rmem= 4096 131072 25165824' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

davidk81 commented 4 years ago

@anuraaga @trustin Hi, I'm back. I've encountered another performance-related issue, although not the same as the original post; please let me know if I should open a new issue or if there are any tips for troubleshooting.

We are using Armeria 0.99.2.

The issue I am now facing is something I have coined "streaming throughput coupling".
The symptom is that when multiple gRPC streams are transferring data from the same server pod to different clients, there is a good probability that all streams will transfer data at the same rate.

i.e. I have tested this in a prod-like environment using one streaming-server pod and multiple simultaneous client streams:

When both clients stream at the same time, there is a good likelihood (>5%) that both streams will yield the 'lower' throughput of 110 Mbps.

To make matters worse, when one stream disconnects mid-stream, the other stream also gets disconnected by the server. (edit: the other stream typically disconnects after ~10 sec)

I'm looking for any tips to troubleshoot this issue; I'm not sure what information would help here.

I realize this is complicated to analyze; for now I've ruled out the possibility of VM/network issues (everything is on GCP). I've also tested this using the default sysctl TCP buffer size settings (4 MB max), and I've set up the server pod to run on its own GKE node to minimize the possibility of 'outside' interference.

Thanks so much in advance! Tagging @danada.

anuraaga commented 4 years ago

Could you provide more detail about the network topology?

It could be good to verify the metrics for Armeria's active connections (beware of health checks). It would be extremely strange if there really are two connections; presumably some hop along the way is multiplexing onto one. That doesn't seem like it should cause the problem anyway, but it's good to confirm.

davidk81 commented 4 years ago

Thanks for the prompt reply @anuraaga

A GCP TCP load balancer is used for this test (I have yet to test using an HTTP LB endpoint). TLS is terminated at our "api" service pod, which is a passthrough to the "streamer-server". Both pods are fixed to run on the same dedicated GKE node. No other services are running on the node except for daemonsets (but that does include our standard monitoring services: fluentd, prometheus, etc.).

The grpc-go client (message consumer) is also running directly on a GCP VM, attached to the same VPC as our server (tested in both the same and different regions).

Additionally, the same behavior happens for our client user, who tested from a different GCP account as well as from their in-house network.

I have also tried to analyze the tracing profiler output of our grpc-go client, and it seems a lot of time is spent waiting for data to arrive from the server side. Hence I'm now trying to get some visibility into the transport stack on the server side to gain more insight.

I'd like to add another symptom: we have implemented a 'client message ack' protocol, and our server-side logs show that when throughput is lower than expected, the 'client acks' take significantly longer (up to 20 seconds longer in some cases) to be received by our server code.

e.g.:
server -> (msg id, data) -> client : start time is the TS when the message is handed off to the server-side Armeria stream observer
client -> (msg id ack) -> server : end time is the TS when the ack is received on the server side by our in-house stream observer (up to 20 seconds after the start time in extreme cases)

Thus, I'm currently trying to find what's happening within that "20 seconds" for one of the streams that should be capable of much higher throughput.

Hope that is clear; thanks again!

edit: I have updated the previous post to clarify the "coupled stream" disconnects; the other stream typically disconnects after ~10 sec.

anuraaga commented 4 years ago

@trustin @minwoox @ikhoon any pointers on how to debug this? I imagine it must be an issue with two streams on one connection; maybe there are ideas on where we can add more logging or metrics to understand this. It would be cool if Armeria could handle these hundred-Mbps streams :)

minwoox commented 4 years ago

I have no idea yet. 🤣 So the client ack is implemented using bidi streaming and it took 20 seconds from the client to the server?

trustin commented 4 years ago

I have no idea, but perhaps we can replicate the problem locally by using some bandwidth-limiting middleman?

@davidk81 Did you increase the HTTP/2 connection window size as well for Armeria clients and servers? If not, what happens if you increase the connection window to a larger value? You can do that programmatically by calling http2InitialConnectionWindowSize() in ClientFactoryBuilder and ServerBuilder.
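For reference, a minimal sketch of setting it on both sides (ClientFactory and Server here are Armeria's com.linecorp.armeria.client.ClientFactory and com.linecorp.armeria.server.Server; 16 MiB is an arbitrary example value, sized roughly to bandwidth x RTT):

        public static final int CONNECTION_WINDOW = 16 * 1024 * 1024; // arbitrary example

        ClientFactory clientFactory = ClientFactory.builder()
                .http2InitialConnectionWindowSize(CONNECTION_WINDOW)
                .build();

        ServerBuilder sb = Server.builder()
                .http2InitialConnectionWindowSize(CONNECTION_WINDOW);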

davidk81 commented 4 years ago

@anuraaga Regarding your comment "two streams on one connection": is it possible for this to happen even though the two streams originate from different GCP VMs? I can confirm the streams are on two different TCP connections from the client's perspective, but maybe you are talking about a different connection concept here (within Armeria?).

@minwoox Correct, the client ack was implemented using bidi streaming. It was implemented so we can keep track of the "last message received successfully by the client" and resume a stream without data loss. -- This might be a separate topic, but we are also using it to limit the amount of data "on the wire" at high throughput, to better manage memory usage by the JVM (simply checking "isReady" seems to leave a few hundred MB of data buffered in our case). Currently we are soft-limiting to 32 MB of data (per stream) "on the wire", which can achieve 400+ Mbps over short-RTT networks.
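For illustration only, such a soft limit could be shaped roughly like the sketch below; this is a guess at the general mechanism, not our actual in-house code, and the blocking call must never run on an Armeria/Netty event loop:

    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    /** Soft-limits the number of unacknowledged bytes "on the wire" for one stream. */
    final class AckWindow {
        private final long maxUnackedBytes; // e.g. 32 * 1024 * 1024
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition belowLimit = lock.newCondition();
        private long unackedBytes;

        AckWindow(long maxUnackedBytes) {
            this.maxUnackedBytes = maxUnackedBytes;
        }

        /** Blocks until the window has room, then reserves it. Must not be called on an event loop. */
        void beforeSend(int messageSize) throws InterruptedException {
            lock.lock();
            try {
                while (unackedBytes + messageSize > maxUnackedBytes) {
                    belowLimit.await();
                }
                unackedBytes += messageSize;
            } finally {
                lock.unlock();
            }
        }

        /** Called when the client acks a message covering {@code messageSize} bytes. */
        void onAck(int messageSize) {
            lock.lock();
            try {
                unackedBytes -= messageSize;
                belowLimit.signalAll();
            } finally {
                lock.unlock();
            }
        }
    }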

@trustin We did not change http2InitialConnectionWindowSize or any other window sizes for this test. The kernel's sysctl TCP rmem and wmem were also at the default value of 4 MB. Is the http2InitialConnectionWindowSize setting shared across gRPC connections, and would changing that value help?

Aside from the throughput issue, the other biggest mystery I have is why disconnecting one gRPC connection would cause a different gRPC connection (originating from another VM) to disconnect. I'm trying to figure out whether this is an OS/kernel-level issue or something happening within Armeria.

Thanks all again for following up. I will try to replicate this in a standalone server/client example; I was hoping to get some more confirmation of how likely it is that this is caused by Armeria before investing time in designing and coding a standalone experiment.

anuraaga commented 4 years ago

@davidk81

"two streams on one connection"

Yeah, I was referring to a TCP connection. From what I can tell, even with two connections to the API server, there can be just one connection between the API server and the streamer server, right? It'd be nice to confirm whether this is happening, maybe by looking at Armeria's active request metrics. If two TCP connections can produce coupling behavior, that's extremely surprising. But if the streams are multiplexed onto one connection, it seems conceivable that it's an issue with the stream handling in Armeria.

davidk81 commented 4 years ago

@anuraaga I wasn't aware of this possibility; thanks for mentioning it, and I'll try to see if this is happening.

trustin commented 4 years ago

Is the http2InitialConnectionWindowSize setting shared across GRPC connections?

It's not shared between connections. Each connection has its own connection window size.

and would changing that value help?

I'm not 100% sure, but thought it may be worth trying.

If two TCP connections can produce coupling behavior that's extremely surprising.

+1

davidk81 commented 4 years ago

@anuraaga Studying the metrics, it seems both "data consumer" gRPC streams do indeed share the same backend TCP connection via the API service, which is something I was not aware of before; thanks for the insight. This would explain some degree of the throughput and disconnect coupling we are experiencing.

e.g.:
grpc-client-1 ---+--- API Server --- Data Streamer Server
grpc-client-2 ---/

Interesting finding regarding the "coupled disconnect": when the first consumer stream (e.g. grpc-client-1) disconnects willingly, the second consumer stream (grpc-client-2) gets forcefully disconnected by the API almost exactly 30 seconds later:

14:51:49.437 --- grpc-client-1 (grpc-go) disconnects via ctrl-c
14:52:19.473 --- grpc-client-2 gets disconnected by API

This appears to be the exception thrown during the forceful disconnect:

com.linecorp.armeria.common.ClosedSessionException: null
    at com.linecorp.armeria.common.ClosedSessionException.get(ClosedSessionException.java:36) ~[armeria-0.99.2.jar:?]
    at com.linecorp.armeria.client.HttpSessionHandler.getPendingException(HttpSessionHandler.java:371) ~[armeria-0.99.2.jar:?]
    at com.linecorp.armeria.client.HttpSessionHandler.channelInactive(HttpSessionHandler.java:336) [armeria-0.99.2.jar:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) [netty-transport-4.1.48.Final.jar:4.1.48.Final]

trustin commented 4 years ago

Two different connections (grpc-client-1 and grpc-client-2) should never be affected by each other in my opinion. Is the 'API server' an Armeria server as well? Which is initiating the disconnection from grpc-client-2, 'API server' or 'Data streamer server'?

davidk81 commented 4 years ago

@trustin The API is also an Armeria service. I'm not too familiar yet with how the two streams are multiplexed; I'll look into this as well.

Also, I'm not sure how to determine which service initiated the disconnect. Here is the log from the "Streamer Server", but note the timestamps may not be indicative of the sequence of events (the pods are on different GKE nodes at the moment). I'll try to do some testing in a more controlled environment when I get the chance.

2020-09-30 14:52:19.480

com.linecorp.armeria.common.stream.CancelledSubscriptionException, scheme=gproto+h2, name=<redacted>/OpenStream, headers=[:method=POST, :path=/<redacted>/OpenStream, :authority=streamer-server-qa.telemetry-prod.svc.cluster.local:8080, :scheme=https, content-type=application/grpc, user-agent=grpc-go/1.29.1

...

javax.net.ssl.SSLException: SSLEngine closed already
    at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:829) ~[netty-handler-4.1.48.Final.jar:4.1.48.Final]
    at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:792) ~[netty-handler-4.1.48.Final.jar:4.1.48.Final]
    at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:773) ~[netty-handler-4.1.48.Final.jar:4.1.48.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750) ~[netty-transport-4.1.48.Final.jar:4.1.48.Final]

(edit) Just to make clear, the "Streamer Server" pod did not restart. Our client (e.g. grpc-client-2) has built-in auto reconnect, and it was able to re-establish a connection immediately to the same API and streamer pod to complete the data stream.

anuraaga commented 4 years ago

@trustin By the way, this might or might not be a hint, but I guess the gateway is using our standard "one line proxy" approach of passing the incoming HttpRequest to the backend WebClient.execute call. Is it possible that when the client closes the connection with the gateway, that also ends up closing the connection to the backend, instead of only the HTTP/2 stream, since it somehow forwards that "as-is"?

Also looks like it is TLS both to the gateway and to the backend.
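For context, the "one line proxy" approach mentioned above is roughly the sketch below (the backend URI and wiring are assumptions, not the actual gateway code):

    import com.linecorp.armeria.client.WebClient;
    import com.linecorp.armeria.server.HttpService;

    // Hypothetical backend URI; TLS/endpoint configuration omitted.
    WebClient backend = WebClient.of("http://streamer-server:8080/");

    // The incoming request (headers plus streamed body) is forwarded as-is,
    // and the backend's streamed response is returned to the original caller.
    HttpService proxy = (ctx, req) -> backend.execute(req);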

anuraaga commented 4 years ago

@davidk81 Also, while there does seem to be something wrong going on here that we should get to the bottom of, if you need a quick workaround, I think increasing the number of event loops per endpoint can help:

https://github.com/line/armeria/blob/master/core/src/main/java/com/linecorp/armeria/client/ClientFactoryBuilder.java#L160

For HTTP/2, we usually use just one since multiplexing has less overhead. But it doesn't seem to be working properly in this case, so if you increase that number, I think it will increase the number of connections to the backend and avoid the coupling.
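Roughly something like this, assuming the maxNumEventLoopsPerEndpoint() setter around the linked line (4 is just an example value):

    import com.linecorp.armeria.client.ClientFactory;

    // More event loops per endpoint should also allow more connections to the same backend,
    // instead of multiplexing every stream onto a single connection.
    ClientFactory factory = ClientFactory.builder()
            .maxNumEventLoopsPerEndpoint(4)
            .build();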

trustin commented 4 years ago

@anuraaga An aborted stream only aborts a stream in HTTP/2. Armeria client never closes a connection except when it's idle or its ClientFactory is closed. I would be very surprised if an aborted stream closes a connection.

davidk81 commented 4 years ago

Thanks everyone for the input; I have done more testing based on the additional insights.

1. Regarding the coupled disconnect: this could be caused by our in-house code improperly blocking an event loop.

Using the same test as before:
grpc-client-1 ---+--- API Server (event-loop-1) --- Data Streamer Server (event-loop-2)
grpc-client-2 ---/

After grpc-client-1 disconnects, it appears to cause event-loop-2 on the streamer to shut down (but only for high-throughput streams), leading to the disconnection of grpc-client-2. We have some code that blocks on the "streamer" when a client disconnects (fix pending). I don't have solid evidence yet of event-loop-2 being shut down (no logs), but every subsequent reconnect seems to land on a different event loop, leading me to suspect the original event loop is unavailable.

For slow streams (<40 Mbps) the disconnects do not seem to happen; instead, there is sometimes a pause in streaming when the other stream disconnects (likely due to event loop blocking).

@anuraaga I haven't tested changing the number of event loops yet, as it's not trivial to change this in our current in-house code.

2. Regarding the coupled throughput: it seems that when both grpc-client-1 and 2 connect via event-loop-1 on the API and event-loop-2 on the streamer, the throughput is 'coupled' to match the lower "potential" of the two streams. Just a wild guess: is it possible for Armeria's window sizing algorithm to adapt only to the slowest stream? One client has ~230 ms network latency and the other <50 ms.

Again, I have not yet tested the case where both clients connect to different event loops within the API and streamer; I'll try to do that when I get the chance.

Thanks again, everyone!