Add a keepalive or ping/pong to the websocket connection

dehort commented 4 years ago

We have seen an issue where the websocket connection appears to be severed by a proxy that sits between the receptor node and the cloud-based receptor-controller. In this situation, the receptor-controller notices that the connection has been severed because it gets a read timeout on the ping/pong messages that it's sending. However, the receptor node appears to stay connected to the proxy even though the connection has been severed behind the proxy.

Receptor can likely detect this type of issue by implementing a ping/pong itself.
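To make the idea concrete, here is a minimal sketch of a node-side ping/pong watchdog. It is not Receptor's implementation: `heartbeat`, `demo`, `send`, and `recv` are hypothetical names, and a pair of asyncio queues stands in for the websocket. The simulated peer answers a few pings and then goes silent, much like a proxy that keeps the socket open after the upstream connection is gone.

```python
import asyncio


async def heartbeat(send, recv, interval, timeout):
    """Ping the peer periodically; return "dead" once a pong is missed.

    `send` and `recv` are hypothetical coroutine callables standing in
    for the real connection's write and read methods.
    """
    while True:
        await send(b"ping")
        try:
            reply = await asyncio.wait_for(recv(), timeout)
        except asyncio.TimeoutError:
            return "dead"  # no pong in time: treat the link as severed
        if reply != b"pong":
            return "dead"  # unexpected reply: also treat as a failure
        await asyncio.sleep(interval)


async def demo(pongs_before_silence):
    """Run the watchdog against a peer that stops answering after N pongs."""
    to_peer = asyncio.Queue()
    from_peer = asyncio.Queue()
    pings_sent = 0

    async def peer():
        # Answer a fixed number of pings, then go silent. Later pings are
        # still accepted by the queue but never answered.
        for _ in range(pongs_before_silence):
            await to_peer.get()
            await from_peer.put(b"pong")

    async def send(data):
        nonlocal pings_sent
        pings_sent += 1
        await to_peer.put(data)

    peer_task = asyncio.create_task(peer())
    state = await heartbeat(send, from_peer.get, interval=0.01, timeout=0.05)
    peer_task.cancel()
    return state, pings_sent


if __name__ == "__main__":
    print(asyncio.run(demo(2)))
```

In a real node, the "dead" result would be the cue to close the socket and reconnect, rather than waiting on a connection that the proxy still reports as open.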

dehort commented 4 years ago

When I opened this, I was thinking of having a websocket-specific ping by (possibly) enabling the "heartbeat" option on the aiohttp websocket client. However, I didn't realize there was a "keepalive_interval" setting in receptor. We might be able to use this setting and not need a websocket-specific ping/pong.

jhutar commented 4 years ago

Hello. Could this be related to https://github.com/project-receptor/receptor/issues/199 and/or https://github.com/RedHatInsights/platform-receptor-controller/issues/96 ?

dehort commented 4 years ago

@jhutar I do not think this issue is related to either of those issues. I'm not sure what caused #199, but I suspect the read buffer size is too small for that test in https://github.com/RedHatInsights/platform-receptor-controller/issues/96.

elyezer commented 4 years ago

I tried to verify this on both the devel and release_0.6 branches and was able to see that the configuration was being used as expected. I did the following:

Spawn a 2 node mesh with nodes named controller and node-a:

$ poetry run receptor --debug --node-id=controller -d /tmp/controller node --listen=ws://127.0.0.1:9999 --ws_heartbeat=1

$ poetry run receptor --debug --node-id=node-a -d /tmp/node-a node --listen=ws://127.0.0.1:9998 --peer=ws://localhost:9999 --ws_heartbeat=5

Then used `tcpdump` to verify the heartbeat was sent:

$ sudo tcpdump -i lo -vv "dst port 9999"
dropped privs to tcpdump
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes

13:56:35.402543 IP (tos 0x0, ttl 64, id 27428, offset 0, flags [DF], proto TCP (6), length 58)
    localhost.32980 > localhost.distinct: Flags [P.], cksum 0xfe2e (incorrect -> 0x02fe), seq 177860913:177860919, ack 136016338, win 512, options [nop,nop,TS val 2079655159 ecr 2079649677], length 6
13:56:35.403283 IP (tos 0x0, ttl 64, id 27429, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.32980 > localhost.distinct: Flags [.], cksum 0xfe28 (incorrect -> 0x573c), seq 6, ack 3, win 512, options [nop,nop,TS val 2079655160 ecr 2079655160], length 0

13:56:41.402960 IP (tos 0x0, ttl 64, id 27430, offset 0, flags [DF], proto TCP (6), length 58)
    localhost.32980 > localhost.distinct: Flags [P.], cksum 0xfe2e (incorrect -> 0x115c), seq 6:12, ack 3, win 512, options [nop,nop,TS val 2079661160 ecr 2079655160], length 6
13:56:41.403603 IP (tos 0x0, ttl 64, id 27431, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.32980 > localhost.distinct: Flags [.], cksum 0xfe28 (incorrect -> 0x2854), seq 12, ack 5, win 512, options [nop,nop,TS val 2079661160 ecr 2079661160], length 0

13:56:47.402042 IP (tos 0x0, ttl 64, id 27432, offset 0, flags [DF], proto TCP (6), length 58)
    localhost.32980 > localhost.distinct: Flags [P.], cksum 0xfe2e (incorrect -> 0x5a2b), seq 12:18, ack 5, win 512, options [nop,nop,TS val 2079667159 ecr 2079661160], length 6
13:56:47.402808 IP (tos 0x0, ttl 64, id 27433, offset 0, flags [DF], proto TCP (6), length 52)
    localhost.32980 > localhost.distinct: Flags [.], cksum 0xfe28 (incorrect -> 0xf96b), seq 18, ack 7, win 512, options [nop,nop,TS val 2079667160 ecr 2079667160], length 0

Then I played with the ws_heartbeat option and the configured time was properly used.

With all that said, we can consider this verified.

project-receptor / python-receptor

Add a keepalive or ping/pong to the websocket connection #210