status-im / infra-eth-cluster

Infrastructure for Status-go fleets
https://github.com/status-im/status-go

Research connectivity issues on 2021/01/06 #38

Closed jakubgs closed 3 years ago

jakubgs commented 3 years ago

Yesterday we had a few users reporting `Connecting...` errors in the app. Specifically, the following nodes were mentioned:

Restarting the containers seemingly fixes the issue, at least temporarily.

jakubgs commented 3 years ago

I checked logs on `` for the 6th and found only a small number of errors:

```
ERROR[01-06|11:27:44.472] [mailserver:DeliverMail] error sending historic message response err="could not find peer with ID: daf3afa3943130329d9ea9b5426e3c70a9ad4607cfc144217fb7c6e1025b7de7" peerID=0xdaf3afa3943130329d9ea9b5426e3c70a9ad4607cfc144217fb7c6e1025b7de7 requestID=0x3afe313bc56ced8ae9418b473463be9a26d50fcda344080fc29d0e0e31bc84cd
ERROR[01-06|16:36:43.554] [mailserver:DeliverMail] error while processing err="p2p: read or write on closed message pipe" peerID=0xeb890aab29623d0d8a4a2d4091eaa9ae0b0dd1060a739acc98ab96578eaa4880 requestID=0x530b4fbd64e02dd1c0968431fdd9dec6376ef81d39cd34cb8ebe9ab716ff2c76
ERROR[01-06|16:36:43.555] Error while reporting error response     err="could not find peer with ID: eb890aab29623d0d8a4a2d4091eaa9ae0b0dd1060a739acc98ab96578eaa4880" peerID=0xb890aab29623d0d8a4a2d4091eaa9ae0b0dd1060a739acc98ab96578eaa4880
ERROR[01-06|19:00:11.108] [mailserver:DeliverMail] error while processing err="p2p: read or write on closed message pipe" peerID=0x9c6af51b483cef7591a7110144dc1b5bd8710f3639d231ee0d9245d9e017f7cb requestID=0xcaa996d1916f382eeb8d6e6c5114397608f0f5de3b2c6a679f804ec89e979f60
ERROR[01-06|19:00:11.109] Error while reporting error response     err="could not find peer with ID: 9c6af51b483cef7591a7110144dc1b5bd8710f3639d231ee0d9245d9e017f7cb" peerID=0x9c6af51b483cef7591a7110144dc1b5bd8710f3639d231ee0d9245d9e017f7cb
ERROR[01-06|21:15:34.964] [mailserver:DeliverMail] error while processing err="p2p: read or write on closed message pipe" peerID=0x237a8d7184587f8f045b5129ed7090285d59e44a94ca1bfe038b325ca71e60e8 requestID=0x54cb5094b3957c825bd8303c702a1ea4d5503f962fac2aa0594ce467ad2845e5
ERROR[01-06|21:15:34.965] Error while reporting error response     err="could not find peer with ID: 237a8d7184587f8f045b5129ed7090285d59e44a94ca1bfe038b325ca71e60e8" peerID=0x237a8d7184587f8f045b5129ed7090285d59e44a94ca1bfe038b325ca71e60e8
ERROR[01-06|22:20:57.061] [mailserver:DeliverMail] error while processing err="p2p: read or write on closed message pipe" peerID=0xa8a1839aa4278ed48fcd2d552ec53a46f3410aa48df524b297eeb6ce32a2ee5c requestID=0x9684456cfcb579fee84227ff3b5769a271a5365a1e9dc318c18f953b061cda15
ERROR[01-06|22:20:57.062] Error while reporting error response     err="could not find peer with ID: a8a1839aa4278ed48fcd2d552ec53a46f3410aa48df524b297eeb6ce32a2ee5c" peerID=0xa8a1839aa4278ed48fcd2d552ec53a46f3410aa48df524b297eeb6ce32a2ee5c
ERROR[01-06|23:39:19.433] [mailserver:DeliverMail] error sending historic message response err="p2p: read or write on closed message pipe" peerID=0xe8de5185afb35ab5a1cb8afbec90d985f15c002eafe1526af13fbc374158dbdc requestID=0x63274b83e597f017f2ce38f5126c9453149ad160b30923a7e4e292707766c48a
```

jakubgs commented 3 years ago

Actually, there are many more errors, but they appear at INFO log level:

```
INFO [01-06|23:59:24.468] failed to read a message       peerID=0x258b376f9a35240078daf6be5487006924abfa6219e6e1ca4624ae8d434d7037 error="p2p: read or write on closed message pipe"
INFO [01-06|23:59:24.469] handled peer                   peerID=0x258b376f9a35240078daf6be5487006924abfa6219e6e1ca4624ae8d434d7037 error="p2p: read or write on closed message pipe"
```

This is clearly an issue: I've counted 3325 `read or write on closed message pipe` errors.
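
For reference, a minimal sketch in Go of how such a count can be reproduced; the log file name below is hypothetical and the error string is matched verbatim:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Hypothetical path to an exported container log for 2021-01-06.
	f, err := os.Open("mail-01.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	count := 0
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), "read or write on closed message pipe") {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)
}
```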

jakubgs commented 3 years ago

There's also a bunch of these:

```
WARN	pushnotificationserver/server.go:482	registration did not validate	{"error": "invalid version"}
```

Specifically, 970 of those.

jakubgs commented 3 years ago

Also a bunch of these:

```
DEBUG	common/message_processor.go:393	failed to handle an encryption message	{"site": "handleMessages", "hash": "\u0003
\ufffd\ufffd\u0012x\ufffd~\ufffd\ufffdxJOt\ufffd\ufffd\ufffd(\ufffd\u0000d,\ufffd\u00039|\u0000\ufffd\ufffd", "error": "failed to process an encrypted message: failed to unmarshal ProtocolMessage: proto: field \"encryption.ProtocolMessage.installation_id\" contains invalid UTF-8", "errorVerbose": "proto: field \"encryption.ProtocolMessage.installation_id\" contains invalid UTF-8
failed to unmarshal ProtocolMessage
github.com/status-im/status-go/protocol/v1.(*StatusMessage).HandleEncryption
  /go/src/github.com/status-im/status-go/protocol/v1/status_message.go:124
github.com/status-im/status-go/protocol/common.(*MessageProcessor).handleEncryptionLayer
  /go/src/github.com/status-im/status-go/protocol/common/message_processor.go:440
github.com/status-im/status-go/protocol/common.(*MessageProcessor).HandleMessages
  /go/src/github.com/status-im/status-go/protocol/common/message_processor.go:391
github.com/status-im/status-go/protocol.(*Messenger).handleRetrievedMessages
  /go/src/github.com/status-im/status-go/protocol/messenger.go:2178
github.com/status-im/status-go/protocol.(*Messenger).RetrieveAll
  /go/src/github.com/status-im/status-go/protocol/messenger.go:2113
main.retrieveMessagesLoop
  /go/src/github.com/status-im/status-go/cmd/statusd/main.go:402
runtime.goexit
  /usr/local/go/src/runtime/asm_amd64.s:1357
failed to process an encrypted message
github.com/status-im/status-go/protocol/common.(*MessageProcessor).handleEncryptionLayer
  /go/src/github.com/status-im/status-go/protocol/common/message_processor.go:449
github.com/status-im/status-go/protocol/common.(*MessageProcessor).HandleMessages
  /go/src/github.com/status-im/status-go/protocol/common/message_processor.go:391
github.com/status-im/status-go/protocol.(*Messenger).handleRetrievedMessages
  /go/src/github.com/status-im/status-go/protocol/messenger.go:2178
github.com/status-im/status-go/protocol.(*Messenger).RetrieveAll
  /go/src/github.com/status-im/status-go/protocol/messenger.go:2113
main.retrieveMessagesLoop
  /go/src/github.com/status-im/status-go/cmd/statusd/main.go:402
runtime.goexit
  /usr/local/go/src/runtime/asm_amd64.s:1357"}
```

Exactly 815 on that day.

Could this be the issue, @cammellos?
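
For context, proto3 `string` fields must contain valid UTF-8, and the Go protobuf runtime rejects offending payloads at unmarshal time, which matches the error above. A minimal sketch of that behavior, using a well-known wrapper type rather than the actual `encryption.ProtocolMessage`:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

func main() {
	// Hand-encode field 1 (a proto3 string) with bytes that are not valid UTF-8.
	var raw []byte
	raw = protowire.AppendTag(raw, 1, protowire.BytesType)
	raw = protowire.AppendBytes(raw, []byte{0xff, 0xfe, 0xfd})

	// Unmarshal fails because proto3 strings must be valid UTF-8,
	// producing a "contains invalid UTF-8" error like the one in the log.
	var msg wrapperspb.StringValue
	err := proto.Unmarshal(raw, &msg)
	fmt.Println(err)
}
```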

jakubgs commented 3 years ago

In relation to https://github.com/status-im/infra-eth-cluster/issues/38#issuecomment-756049471, this probably should be an error, or at least a warning:
https://github.com/status-im/status-go/blob/7387049d4b524d3371966fa07ccba3e08b6c652a/waku/v0/peer.go#L188
https://github.com/status-im/status-go/blob/7387049d4b524d3371966fa07ccba3e08b6c652a/waku/v1/peer.go#L188
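
A minimal sketch of why the level matters, assuming the log15-style `log` package from older go-ethereum releases, which the `INFO [01-06|…]` format of the lines above suggests; the messages are illustrative:

```go
package main

import (
	"os"

	"github.com/ethereum/go-ethereum/log"
)

func main() {
	// Filter at WARN: anything below (INFO, DEBUG) is silently dropped.
	handler := log.LvlFilterHandler(log.LvlWarn,
		log.StreamHandler(os.Stderr, log.TerminalFormat(false)))
	log.Root().SetHandler(handler)

	// Logged at INFO, as the lines above are: filtered out entirely.
	log.Info("failed to read a message", "error", "p2p: read or write on closed message pipe")

	// Logged at WARN, as suggested: visible even with a stricter filter.
	log.Warn("failed to read a message", "error", "p2p: read or write on closed message pipe")
}
```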

jakubgs commented 3 years ago

In regard to https://github.com/status-im/infra-eth-cluster/issues/38#issuecomment-756057815, this seems to use a different logger, spawned separately from the normally used one:
https://github.com/status-im/status-go/blob/7387049d4b524d3371966fa07ccba3e08b6c652a/protocol/common/message_processor.go#L396
https://github.com/status-im/status-go/blob/7387049d4b524d3371966fa07ccba3e08b6c652a/protocol/common/message_processor.go#L385
That is why the DEBUG-level message still appears despite the level being set to INFO.
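
A minimal sketch of that failure mode, assuming zap, whose development console encoder produces exactly the kind of tab-separated `DEBUG` lines quoted above; the logger construction here is illustrative, not the actual status-go wiring:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	// The "main" logger is configured at INFO, so its Debug calls are dropped.
	cfg := zap.NewProductionConfig()
	cfg.Level = zap.NewAtomicLevelAt(zapcore.InfoLevel)
	parent, err := cfg.Build()
	if err != nil {
		panic(err)
	}
	defer parent.Sync()
	parent.Debug("dropped: below the configured level") // not printed

	// A logger constructed independently elsewhere in the code has its own
	// core and its own level; setting the first logger to INFO does not
	// affect it, so its DEBUG lines still reach the output.
	independent, err := zap.NewDevelopment()
	if err != nil {
		panic(err)
	}
	defer independent.Sync()
	independent.Debug("failed to handle an encryption message") // printed
}
```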

jakubgs commented 3 years ago

If I look for `handleMessages` I see a major spike in those logs at 3PM UTC: [Kibana screenshot] https://kibana.infra.status.im/goto/c5cd6c8b8196cb05937ceb2e95fa70c4

And if we look at the distribution by `logsource` we can see that mail-01.gc-us-central1-a-eth.prod is a major culprit: [Kibana screenshot]

jakubgs commented 3 years ago

And in metrics we can see that around 3PM there was a spike in delivery errors as well:

[metrics screenshot]

jakubgs commented 3 years ago

As far as I can tell, the issue was caused by an accidental deployment of some port mapping changes to prod.

For more details see: https://github.com/status-im/infra-eth-cluster/issues/35#issuecomment-756159958