status-im / status-go

The Status module that consumes go-ethereum
https://status.im
Mozilla Public License 2.0

SpiffWorkflow node not sending messages #5195

Closed. jakubgs closed this issue 4 months ago.

jakubgs commented 4 months ago

Problem

The SpiffWorkflow node appears to have issues sending messages. Spiff shows errors like:

{
  "level": "ERROR",
  "message": "Error executing Service Task. Received error code ReadTimeout from service waku/SendMessage ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host=test.app.spiff.status.im, port=443): Read timed out. (read timeout=45). ",
  "loggerName": "spiffworkflow_backend",
  "processName": "MainProcess",
  "processID": 36,
  "threadName": "ThreadPoolExecutor-0_9",
  "threadID": 139861463910080,
  "timestamp": "2024-05-20T13:11:21.225Z"
}

But debugging this is difficult because as far as I can tell 98% of the node logs are errors:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep -v ERROR /var/log/docker/spiff-workflow-wakunode/docker.* | wc -l
267247
jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep ERROR /var/log/docker/spiff-workflow-wakunode/docker.* | wc -l 
17513290
jakubgs commented 4 months ago

The error reported from the UI is:

"message_text": "There was an exception running process model manage-procurement/procurement/requisition-order-management/request-goods-services for instance 3256.\nOriginal Error:\nWorkflowExecutionServiceError("Error executing Service Task. Received error code 'ReadTimeout' from service 'waku/SendMessage' ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host='test.app.spiff.status.im', port=443): Read timed out. (read timeout=45)")"

And if we look at the time when this was generated:

[screenshot]

Nothing can be found in the backend-api logs, and the connector logs show only this:

[screenshot]

Utterly useless.

jakubgs commented 4 months ago

This is the only thing I can find that matches:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep '07:48:2[23]' /var/log/docker/spiff-workflow-wakunode/docker.log
ERROR[05-21|07:48:22.564|github.com/status-im/mvds/node/node.go:364]                                         error processing acks                    namespace=mvds error="state not found"
ERROR[05-21|07:48:23.272|github.com/status-im/mvds/node/node.go:446]                                         Error while removing sync state.         namespace=mvds error="state not found"
ERROR[05-21|07:48:23.272|github.com/status-im/mvds/node/node.go:364]                                         error processing acks                    namespace=mvds error="state not found"
ERROR[05-21|07:48:23.276|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|07:48:23.276|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
jakubgs commented 4 months ago

The error appears to be coming from here: https://github.com/status-im/status-go/blob/6f1b82966a12ca40e6f4e5b3991f17fa24690404/protocol/messenger_peersyncing.go#L359-L364 And more specifically from: https://github.com/status-im/status-go/blob/6f1b82966a12ca40e6f4e5b3991f17fa24690404/protocol/common/message_segmentation.go#L35-L37 Assuming it is indeed the error I'm after, which is hard to be sure of considering most of the logs are errors.

jakubgs commented 4 months ago

The logs are full of all kinds of other errors:

ERROR[05-21|08:14:26.725|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:26.725|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:30.276|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:30.276|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:30.496|github.com/status-im/status-go/protocol/pushnotificationclient/client.go:900]       could not retrieve message               error="record not found"
ERROR[05-21|08:14:36.318|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:14:39.127|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:14:39.127|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:14:39.668|github.com/status-im/status-go/protocol/pushnotificationclient/client.go:900]       could not retrieve message               error="record not found"

No idea how relevant these are, or if it's just noise.

jakubgs commented 4 months ago

There's also shit like this:

ERROR[05-21|08:15:22.523|rpc/service.go:200]                                                                 RPC method wakuext_addContact crashed: runtime error: slice bounds out of range [:2] with length 1
goroutine 322642907 [running]:
github.com/ethereum/go-ethereum/rpc.(*callback).call.func1()
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/service.go:199 +0x89
panic({0x278d760, 0xc04fc28558})
#011/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/status-im/status-go/api/multiformat.multibaseDecode({0x4864f90?, 0xc000f673b8?})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:242 +0x7f
github.com/status-im/status-go/api/multiformat.DeserializePublicKey({0x4864f90?, 0x483f0a0?}, {0x3145710, 0x1})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:50 +0x35
github.com/status-im/status-go/api/multiformat.DeserializeCompressedKey({0x4864f90?, 0x0?})
#011/go/src/github.com/status-im/status-go/api/multiformat/utils.go:87 +0x36
github.com/status-im/status-go/protocol/requests.ConvertCompressedToLegacyKey(...)
#011/go/src/github.com/status-im/status-go/protocol/requests/send_contact_request.go:35
github.com/status-im/status-go/protocol/requests.(*AddContact).HexID(0x1de8e002658780?)
#011/go/src/github.com/status-im/status-go/protocol/requests/add_contact.go:26 +0x34
github.com/status-im/status-go/protocol.(*Messenger).AddContact(0x2694a60?, {0x316b058, 0xc036a67d10}, 0xc036a67d60)
#011/go/src/github.com/status-im/status-go/protocol/messenger_contacts.go:650 +0x85
github.com/status-im/status-go/services/ext.(*PublicAPI).AddContact(...)
#011/go/src/github.com/status-im/status-go/services/ext/api.go:863
reflect.Value.call({0xc00054f740?, 0xc00052ae78?, 0x7fcd524912d8?}, {0x2902f65, 0x4}, {0xc036a67db0, 0x3, 0x0?})
#011/usr/local/go/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0xc00054f740?, 0xc00052ae78?, 0x16?}, {0xc036a67db0?, 0x1?, 0x1?})
#011/usr/local/go/src/reflect/value.go:370 +0xbc
github.com/ethereum/go-ethereum/rpc.(*callback).call(0xc0007bb740, {0x316b058?, 0xc036a67d10}, {0xc04fc28540, 0x12}, {0xc05c6035a8, 0x1, 0x4e2997?})
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/service.go:205 +0x3c5
github.com/ethereum/go-ethereum/rpc.(*handler).runMethod(0xc05385bcc0?, {0x316b058?, 0xc036a67d10?}, 0xc011c65180, 0x1?, {0xc05c6035a8?, 0xc028a11cb0?, 0x90?})
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:391 +0x45
github.com/ethereum/go-ethereum/rpc.(*handler).handleCall(0xc028a11b00, 0xc06f78ef30, 0xc011c65180)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:336 +0x239
github.com/ethereum/go-ethereum/rpc.(*handler).handleCallMsg(0xc028a11b00, 0x30?, 0xc011c65180)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:297 +0x237
github.com/ethereum/go-ethereum/rpc.(*handler).handleMsg.func1(0xc06f78ef30)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:138 +0x35
github.com/ethereum/go-ethereum/rpc.(*handler).startCallProc.func1()
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:225 +0xc5
created by github.com/ethereum/go-ethereum/rpc.(*handler).startCallProc
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/rpc/handler.go:221 +0x8d
jakubgs commented 4 months ago

But I sent a test message to someone using this command:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "chats": [
      {
        "id": "0x04a0dcd3fc989e897089587d1d227c91d5c3898f4e74d72c6c076a18113d7c469d469f276ac599b5ce6c147a708689c3340d5cdd8e48c8dd6148477fbaec5d7942",
        "name": "0x04a0dc",
        "description": "",
        "color": "",
        "emoji": "",

And it worked. So the message delivery works fine.

jakubgs commented 4 months ago

But I did manage to get it to time out a few times:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 5000 milliseconds with 0 bytes received

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % c -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 5001 milliseconds with 0 bytes received

Which suggests an issue does indeed exist. But the logs show me nothing useful.
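
For completeness, here is a minimal Python sketch of the same reproduction (an illustration only, not part of the original debugging; it assumes the node's RPC endpoint is reachable on localhost:8545, as in the curl commands above):

import time
import requests

# Same JSON-RPC call as in the curl examples above.
payload = {
    "jsonrpc": "2.0",
    "method": "wakuext_sendOneToOneMessage",
    "params": [{"id": "zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ",
                "message": "TEST MESSAGE"}],
    "id": 1,
}

start = time.monotonic()
try:
    resp = requests.post("http://localhost:8545", json=payload, timeout=5)
    print(f"OK {resp.status_code} after {time.monotonic() - start:.2f}s")
except requests.exceptions.Timeout:
    # Mirrors the curl "(28) Operation timed out" case above.
    print(f"timed out after {time.monotonic() - start:.2f}s")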

jakubgs commented 4 months ago

And the host is definitely not overloaded:

[screenshot of host metrics]

jakubgs commented 4 months ago

What's interesting is that I stopped the connector container, which should be responsible for sending Waku messages, and I still see garbage like this in the status-go node logs:

ERROR[05-21|08:31:36.328|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:31:50.984|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:31:50.984|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:17.635|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"
ERROR[05-21|08:32:36.328|github.com/status-im/status-go/protocol/messenger_curated_communities.go:49]        failed to get curated communities from contract error="could not find network: 10"
ERROR[05-21|08:32:52.261|github.com/status-im/status-go/protocol/messenger_peersyncing.go:362]               payload is invalid 
ERROR[05-21|08:32:52.261|github.com/status-im/status-go/protocol/datasync/transport.go:58]                   failed to send message                   error="payload is invalid"

Which suggests both payload is invalid and could not find network: 10 are not caused by messages being sent.

jakubgs commented 4 months ago

There are lots of different types of errors:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep -v peerID /var/log/docker/spiff-workflow-wakunode/docker.log | awk '/ERROR/{$1=$2=$3=""; a[$0]++}END{for (x in a) { printf "%d -> %s\n", a[x], x}}' | sort -h | tail    
344 ->    failed to sync filter error=EOF
371 ->    error syncing topics error=EOF
517 ->    failed to get curated communities from contract error="could not find network: 10"
987 ->    could not retrieve message error="record not found"
1150 ->    failed to handle an encrypted message site=handleEncryptionLayer error="failed to handle Encryption message: can't skip previous chain message keys: bad until: probably an out-of-order message that was deleted"
1353 ->    failed to send message error="payload is invalid"
1353 ->    payload is invalid
1675 ->    error processing acks namespace=mvds error="state not found"
1675 ->    Error while removing sync state. namespace=mvds error="state not found"
9662 ->    failed to decode message err="couldn't decrypt using asymmetric key: ecies: invalid message"

No idea which of these are relevant and which are not. The fact that message delivery sometimes works suggests that most of them mean nothing.

jakubgs commented 4 months ago

Here are the same stats, but grouped by the source file from which the errors originate:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % grep -v peerID /var/log/docker/spiff-workflow-wakunode/docker.log | awk -F'[][|:]' '/ERROR/{a[$12]++}END{for (x in a) { printf "%d -> %s\n", a[x], x}}' | sort -h
19 -> github.com/status-im/status-go/protocol/messenger_raw_message_resend.go
27 -> github.com/status-im/status-go/protocol/messenger_mailserver_cycle.go
87 -> rpc/service.go
396 -> github.com/waku-org/go-waku/waku/v2/protocol/legacy_store/waku_store_client.go
524 -> github.com/status-im/status-go/protocol/messenger_curated_communities.go
999 -> github.com/status-im/status-go/protocol/pushnotificationclient/client.go
1208 -> github.com/status-im/status-go/protocol/common/message_sender.go
1366 -> github.com/status-im/status-go/protocol/messenger_peersyncing.go
1370 -> github.com/status-im/status-go/protocol/datasync/transport.go
1465 -> github.com/status-im/status-go/protocol/messenger_mailserver.go
3530 -> github.com/status-im/mvds/node/node.go
9815 -> github.com/status-im/status-go/wakuv2/common/message.go
jakubgs commented 4 months ago

I can reproduce the timeout even with an 8-second limit:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:/docker/spiff-workflow % c --max-time 8 -XPOST 0:8545 -d '{"jsonrpc":"2.0","method":"wakuext_sendOneToOneMessage","params":[{"id":"zQ3shQsUWc9yQMba7M3vyJxuw1wLK6jcrg3YMshrvazA3iiGM","message":"TEST MESSAGE"}], "id": 1}' -H 'Content-type: application/json' | jq
curl: (28) Operation timed out after 8001 milliseconds with 0 bytes received

But less frequently. It happens more often with a 7-second limit.

jakubgs commented 4 months ago

If we look at the connector code, we can see the request is made with requests.post() without any timeout specified:

            raw_response = requests.post(url, json.dumps(request_body), headers=headers)

https://github.com/status-im/spiffworkflow-connector/blob/9b01806811b295c52d50c8cb6ad95c1444fd208b/connectors/connector-waku/connector_waku/commands/sendMessage.py#L54

And according to the docs, that means no timeout at all:

By default, requests do not time out unless a timeout value is set explicitly. Without a timeout, your code may hang for minutes or more.

https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

Which would contradict my hypothesis. But there are other timeouts involved, like the frontend one.
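
For illustration only, a hedged sketch of what passing an explicit timeout to that connector call could look like (not the actual connector code; the wakunode:8545 endpoint appears later in the connector logs, and the request body and timeout values here are placeholders):

import json
import requests

url = "http://wakunode:8545/"  # endpoint seen later in the connector logs
headers = {"Content-Type": "application/json"}
request_body = {  # placeholder body; the real one is built by SendMessage
    "jsonrpc": "2.0",
    "method": "wakuext_sendOneToOneMessage",
    "params": [{"id": "zQ3sh...", "message": "TEST"}],
    "id": 1,
}

try:
    raw_response = requests.post(
        url,
        json.dumps(request_body),
        headers=headers,
        timeout=(5, 30),  # (connect, read) in seconds -- example values only
    )
    raw_response.raise_for_status()
except requests.exceptions.Timeout as ex:
    # With the current code this would instead hang; with a timeout it would
    # land in the connector's generic `except Exception` branch as a 500.
    print("request timed out:", ex)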

jakubgs commented 4 months ago

It appears the Read timed out. (read timeout=45) message comes from CONNECTOR_PROXY_COMMAND_TIMEOUT:

CONNECTOR_PROXY_COMMAND_TIMEOUT = 45

https://github.com/sartography/spiff-arena/blob/822e4f499b90736d53483d96f6bea08628f3d0e6/spiffworkflow-backend/src/spiffworkflow_backend/config/__init__.py#L13

Which is then used by requests.post():

proxied_response = requests.post(call_url, json=params, timeout=CONNECTOR_PROXY_COMMAND_TIMEOUT)

https://github.com/sartography/spiff-arena/blob/5840db56cfc471fff54930cf54880131f4586ea1/spiffworkflow-backend/src/spiffworkflow_backend/services/service_task_service.py#L201
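
Putting the two together, a minimal standalone sketch of the failure mode (hypothetical, not spiff-arena code; the URL is assembled from the hostname in the UI error and the path seen in the connector logs). If the connector-proxy takes longer than 45 seconds to answer, this requests.post raises ReadTimeout, which is exactly the string that ends up in the UI error:

import requests

CONNECTOR_PROXY_COMMAND_TIMEOUT = 45  # value from the spiff-arena config above

# Hypothetical URL for illustration only.
call_url = "https://test.app.spiff.status.im/connector-proxy/v1/do/waku/SendMessage"

try:
    requests.post(call_url, json={}, timeout=CONNECTOR_PROXY_COMMAND_TIMEOUT)
except requests.exceptions.ReadTimeout as ex:
    # The exception text matches what the UI reports:
    # HTTPSConnectionPool(host='test.app.spiff.status.im', port=443):
    #   Read timed out. (read timeout=45)
    print("ReadTimeout:", ex)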

cammellos commented 4 months ago

So far, it looks like the problem is CPU related. Usage is constantly around 100%, and the node seems to be stuck in a loop of fetching messages from store nodes (I can't tell yet if it's a loop or there's just a lot of data, investigating).

jakubgs commented 4 months ago

Andrea made a PR to reduce CPU usage by disabling fetching of messages:

But even that is not enough to prevent the status-go node for Spiff from abusing the CPU for no reason:

[screenshot of CPU usage]

jakubgs commented 4 months ago

I have added extra logging to the connector container code on the app-test env:

diff --git a/connectors/connector-waku/connector_waku/commands/sendMessage.py b/connectors/connector-waku/connector_waku/commands/sendMessage.py
index 09a81bf..4d59e32 100644
--- a/connectors/connector-waku/connector_waku/commands/sendMessage.py
+++ b/connectors/connector-waku/connector_waku/commands/sendMessage.py
@@ -51,7 +51,10 @@ class SendMessage:
         status_code = None
         successful = False
         try:
+            print('POST', url, request_body)
             raw_response = requests.post(url, json.dumps(request_body), headers=headers)
+            print('RESPONSE:', raw_response.status_code)
+            print('RESPONSE TEXT:', raw_response.text)
             raw_response.raise_for_status()
             status_code = raw_response.status_code
             parsed_response = json.loads(raw_response.text)
@@ -59,9 +62,11 @@ class SendMessage:
             if not self.response_has_error(response) and status_code == 200:
                 successful = True
         except HTTPError as ex:
+            print('HTTPError:', ex)
             status_code = ex.response.status_code
             response['error'] = str(ex)
         except Exception as ex:
+            print('Exception:', ex)
             response['error'] = str(ex)
             status_code = 500
         return (response, status_code, successful)

It's not pretty but it works. I want to see how it actually fails and what the timings are. So far I was unable to reproduce a failure.

jakubgs commented 4 months ago

I was unable to reproduce the issue using this process: https://test.app.spiff.status.im/process-models/misc:test-waku-kb

So I'm going to leave reproducing it with the new logging to the Spiff testers.

jakubgs commented 4 months ago

Some failed process instances were found on app-test:

jakubgs commented 4 months ago

What's fascinating is that if we look at https://test.app.spiff.status.im/i/637, which started at 08:18:57 and failed at 08:19:42, and we check the connector logs around the start time, we see:

[2024-05-27 06:18:57 +0000] [143966] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:19:43 +0000] [143697] [DEBUG] POST /connector-proxy/v1/do/http/GetRequest
172.17.3.1 - - [27/May/2024:06:19:43 +0000] "POST /connector-proxy/v1/do/http/GetRequest HTTP/1.0" 200 176 "-" "python-requests/2.31.0"
[2024-05-27 06:19:44 +0000] [143937] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:20:28 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143966)
[2024-05-27 06:20:28 +0000] [143966] [INFO] Worker exiting (pid: 143966)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:20:28 +0000] [161550] [INFO] Booting worker with pid: 161550
[2024-05-27 06:20:29 +0000] [143697] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:20:47 +0000] [161550] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:21:15 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143937)
[2024-05-27 06:21:15 +0000] [143937] [INFO] Worker exiting (pid: 143937)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:21:15 +0000] [161578] [INFO] Booting worker with pid: 161578
[2024-05-27 06:21:15 +0000] [161578] [DEBUG] Ignoring EPIPE
[2024-05-27 06:21:48 +0000] [161550] [DEBUG] Ignoring EPIPE
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
RESPONSE: 200
RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}
[2024-05-27 06:22:00 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:143697)
[2024-05-27 06:22:00 +0000] [143697] [INFO] Worker exiting (pid: 143697)

We can see three new interesting error messages in there: WORKER TIMEOUT, database is locked, and Ignoring EPIPE.

Most interestingly, the task says it failed at 08:19:42, but we see no error at that time, only later.

madhurrya commented 4 months ago

This is something I have noticed in the past and noticed again today. Normally I open the Status app only when I want to test the messages; that can be once every few days, or sometimes once a week. I have noticed that most of the time, when I open the Status app, the instances start to error for a while. Yesterday and this morning I created about 100+ instances and none of them errored. I didn't have the Status app open at that time. But later today I opened the Status app and then created those instances (which I shared above), and they started getting the error. This is something I have observed in the past as well. Just sharing what I have observed; not sure whether it's a valid point or not.

These are the ones I ran after starting the Status app, and some of them errored: [screenshot]

I ran 50 instances of this in the morning before opening the Status app without any error. And another 100+ instances yesterday without any error.

![image](https://github.com/status-im/status-go/assets/110527567/6a564c82-d757-404e-9496-8918758fde43)

jakubgs commented 4 months ago

The database is locked and Ignoring EPIPE errors might be a red herring, as I can find only 4 instances of each in the connector logs:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'database is locked' /var/log/docker/spiff-workflow-connector/docker.log | wc -l
4

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'Ignoring EPIPE' /var/log/docker/spiff-workflow-connector/docker.log | wc -l    
4

But the WORKER TIMEOUT certainly seems more prevalent:

jakubgs@node-01.sm-eu-zur1.spiff.app-test:~ % zgrep 'WORKER TIMEOUT' /var/log/docker/spiff-workflow-connector/docker.log | wc -l
10
jakubgs commented 4 months ago

And indeed, if I search for instances of WORKER TIMEOUT, I find an error like this next to it:

POST http://wakunode:8545/
{'jsonrpc': '2.0',
 'method': 'wakuext_sendOneToOneMessage',
 'params': [{'id': 'zQ3shwDqFxGXHvbCeQxTURrtSTCH4UtPHaEDu5LgNMpsjtqdF',
   'message': 'There was an exception running process model manage-procurement/procurement/requisition-order-management/request-goods-services for instance 355.\nOriginal Error:\nWorkflowExecutionServiceError("Error executing Service Task. Received error code \'ReadTimeout\' from service \'waku/SendMessage\' ::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties. ::: The original message: HTTPSConnectionPool(host=\'test.app.spiff.status.im\', port=443): Read timed out. (read timeout=45)")'}],
 'id': 1}

Crucial part:

Error executing Service Task. Received error code \'ReadTimeout\' from service \'waku/SendMessage\'
::: 500 (Internal Server Error) - The service you called is experiencing technical difficulties.
::: The original message: HTTPSConnectionPool(host=\'test.app.spiff.status.im\', port=443): Read timed out. (read timeout=45)")'}],
jakubgs commented 4 months ago

Another instance, started at 06:24:30, failed at 06:25:15: https://test.app.spiff.status.im/i/642

[2024-05-27 06:24:28 +0000] [161614] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:24:30 +0000] [161706] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
[2024-05-27 06:25:12 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:161578)
[2024-05-27 06:25:12 +0000] [161578] [INFO] Worker exiting (pid: 161578)
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
[2024-05-27 06:25:13 +0000] [161742] [INFO] Booting worker with pid: 161742
[2024-05-27 06:25:13 +0000] [161742] [DEBUG] Ignoring EPIPE
POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
RESPONSE: 200
RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"result":{"chats":[{"id":" ... }]}}
[2024-05-27 06:25:58 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:161614)

We can see WORKER TIMEOUT and Ignoring EPIPE go hand in hand.
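
The [CRITICAL] WORKER TIMEOUT lines use the gunicorn log format, which suggests the connector's WSGI workers are being killed while the wakunode RPC call is still in flight; the Ignoring EPIPE lines would then likely be a worker trying to write a response after the caller has already given up. Assuming the connector does run under gunicorn, this is the knob that would control it (a hedged sketch of a gunicorn.conf.py; the values are examples, not the current settings):

# gunicorn.conf.py -- hypothetical sketch, assuming the connector runs under gunicorn.
# Workers silent for longer than `timeout` seconds are killed and restarted,
# which matches the "Worker exiting" / "Booting worker" lines above.
timeout = 120          # example value; gunicorn's default is 30 seconds
graceful_timeout = 30  # time a worker gets to finish after being signalled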

jakubgs commented 4 months ago

Interestingly, neither the Exception nor the HTTPError lines were reached even once, which means the send_message() method in the connector never fails; the failure is somewhere higher up.

jakubgs commented 4 months ago

Here's an example of failure: https://test.app.spiff.status.im/process-instances/misc:qa:send-waku-message/803 And here are the logs:

2024-05-27T15:48:40.801101+00:00 [9] [DEBUG] POST /connector-proxy/v1/do/waku/SendMessage
2024-05-27T15:48:53.401283+00:00 SendMessage(message='=====Waku message test from test.app===', message_type='wakuext_sendOneToOneMessage', recipient=['zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ'])
2024-05-27T15:48:53.401293+00:00 execute START
2024-05-27T15:48:53.401300+00:00 send_message START
2024-05-27T15:48:53.401305+00:00 POST http://wakunode:8545/ {'jsonrpc': '2.0', 'method': 'wakuext_addContact', 'params': [{'id': 'zQ3shYEjHm6xQaQaUSoP2rjcuQKw5nXUQtUT9w893UiYPfZfJ', 'message': None}], 'id': 1}
2024-05-27T15:48:53.401309+00:00 RESPONSE: 200
2024-05-27T15:48:53.401313+00:00 RESPONSE TEXT: {"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"database is locked"}}
2024-05-27T15:48:53.401318+00:00 send_message END
2024-05-27T15:48:53.401322+00:00 execute END
2024-05-27T15:49:40.869893+00:00 [2024-05-27 15:49:40 +0000] [9] [DEBUG] Ignoring EPIPE

There are some other examples of database is locked errors as well. But the interesting part is that it returns right after the request.

There is no delay, so why would it cause a timeout?

jakubgs commented 4 months ago

Possibly relevant issue:

jakubgs commented 4 months ago

After migrating the app-test environment to DigitalOcean in:

The problems have subsided, for now at least. So I'm closing this issue for now.