mlabs-haskell / cardano-open-oracle-protocol

COOP - Cardano open oracle protocol
Apache License 2.0
21 stars 3 forks source link

GRPC file handles not closing #86

Open ross-spencer opened 8 months ago

ross-spencer commented 8 months ago

We've been running COOP as a long-running service for a decent while now. Every so often, we're seeing services stop with too many open file handles: Network.Socket.accept: resource exhausted (Too many open files):

Are there any known improvements in the underlying GRPC libraries that we can leverage here to prevent this? Or is there something else we can see in the code that might have been missed preventing file handles from being closed?

bladyjoker commented 8 months ago

Some relevant context https://github.com/yesodweb/wai/issues/825

ross-spencer commented 7 months ago

Asked to recreate this I simply setup a bash script to submit transactions and keep track of file handles:

#! /usr/bin/bash

count=1000
for i in $(seq $count); do

echo "============"
echo "iteration $i"
echo "============"

ID=$(uuid -v4)

echo $ID

echo $(json-fs-store-cli insert-fact-statement --db $JS_STORE_DIR/json-store.db --fact_statement_id "$ID" --json '{"@context": "https://schema.org", "type": "PropertyValue", "name": "TEST1/TEST2", "value": 0.3142, "valueReference": [{"@type": "PropertyValue", "name": "validFrom", "value": 1705404151}, {"@type": "PropertyValue", "name": "validThrough", "value": 1705404451}], "ArklyID": "4343235324235235432453FAC710", "SecurityToken": "32R43243223432432"} ')

REQ=$(cat <<EOF
    {
        "fsInfos": [
            {
                "fsId": "$(echo -ne $ID | base64)",
                "gcAfter": {
                    "extended": "NEG_INF"
                }
            }
        ],
        "submitter": {
            "base16": "$SUBMITTER_PKH"
        }
    }
EOF
)

COOP_PROTO=coop-proto

RESP=$(echo $REQ | grpcurl -insecure -import-path $COOP_PROTO \
 -proto $COOP_PROTO/publisher-service.proto -d @ \
 localhost:5080 coop.publisher.Publisher/createMintFsTx
)

echo "fs handles ($i): " $(lsof | grep "json-fs" | wc -l && date)
echo "publisher handles ($i)" $(lsof | grep "coop-publisher" | wc -l && date)
echo "coop-pab handles ($i)" $(lsof | grep "coop-pab" | wc -l && date)

echo $(sleep 120)

echo "done"

done

We basically see the publisher grpc or the local fact store fail in different ways:

============
iteration 511
============
3a69e849-4d49-4072-ae96-4610c046ac31
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "3a69e849-4d49-4072-ae96-4610c046ac31",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
Failed to dial target host "localhost:5080": read tcp [::1]:33474->[::1]:5080: read: connection reset by peer
fs handles (iter. 511):  543 Thu Jan 18 05:31:14 UTC 2024
publisher handles (iter. 511) 0 Thu Jan 18 05:31:14 UTC 2024
coop-pab handles (iter. 511) 34 Thu Jan 18 05:31:14 UTC 2024

^^^
Network.Socket.accept: resource exhausted (Too many open files)                                                                                                        
[coop-env ~ coop.0] $ coop-publisher-cli publisher-grpc      

Second failure:

============
iteration 136
============
d85ec7ff-a3eb-4d96-bbc3-d8b1013fb9c9
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "d85ec7ff-a3eb-4d96-bbc3-d8b1013fb9c9",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
ERROR:
  Code: Internal
  Message: Network.Socket.connect: <socket: 1023>: does not exist (Connection refused)
fs handles (iter. 136):  0 Fri Jan 19 14:02:33 UTC 2024
publisher handles (iter. 136) 1 Fri Jan 19 14:02:33 UTC 2024
coop-pab handles (iter. 136) 34 Fri Jan 19 14:02:33 UTC 2024

done

^^^
SELECT "t0"."fact_statement_id" AS "res0", "t0"."json" AS "res1" FROM "fact_statements" AS "t0" WHERE ("t0"."fact_statement_id")=(?);
-- With values: [SQLBlob "3a4c52d6-aed1-4f17-82c9-4e20238fd4cc"]
Network.Socket.accept: resource exhausted (Too many open files)

After setting ulimit -n 65535 we see the failures again but slightly different messages:

============                                                                                                                                                           
iteration 508                                                                                                                                                          
============                                                                                                                                                           
ccfdc2f7-6e5b-433a-a9df-53d1733a2298                                                                                                                                   
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "ccfdc2f7-6e5b-433a-a9df-53d1733a2298",SQLBlob "{\"@context\": \"htt
ps://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom
\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"Securi
tyToken\": \"32R43243223432432\"} "]                                                                                                                                   
fs handles (iter. 508):  1052 Tue Jan 23 10:09:10 UTC 2024                                                                                                                   
publisher handles (iter. 508) 1 Tue Jan 23 10:09:11 UTC 2024                                                                                                                 
coop-pab handles (iter. 508) 35 Tue Jan 23 10:09:11 UTC 2024                                                                                                                 

done                                                                                                                                                                   
============                                                                                                                                                           
iteration 509                                                                                                                                                          
============                                                                                                                                                           
53543f1b-4b91-46cf-bbb5-8db68fe954fc                                                                                                                                   
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "53543f1b-4b91-46cf-bbb5-8db68fe954fc",SQLBlob "{\"@context\": \"htt
ps://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom
\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"Securi
tyToken\": \"32R43243223432432\"} "]                                                                                                                                   
fs handles (iter. 509):  1053 Tue Jan 23 10:11:21 UTC 2024                                                                                                                   
publisher handles (iter. 509) 1 Tue Jan 23 10:11:22 UTC 2024                                                                                                                 
coop-pab handles (iter. 509) 35 Tue Jan 23 10:11:22 UTC 2024      

============
iteration 510
============
41c14ca9-39ae-4c54-8708-bb8ef0aebce4
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "41c14ca9-39ae-4c54-8708-bb8ef0aebce4",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
ERROR:
  Code: Unavailable
  Message: error reading from server: EOF

^^^^
"Sending CreateMintFsTxReq to TxBuilder: {fact_statements { fs_id: \"41c14ca9-39ae-4c54-8708-bb8ef0aebce4\" fs { pdmap { kvs { key { pdbytes: \"@context\" } value { pdbytes: \"https://schema.org\" } } kvs { key { pdbytes: \"ArklyID\" } value { pdbytes: \"4343235324235235432453FAC710\" } } kvs { key { pdbytes: \"SecurityToken\" } value { pdbytes: \"32R43243223432432\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"TEST1/TEST2\" } } kvs { key { pdbytes: \"type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"value\" } value { pdconstr { index: 3 fields { pdint: 3142 } fields { pdint: 18446744073709551612 } } } } kvs { key { pdbytes: \"valueReference\" } value { pdlist { elements { pdmap { kvs { key { pdbytes: \"@type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"validFrom\" } } kvs { key { pdbytes: \"value\" } value { pdint: 1705404151 } } } } elements { pdmap { kvs { key { pdbytes: \"@type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"validThrough\" } } kvs { key { pdbytes: \"value\" } value { pdint: 1705404451 } } } } } } } } } gc_after { } } submitter { base16: \"bedfff1b6762f7a95437ab8ca30b3244dfe44029125d23656062a8b6\" }}"
coop-publisher-cli: file descriptor 1024 out of range for select (0--1024).
Recompile with -threaded to work around this.
============
iteration 1
============
c9bacac0-80a6-4422-a48c-7d5e786f95ce
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "c9bacac0-80a6-4422-a48c-7d5e786f95ce",SQLBlob "$TEST_STMT"]
fs handles (iter. 1):  1058 Wed Jan 24 09:24:24 UTC 2024
publisher handles (iter. 1) 1 Wed Jan 24 09:24:24 UTC 2024
coop-pab handles (iter. 1) 34 Wed Jan 24 09:24:24 UTC 2024
^C
(venv) [coop-env ~ coop.0] $ nano XTEST.sh 
(venv) [coop-env ~ coop.0] $ ./XTEST.sh 
============
iteration 1
============
467c3877-eac8-4edd-af6f-5be40211fee7
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "467c3877-eac8-4edd-af6f-5be40211fee7",SQLBlob "$TEST_STMT"]
ERROR:
  Code: Internal
  Message: HandshakeFailed (Error_Misc "Network.Socket.recvBuf: resource vanished (Connection reset by peer)")

^^^
SELECT "t0"."fact_statement_id" AS "res0", "t0"."json" AS "res1" FROM "fact_statements" AS "t0" WHERE ("t0"."fact_statement_id")=(?);
-- With values: [SQLBlob "c9bacac0-80a6-4422-a48c-7d5e786f95ce"]
json-fs-store-cli: file descriptor 1029 out of range for select (0--1024).
Recompile with -threaded to work around this.

We're seeing the json-fs-store-cli handles going up consistently (perhaps there's something being missed when closing the sqlite connection?) -- as for the other file handles, there doesn't seem to be a correlation in what I can gather with these techniques; other than it doesn't take much more than 1000 iterations to see the two different failures.

Let me know if there's anything else I can test/look for.