Open ross-spencer opened 11 months ago
Some relevant context https://github.com/yesodweb/wai/issues/825
Asked to recreate this I simply setup a bash script to submit transactions and keep track of file handles:
#! /usr/bin/bash
count=1000
for i in $(seq $count); do
echo "============"
echo "iteration $i"
echo "============"
ID=$(uuid -v4)
echo $ID
echo $(json-fs-store-cli insert-fact-statement --db $JS_STORE_DIR/json-store.db --fact_statement_id "$ID" --json '{"@context": "https://schema.org", "type": "PropertyValue", "name": "TEST1/TEST2", "value": 0.3142, "valueReference": [{"@type": "PropertyValue", "name": "validFrom", "value": 1705404151}, {"@type": "PropertyValue", "name": "validThrough", "value": 1705404451}], "ArklyID": "4343235324235235432453FAC710", "SecurityToken": "32R43243223432432"} ')
REQ=$(cat <<EOF
{
"fsInfos": [
{
"fsId": "$(echo -ne $ID | base64)",
"gcAfter": {
"extended": "NEG_INF"
}
}
],
"submitter": {
"base16": "$SUBMITTER_PKH"
}
}
EOF
)
COOP_PROTO=coop-proto
RESP=$(echo $REQ | grpcurl -insecure -import-path $COOP_PROTO \
-proto $COOP_PROTO/publisher-service.proto -d @ \
localhost:5080 coop.publisher.Publisher/createMintFsTx
)
echo "fs handles ($i): " $(lsof | grep "json-fs" | wc -l && date)
echo "publisher handles ($i)" $(lsof | grep "coop-publisher" | wc -l && date)
echo "coop-pab handles ($i)" $(lsof | grep "coop-pab" | wc -l && date)
echo $(sleep 120)
echo "done"
done
We basically see the publisher grpc or the local fact store fail in different ways:
============
iteration 511
============
3a69e849-4d49-4072-ae96-4610c046ac31
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "3a69e849-4d49-4072-ae96-4610c046ac31",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
Failed to dial target host "localhost:5080": read tcp [::1]:33474->[::1]:5080: read: connection reset by peer
fs handles (iter. 511): 543 Thu Jan 18 05:31:14 UTC 2024
publisher handles (iter. 511) 0 Thu Jan 18 05:31:14 UTC 2024
coop-pab handles (iter. 511) 34 Thu Jan 18 05:31:14 UTC 2024
^^^
Network.Socket.accept: resource exhausted (Too many open files)
[coop-env ~ coop.0] $ coop-publisher-cli publisher-grpc
Second failure:
============
iteration 136
============
d85ec7ff-a3eb-4d96-bbc3-d8b1013fb9c9
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "d85ec7ff-a3eb-4d96-bbc3-d8b1013fb9c9",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
ERROR:
Code: Internal
Message: Network.Socket.connect: <socket: 1023>: does not exist (Connection refused)
fs handles (iter. 136): 0 Fri Jan 19 14:02:33 UTC 2024
publisher handles (iter. 136) 1 Fri Jan 19 14:02:33 UTC 2024
coop-pab handles (iter. 136) 34 Fri Jan 19 14:02:33 UTC 2024
done
^^^
SELECT "t0"."fact_statement_id" AS "res0", "t0"."json" AS "res1" FROM "fact_statements" AS "t0" WHERE ("t0"."fact_statement_id")=(?);
-- With values: [SQLBlob "3a4c52d6-aed1-4f17-82c9-4e20238fd4cc"]
Network.Socket.accept: resource exhausted (Too many open files)
After setting ulimit -n 65535
we see the failures again but slightly different messages:
============
iteration 508
============
ccfdc2f7-6e5b-433a-a9df-53d1733a2298
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "ccfdc2f7-6e5b-433a-a9df-53d1733a2298",SQLBlob "{\"@context\": \"htt
ps://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom
\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"Securi
tyToken\": \"32R43243223432432\"} "]
fs handles (iter. 508): 1052 Tue Jan 23 10:09:10 UTC 2024
publisher handles (iter. 508) 1 Tue Jan 23 10:09:11 UTC 2024
coop-pab handles (iter. 508) 35 Tue Jan 23 10:09:11 UTC 2024
done
============
iteration 509
============
53543f1b-4b91-46cf-bbb5-8db68fe954fc
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "53543f1b-4b91-46cf-bbb5-8db68fe954fc",SQLBlob "{\"@context\": \"htt
ps://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom
\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"Securi
tyToken\": \"32R43243223432432\"} "]
fs handles (iter. 509): 1053 Tue Jan 23 10:11:21 UTC 2024
publisher handles (iter. 509) 1 Tue Jan 23 10:11:22 UTC 2024
coop-pab handles (iter. 509) 35 Tue Jan 23 10:11:22 UTC 2024
============
iteration 510
============
41c14ca9-39ae-4c54-8708-bb8ef0aebce4
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "41c14ca9-39ae-4c54-8708-bb8ef0aebce4",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
ERROR:
Code: Unavailable
Message: error reading from server: EOF
^^^^
"Sending CreateMintFsTxReq to TxBuilder: {fact_statements { fs_id: \"41c14ca9-39ae-4c54-8708-bb8ef0aebce4\" fs { pdmap { kvs { key { pdbytes: \"@context\" } value { pdbytes: \"https://schema.org\" } } kvs { key { pdbytes: \"ArklyID\" } value { pdbytes: \"4343235324235235432453FAC710\" } } kvs { key { pdbytes: \"SecurityToken\" } value { pdbytes: \"32R43243223432432\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"TEST1/TEST2\" } } kvs { key { pdbytes: \"type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"value\" } value { pdconstr { index: 3 fields { pdint: 3142 } fields { pdint: 18446744073709551612 } } } } kvs { key { pdbytes: \"valueReference\" } value { pdlist { elements { pdmap { kvs { key { pdbytes: \"@type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"validFrom\" } } kvs { key { pdbytes: \"value\" } value { pdint: 1705404151 } } } } elements { pdmap { kvs { key { pdbytes: \"@type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"validThrough\" } } kvs { key { pdbytes: \"value\" } value { pdint: 1705404451 } } } } } } } } } gc_after { } } submitter { base16: \"bedfff1b6762f7a95437ab8ca30b3244dfe44029125d23656062a8b6\" }}"
coop-publisher-cli: file descriptor 1024 out of range for select (0--1024).
Recompile with -threaded to work around this.
============
iteration 1
============
c9bacac0-80a6-4422-a48c-7d5e786f95ce
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "c9bacac0-80a6-4422-a48c-7d5e786f95ce",SQLBlob "$TEST_STMT"]
fs handles (iter. 1): 1058 Wed Jan 24 09:24:24 UTC 2024
publisher handles (iter. 1) 1 Wed Jan 24 09:24:24 UTC 2024
coop-pab handles (iter. 1) 34 Wed Jan 24 09:24:24 UTC 2024
^C
(venv) [coop-env ~ coop.0] $ nano XTEST.sh
(venv) [coop-env ~ coop.0] $ ./XTEST.sh
============
iteration 1
============
467c3877-eac8-4edd-af6f-5be40211fee7
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "467c3877-eac8-4edd-af6f-5be40211fee7",SQLBlob "$TEST_STMT"]
ERROR:
Code: Internal
Message: HandshakeFailed (Error_Misc "Network.Socket.recvBuf: resource vanished (Connection reset by peer)")
^^^
SELECT "t0"."fact_statement_id" AS "res0", "t0"."json" AS "res1" FROM "fact_statements" AS "t0" WHERE ("t0"."fact_statement_id")=(?);
-- With values: [SQLBlob "c9bacac0-80a6-4422-a48c-7d5e786f95ce"]
json-fs-store-cli: file descriptor 1029 out of range for select (0--1024).
Recompile with -threaded to work around this.
We're seeing the json-fs-store-cli
handles going up consistently (perhaps there's something being missed when closing the sqlite connection?) -- as for the other file handles, there doesn't seem to be a correlation in what I can gather with these techniques; other than it doesn't take much more than 1000 iterations to see the two different failures.
Let me know if there's anything else I can test/look for.
We've been running COOP as a long-running service for a decent while now. Every so often, we're seeing services stop with too many open file handles:
Network.Socket.accept: resource exhausted (Too many open files)
:Are there any known improvements in the underlying GRPC libraries that we can leverage here to prevent this? Or is there something else we can see in the code that might have been missed preventing file handles from being closed?