Failed reflecting node_contract updates could not serialize access

sameh-farouk commented 3 months ago

What happened?

Processor instances across all networks keep restarting with the error: "Failed reflecting node_contract updates could not serialize access due to concurrent update"

Original issue: https://github.com/threefoldtech/tfchain_graphql/issues/191

Which network/s did you face the problem on?

Dev, QA, Test, Main

Twin ID/s

No response

Version

No response

Node ID/s

No response

Farm ID/s

No response

Contract ID/s

No response

Relevant log output

{"level":2,"time":1723610115263,"ns":"sqd:processor","msg":"13789842 / 13789842, rate: 0 blocks/sec, mapping: 8 blocks/sec, 83 items/sec, ingest: 47 blocks/sec, eta: 0s"}
{"level":2,"time":1723610125306,"ns":"sqd:processor","msg":"13789843 / 13789843, rate: 0 blocks/sec, mapping: 8 blocks/sec, 91 items/sec, ingest: 46 blocks/sec, eta: 0s"}
{"level":2,"time":1723610130999,"ns":"sqd:processor","msg":"13789844 / 13789844, rate: 0 blocks/sec, mapping: 4 blocks/sec, 55 items/sec, ingest: 45 blocks/sec, eta: 0s"}
{"level":2,"time":1723610139527,"ns":"sqd:processor","msg":"13789845 / 13789845, rate: 0 blocks/sec, mapping: 1 blocks/sec, 25 items/sec, ingest: 46 blocks/sec, eta: 0s"}
{"level":5,"time":1723610164513,"ns":"sqd:processor","err":{"query":"INSERT INTO \"node_contract\"(\"id\", \"grid_version\", \"contract_id\", \"twin_id\", \"node_id\", \"deployment_data\", \"deployment_hash\", \"number_of_public_i_ps\", \"state\", \"created_at\", \"solution_provider_id\", \"resources_used_id\") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, DEFAULT) ON CONFLICT ( \"id\" ) DO UPDATE SET \"id\" = EXCLUDED.\"id\", \"grid_version\" = EXCLUDED.\"grid_version\", \"contract_id\" = EXCLUDED.\"contract_id\", \"twin_id\" = EXCLUDED.\"twin_id\", \"node_id\" = EXCLUDED.\"node_id\", \"deployment_data\" = EXCLUDED.\"deployment_data\", \"deployment_hash\" = EXCLUDED.\"deployment_hash\", \"number_of_public_i_ps\" = EXCLUDED.\"number_of_public_i_ps\", \"state\" = EXCLUDED.\"state\", \"created_at\" = EXCLUDED.\"created_at\", \"solution_provider_id\" = EXCLUDED.\"solution_provider_id\"","parameters":["0013789691-000713-fab74",4,"611406",10972,888,"{\"version\":3,\"type\":\"network\",\"name\":\"example_c124_network\",\"projectName\":\"vm/group_c\"}","0e549bca93f361471fcd7a22595fc3a2",0,"Deleted","1723609194",0],"driverError":{"length":221,"name":"error","severity":"ERROR","code":"P0001","where":"PL/pgSQL function reflect_node_contract_changes() line 27 at RAISE","file":"pl_exec.c","line":"3909","routine":"exec_stmt_raise","stack":"error: failed reflecting node_contract updates could not serialize access due to concurrent update\n    at /squid/node_modules/pg/lib/client.js:526:17\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async PostgresQueryRunner.query (/squid/node_modules/typeorm/driver/postgres/PostgresQueryRunner.js:184:25)\n    at async InsertQueryBuilder.execute (/squid/node_modules/typeorm/query-builder/InsertQueryBuilder.js:106:33)\n    at async Store.upsert (/squid/node_modules/@subsquid/typeorm-store/lib/store.js:34:13)\n    at async nodeContractCanceled (/squid/lib/mappings/contracts.js:239:5)\n    at 
async /squid/lib/processor.js:138:13\n    at async TypeormDatabase.runTransaction (/squid/node_modules/@subsquid/typeorm-store/lib/database.js:110:13)\n    at async TypeormDatabase.transact (/squid/node_modules/@subsquid/typeorm-store/lib/database.js:64:24)\n    at async Runner.process (/squid/node_modules/@subsquid/substrate-processor/lib/processor/runner.js:117:17)"},"length":221,"severity":"ERROR","code":"P0001","where":"PL/pgSQL function reflect_node_contract_changes() line 27 at RAISE","file":"pl_exec.c","line":"3909","routine":"exec_stmt_raise","stack":"QueryFailedError: failed reflecting node_contract updates could not serialize access due to concurrent update\n    at PostgresQueryRunner.query (/squid/node_modules/typeorm/driver/postgres/PostgresQueryRunner.js:219:19)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async InsertQueryBuilder.execute (/squid/node_modules/typeorm/query-builder/InsertQueryBuilder.js:106:33)\n    at async Store.upsert (/squid/node_modules/@subsquid/typeorm-store/lib/store.js:34:13)\n    at async nodeContractCanceled (/squid/lib/mappings/contracts.js:239:5)\n    at async /squid/lib/processor.js:138:13\n    at async TypeormDatabase.runTransaction (/squid/node_modules/@subsquid/typeorm-store/lib/database.js:110:13)\n    at async TypeormDatabase.transact (/squid/node_modules/@subsquid/typeorm-store/lib/database.js:64:24)\n    at async Runner.process (/squid/node_modules/@subsquid/substrate-processor/lib/processor/runner.js:117:17)"}}
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Omarabdul3ziz commented 2 months ago

Update:

We checked the processor status on the dev/main instances and everything appears to be working now: the processor crashed, restarted, and has run fine since then.

It looks like the issue occurred because multiple triggered functions were trying to update the same row in the resources_cache table. I applied a lock on each row for update, which blocks other transactions until the lock is released. Also, instead of throwing the exception, I now log it, so if the update still fails after that, the processor is not affected.
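To make the two parts of that fix concrete, here is a minimal in-process analogy in Python. It is not the actual PL/pgSQL trigger: the `cache` dict stands in for the resources_cache table, a `threading.Lock` per key stands in for the database row lock, and the names `reflect_change` and `run_concurrently` are hypothetical. It only sketches the idea that writers to the same row queue up behind a lock, and that a failure is logged rather than raised to the caller.

```python
import logging
import threading
from collections import defaultdict

logger = logging.getLogger("resources_cache")

# Hypothetical stand-in for the resources_cache table: node_id -> counter.
cache = {888: 0}

# One lock per row, created on demand. Stands in for a database row lock.
_row_locks = defaultdict(threading.Lock)
_row_locks_guard = threading.Lock()


def _lock_for(node_id):
    # Guard the dict so two threads never create different locks for one row.
    with _row_locks_guard:
        return _row_locks[node_id]


def reflect_change(node_id, delta):
    """Apply delta to one row. Return True on success, False on failure."""
    try:
        with _lock_for(node_id):  # other writers to this row wait here
            current = cache[node_id]          # read
            cache[node_id] = current + delta  # modify and write back
        return True
    except Exception:
        # Log instead of re-raising, so the caller keeps running.
        logger.exception("failed reflecting update for node %s", node_id)
        return False


def run_concurrently(node_id, writers, updates_each):
    """Start several writers against one row and return its final value."""
    def worker():
        for _ in range(updates_each):
            reflect_change(node_id, 1)

    threads = [threading.Thread(target=worker) for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache[node_id]
```

Calling `run_concurrently(888, 8, 1000)` lets eight writers update the same row without losing updates, and `reflect_change(999, 1)` on a missing row returns `False` and logs the error instead of propagating it. One consequence of logging rather than raising is that a failed update leaves the row stale without stopping the caller, which is why the logs still need watching after such a change.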

This fix is in version v0.15.14, which is now deployed on devnet. I will keep an eye on the logs with ops to validate the fix, and proceed to the other networks if all works well.

rawdaGastan commented 1 month ago

Verifications

I think the problem is solved now. We are on v0.15.18.
