graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

[Bug] Rewinding a subgraph causes a constraint violation in graph-node that in turn causes indexer-agent to crashloop #5316

Open cryptovestor21 opened 7 months ago

cryptovestor21 commented 7 months ago

Bug report

graph-node:v0.34.1 indexer-agent:v0.20.22

Activities that were undertaken before observing this bug:

  1. Cleared the call_cache for Arbitrum via psql, as part of troubleshooting subgraph sync performance
  2. Rewound a specific problematic subgraph, Silo Finance Arbitrum (QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW), to block 1 using graphman
  3. Observed the above subgraph sync to ~130m blocks, then stall
  4. Checked the graph-node logs and found a related error (see log output)
  5. Observed indexer-agent reporting the same issue and crash looping; we cannot use the agent at all right now to manage subgraphs (see log output)

IMPACT: Production indexer at risk. We cannot manage our online and offline allocations while this issue persists, so ideally we need a temporary fix for these specific symptoms. Would graphman drop resolve the issue? Would graph-node and indexer-agent be able to handle that and start syncing the subgraph again from scratch, given that this subgraph is in flight with live allocations?

Relevant log output

----- GRAPH-NODE
Apr 04 13:24:34.037 ERRO Subgraph instance failed to run: internal constraint violated: Subgraph writer for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW[sgd622] is not running, sgd: 622, subgraph_id: QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW, component: SubgraphInstanceManager
Apr 04 13:48:18.741 WARN Price provider Removed: 0x8dca64a43865454f41aa1a3cf0140eb89f2c08aa53871235ecbe46b6a309a1e3, data_source: PriceProvidersRepository, sgd: 622, subgraph_id: QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW, component: SubgraphInstanceManager > UserMapping
Apr 04 13:48:18.742 ERRO Oracle was not found when trying to remove it at txn: 0x8dca64a43865454f41aa1a3cf0140eb89f2c08aa53871235ecbe46b6a309a1e3, data_source: PriceProvidersRepository, sgd: 622, subgraph_id: QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW, component: SubgraphInstanceManager > UserMapping

----- INDEXER-AGENT
{"level":50,"time":1712241706593,"pid":1,"hostname":"268ad9e1400b","name":"IndexerAgent","component":"GraphNode","err":{"type":"IndexerError","message":"Failed to query indexing status API","stack":"IndexerError: Failed to query indexing status API\n    at indexerError (/opt/indexer/packages/indexer-common/dist/errors.js:173:12)\n    at GraphNode.<anonymous> (/opt/indexer/packages/indexer-common/dist/graph-node.js:146:55)\n    at Generator.next (<anonymous>)\n    at fulfilled (/opt/indexer/packages/indexer-common/dist/graph-node.js:5:58)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)","code":"IE018","explanation":"https://github.com/graphprotocol/indexer/blob/main/docs/errors.md#ie018","cause":{"type":"CombinedError","message":"[GraphQL] Store error: internal constraint violated: the entityCount for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW is not representable as a u64","name":"CombinedError","graphQLErrors":[{"message":"Store error: internal constraint violated: the entityCount for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW is not representable as a u64"}],"response":{"size":0,"timeout":0}}},"msg":"Failed to query indexing status API"}

IPFS hash

No response

Subgraph name or link to explorer

https://thegraph.com/explorer/subgraphs/2ufoztRpybsgogPVW6j9NTn1JmBWFYPKbP7pAabizADU?view=Overview&chain=arbitrum-one

Some information to help us out

OS information

Linux

leoyvens commented 7 months ago

the entityCount for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW is not representable as a u64

Maybe the rewind somehow turned the entity count negative. Which is a bug of course.
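
To illustrate how a negative count would produce this kind of message, here is a minimal Rust sketch. It is not graph-node's actual code: the names `entity_count_as_u64` and `CountError` are hypothetical, and it assumes the stored counter is a signed value that gets converted to `u64` when the indexing status is queried.

```rust
use std::convert::TryFrom;

/// Hypothetical error type; graph-node's real error types differ.
#[derive(Debug, PartialEq)]
pub enum CountError {
    NotRepresentable { deployment: String, raw: i64 },
}

/// Convert a signed, stored entity count into the unsigned value an API
/// would report. A negative input cannot be represented as a u64, so the
/// conversion fails instead of silently wrapping around.
pub fn entity_count_as_u64(deployment: &str, raw: i64) -> Result<u64, CountError> {
    u64::try_from(raw).map_err(|_| CountError::NotRepresentable {
        deployment: deployment.to_string(),
        raw,
    })
}

/// Assumed bookkeeping step: a rewind subtracts the number of entities it
/// removed from the stored count. If it subtracts more than were counted,
/// the result goes negative.
pub fn count_after_rewind(stored: i64, removed: i64) -> i64 {
    stored - removed
}
```

Under these assumptions, `entity_count_as_u64("QmExample", count_after_rewind(10, 25))` returns an error, while any non-negative count converts cleanly.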

trader-payne commented 7 months ago

@leoyvens I think the problem came from rewinding to block 1 when the startBlock was actually 51880000. That would mean graph-node doesn't handle that scenario, and it created all that chaos.
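
As a sketch of the kind of guard that would reject such a rewind up front, the Rust snippet below validates a rewind target against the subgraph's start block and current head. Everything here is an assumption for illustration: `check_rewind_target` and `RewindError` are hypothetical names, and this thread does not confirm whether graph-node or graphman perform (or should perform) a check like this.

```rust
/// Hypothetical error type for an invalid rewind request.
#[derive(Debug, PartialEq)]
pub enum RewindError {
    /// The target lies before the first block the subgraph indexes.
    BelowStartBlock { target: u64, start_block: u64 },
    /// The target lies beyond the block the subgraph has synced to.
    AheadOfHead { target: u64, head: u64 },
}

/// Accept a rewind target only if it falls within [start_block, head].
/// Returns the validated target block number on success.
pub fn check_rewind_target(
    target: u64,
    start_block: u64,
    head: u64,
) -> Result<u64, RewindError> {
    if target < start_block {
        return Err(RewindError::BelowStartBlock { target, start_block });
    }
    if target > head {
        return Err(RewindError::AheadOfHead { target, head });
    }
    Ok(target)
}
```

With the numbers from this report (start block 51880000, head at roughly 130m), `check_rewind_target(1, 51_880_000, 130_000_000)` would return `BelowStartBlock` rather than letting the rewind proceed.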

github-actions[bot] commented 1 month ago

Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.