graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

[Bug] Impossible combination of entity operations #5449

Open paymog opened 5 months ago

paymog commented 5 months ago

Bug report

A subgraph that ran without issues on v0.34.0 started failing after we upgraded to v0.35.0. The subgraph is deployed across many networks, and every deployment started failing with this error, which suggests a regression in v0.35.0.

Relevant log output

May 28 13:26:21.945 ERRO Subgraph failed with non-deterministic error: Failed to transact block operations: internal constraint violated: impossible combination of entity operations: Remove { key: EntityKey(SplitRecipient[0x7c29ca34b44d388ab031ecce7781f2420e1e5c99-0xfa9aad02ffede509520e27ef329ee28871a76828-5], cr=0), block: 15264023 } and then Remove { key: EntityKey(SplitRecipient[0x7c29ca34b44d388ab031ecce7781f2420e1e5c99-0xfa9aad02ffede509520e27ef329ee28871a76828-5], cr=0), block: 15267303 }, retry_delay_s: 108, attempt: 0, sgd: 1, subgraph_id: QmcpChELh7eJShPHvG5zLBUYBsBQby9KZ8roh7BrT2Yp5B, component: SubgraphInstanceManager

IPFS hash

QmcpChELh7eJShPHvG5zLBUYBsBQby9KZ8roh7BrT2Yp5B

Subgraph name or link to explorer

No response

Some information to help us out

OS information

Linux

paymog commented 5 months ago

We also see a second error on these subgraphs, though it occurs less reliably than the one above: Failed to transact block operations: internal constraint violated: Batches must go forward. Can't append a batch with block pointer #114200817

paymog commented 5 months ago

This issue seems like it might be related to batching. I'm trying to bisect, but the issue doesn't reproduce reliably on any commit. I thought I had bisected down to 31943fc706c84e8afe4a3677b7cf172339d72461, but then I tested the previous commit and didn't find issues. When I switched back to 31943fc706c84e8afe4a3677b7cf172339d72461, the issue no longer happened. Very unusual. It also doesn't make sense that this would be the offending commit.

paymog commented 5 months ago

Now I'm thinking this might be a subgraph bug that wasn't revealed until we upgraded to v0.35.0.

paymog commented 5 months ago

Setting GRAPH_STORE_WRITE_BATCH_SIZE=0 seems to resolve the issue.

paymog commented 5 months ago

The only batching-related commits I see between 0.34.0 and 0.35.0 enable or disable batching based on whether the subgraph is caught up. In my local testing the subgraph is still catching up, so batching is definitely enabled. Did any other batching changes happen between these two releases? cc @leoyvens @lutter

Alternatively, could there be some changes to the logic that affect loading entities? The subgraph in question has a flow like:

  1. parse list of addresses in event
  2. load the relevant entity
  3. for any addresses that existed in the entity before but do not exist in the current event, use store.remove to remove them
  4. save the entity with the latest list of addresses

Since we see two remove modifications for the same key here, could something be going wrong in step 4 (the entity not being properly saved before committing) or in step 2 (the entity not being properly loaded during a batch)?
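To make the flow above concrete, here is a minimal sketch of it in plain TypeScript. It is not the actual mapping code: `MemoryStore`, `Split`, `handleUpdate`, and the key format are hypothetical stand-ins for the real entities and for the graph-ts store, used only to show when a remove is issued.

```typescript
// Hypothetical in-memory stand-in for the subgraph store (NOT the graph-ts API).
type Split = { id: string; recipients: string[] };

class MemoryStore {
  splits = new Map<string, Split>();
  recipients = new Set<string>();
  removals: string[] = []; // every remove call, in order

  loadSplit(id: string): Split | undefined {
    const s = this.splits.get(id);
    // Return a copy so callers cannot mutate stored state by accident.
    return s ? { id: s.id, recipients: s.recipients.slice() } : undefined;
  }

  saveSplit(split: Split): void {
    this.splits.set(split.id, { id: split.id, recipients: split.recipients.slice() });
    for (const a of split.recipients) this.recipients.add(`${split.id}-${a}`);
  }

  removeRecipient(key: string): void {
    this.removals.push(key);
    this.recipients.delete(key);
  }
}

function handleUpdate(store: MemoryStore, splitId: string, eventAddresses: string[]): void {
  // 1. parse the list of addresses in the event
  const current = new Set(eventAddresses.map((a) => a.toLowerCase()));

  // 2. load the relevant entity (or start with an empty one)
  const loaded = store.loadSplit(splitId);
  const split: Split = loaded !== undefined ? loaded : { id: splitId, recipients: [] };

  // 3. remove addresses that existed before but are absent from this event
  for (const old of split.recipients) {
    if (!current.has(old)) store.removeRecipient(`${splitId}-${old}`);
  }

  // 4. save the entity with the latest list of addresses
  split.recipients = Array.from(current);
  store.saveSplit(split);
}

// Three events for the same split: the second one drops 0xB.
const store = new MemoryStore();
handleUpdate(store, "split1", ["0xA", "0xB"]);
handleUpdate(store, "split1", ["0xA"]);
handleUpdate(store, "split1", ["0xA"]);
console.log(store.removals);
```

When steps 2 and 4 behave, the dropped address is removed exactly once. If the write in step 4 were not visible to a later load in step 2, the third event would still see the old list and issue a second remove for the same key, which would match the two Remove operations in the log. That is only the hypothesis raised above, not a confirmed cause.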