graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

[Bug] graph-node temporary outage with `store error: NotFound` #4739

Open SozinM opened 1 year ago

SozinM commented 1 year ago

Bug report

We are seeing an error (log attached below) while running subgraphs, which causes syncing to stop temporarily: [image]

The error seems to be linked to a store problem, but we can't find anything wrong. All metrics look good for both PostgreSQL and graph-node.

99% of the time the node works absolutely fine. Also, this problem occurs on all subgraphs at the same time.

Graph Node version is v0.31.0. Docker image is graphprotocol/graph-node:v0.31.0.

Relevant log output

Jul 04 08:23:51.258 DEBG Block stream produced a non-fatal error, error: store error: NotFound, sgd: 319, subgraph_id: QmQy5znQNDrZiVq8L7o3zFaTadiQz3tzgfXASBLz2WFgXL, component: SubgraphInstanceManager
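For anyone trying to line these errors up with database metrics, here is a minimal sketch that pulls the timestamp, error, and subgraph ID out of lines in the format above and counts them per minute. The names `StoreErrorEvent`, `parse_store_error`, and `count_per_minute` are hypothetical helpers, not part of graph-node, and the parsing assumes the comma-separated layout of the single line quoted here.

```rust
use std::collections::HashMap;

/// Fields pulled out of one log line that reports a store error.
/// Hypothetical helper type, not part of graph-node.
#[derive(Debug, PartialEq)]
struct StoreErrorEvent {
    timestamp: String,
    error: String,
    subgraph_id: String,
}

/// Parse a line in the format quoted above; returns None for any other line.
fn parse_store_error(line: &str) -> Option<StoreErrorEvent> {
    let marker = "error: store error: ";
    let start = line.find(marker)? + marker.len();
    let error = line[start..].split(',').next()?.trim().to_string();

    let id_marker = "subgraph_id: ";
    let id_start = line.find(id_marker)? + id_marker.len();
    let subgraph_id = line[id_start..].split(',').next()?.trim().to_string();

    // Assumption: the timestamp is the first three whitespace-separated tokens.
    let timestamp = line.split_whitespace().take(3).collect::<Vec<_>>().join(" ");

    Some(StoreErrorEvent { timestamp, error, subgraph_id })
}

/// Count store errors per minute so spikes can be compared with dashboards.
fn count_per_minute<'a>(lines: impl Iterator<Item = &'a str>) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for event in lines.filter_map(parse_store_error) {
        // Drop the trailing ":SS.mmm" to get a minute bucket.
        let minute = event.timestamp.rsplitn(2, ':').last().unwrap_or("").to_string();
        *counts.entry(minute).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let line = "Jul 04 08:23:51.258 DEBG Block stream produced a non-fatal error, error: store error: NotFound, sgd: 319, subgraph_id: QmQy5znQNDrZiVq8L7o3zFaTadiQz3tzgfXASBLz2WFgXL, component: SubgraphInstanceManager";
    println!("{:?}", parse_store_error(line));
    println!("{:?}", count_per_minute(std::iter::once(line)));
}
```

Feeding the full log through `count_per_minute` gives per-minute buckets that can be compared against the PostgreSQL transaction and cache graphs.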

IPFS hash

No response

Subgraph name or link to explorer

No response

Some information to help us out

OS information

None

SozinM commented 1 year ago

Also, so far I have found that cache misses correlate with the moments of temporary outage: [image] [image]

SozinM commented 1 year ago

It seems the reason for this behavior is that Graph Node sometimes starts writing a LOT of transactions to PostgreSQL: [image]

NOTE: this picture does not correlate with the pictures above. They are from different days.

SozinM commented 1 year ago

Also, from the source code it seems that the error being captured is: StoreError::Unknown(DieselError::NotFound)
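To illustrate how a low-level "row not found" result could surface as `store error: NotFound`, here is a minimal, self-contained sketch. The `DieselError` and `StoreError` enums below are simplified stand-ins written for this example, not the real Diesel or graph-node definitions, and `load_single_row` and `lookup` are hypothetical functions.

```rust
use std::fmt;

// Stand-in for Diesel's error type; only the variant discussed here.
#[derive(Debug)]
enum DieselError {
    NotFound,
}

impl fmt::Display for DieselError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DieselError::NotFound => write!(f, "NotFound"),
        }
    }
}

// Simplified stand-in for graph-node's StoreError (assumption: a catch-all
// variant wraps database errors that have no dedicated variant).
#[derive(Debug)]
enum StoreError {
    Unknown(DieselError),
}

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StoreError::Unknown(e) => write!(f, "store error: {}", e),
        }
    }
}

impl From<DieselError> for StoreError {
    fn from(e: DieselError) -> Self {
        StoreError::Unknown(e)
    }
}

// Hypothetical query that expects exactly one matching row.
fn load_single_row(rows: &[u32], wanted: u32) -> Result<u32, DieselError> {
    rows.iter()
        .copied()
        .find(|r| *r == wanted)
        .ok_or(DieselError::NotFound)
}

// The `?` operator converts DieselError into StoreError via `From`.
fn lookup(rows: &[u32], wanted: u32) -> Result<u32, StoreError> {
    Ok(load_single_row(rows, wanted)?)
}

fn main() {
    match lookup(&[1, 2, 3], 4) {
        Ok(row) => println!("found {}", row),
        Err(e) => println!("{}", e), // prints "store error: NotFound"
    }
}
```

In this sketch the error arises only when the requested row is absent, whatever the load on the database.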

azf20 commented 1 year ago

Hey @SozinM, what version of Graph Node are you running?

SozinM commented 1 year ago

Hi @azf20! Sorry for missing that. We are using the Docker image graphprotocol/graph-node:v0.31.0, which corresponds to the latest released version.

azf20 commented 1 year ago

Thanks - how many subgraphs are you running on your Graph Node?

SozinM commented 1 year ago

I think it's about 50 on this one

SozinM commented 1 year ago

@azf20 any ideas?

azf20 commented 1 year ago

cc @lutter the intermittent store issue, in case this helps

PekopT commented 1 year ago

@lutter @azf20 bump please, we're struggling with reliability

azf20 commented 1 year ago

I wonder if there is something specific with the subgraphs or networks you are running which is causing this (particularly given the spikes in transactions)? Do you see this on all instances or just in one case?

SozinM commented 1 year ago

We see it on multiple indexers, and I assume it is directly linked to the load on Postgres. We first saw this behavior on heavily loaded indexers. Also, we store the block cache together with the subgraph data in the same Postgres instance.

lutter commented 1 year ago

NotFound indicates that graph-node tried to look up a row that does not exist; it shouldn't be dependent on load. Can you check your postgres logs for any errors?

SozinM commented 11 months ago

I was monitoring PostgreSQL and did not see any anomalies. The only errors I saw were about connections being dropped because the consumer did not respond (or something along those lines). @PekopT @balakhonoff, please check the database again.