streamingfast / substreams

Powerful Blockchain streaming data engine, based on StreamingFast Firehose technology.
Apache License 2.0

Substreams fail non-deterministically on Substreams Uniswap V3 #324

Closed maoueh closed 11 months ago

maoueh commented 11 months ago

Here is the Qm hash: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag.

Payne:

Hey dude, I ran into a very weird issue with your UniV3 Substream - it failed non-deterministically and I can't get it back up. Here's the error after rewinding -

```
Oct 05 20:09:28.733 INFO Subgraph error is still ahead of deployment head, nothing to unfail, error_block_hash: 0xcba8ea91284d16df41593536ff2700188d2db15174ce6acbfb6f2085fbf56f76, error_block_range: (Included(18280709), Unbounded), block_hash: 0xa7b944ff2712585509aa993838575962d90ca107f28c61c6847dd078c355e850, block_number: 18280708, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, shard: primary, component: Store
```

Then it fails with

```
Oct 05 20:09:28.854 ERRO Subgraph failed with non-deterministic error: failed to process trigger: Failed to process substreams block: Entity TokenHourData[0x488db574c77dd27a07f9c97bac673bc8e9fc6bf3-471240]: missing value for non-nullable field `periodStartUnix`, retry_delay_s: 140, attempt: 0, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
```

Wiping out the block cache on my side doesn't do anything.

maoueh commented 11 months ago
```
SessionInit { trace_id: "206f069bff5bbb58cfbee72f1c2e7a1c", resolved_start_block: 18280701, linear_handoff_block: 18286400, max_parallel_workers: 10 }
```
Eduard-Voiculescu commented 11 months ago

First blocks where we get data are: 18280694, 18280695, 18280699, 18280709. He gets the issue at block 18280708, so presumably his substreams doesn't see the data before that.

Ran on the Pinax endpoint with:

```
substreams gui https://github.com/streamingfast/substreams-uniswap-v3/releases/download/v0.2.7/substreams.spkg graph_out -e eth.substreams.pinax.network:9000 -s 18280000 -t +1000
```

and the output is the same as what we see on our endpoint.

maoueh commented 11 months ago

There was no cursor when starting the request, blocks received:

sduchesneau commented 11 months ago

What if Payne rewinds up to block 18280690 ?

maoueh commented 11 months ago

He rewound to 18280701, I see that from the logs. Forgot to attach them: 2iSqZNcp.log

maoueh commented 11 months ago

After further investigation, it appears the error is most likely because the graph-node DB is "broken" in some way. This error happens because, most probably, the CREATE entity change was not written correctly to the DB (corruption, WAL problem, etc.). When graph-node arrives at block 18280709, it tries to merge the entity UPDATE it receives with the current entity. The UPDATE did not contain periodStartUnix because this field is present only on the CREATE one.

The merged entity is then checked in `validate`, which leads to an error because it sees that periodStartUnix is not set properly.
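A minimal sketch of the merge-then-validate flow described above, using hypothetical simplified types (graph-node's actual entity representation is more elaborate). If the CREATE that carried `periodStartUnix` never made it to the DB, merging the later UPDATE into an empty current entity cannot produce a valid row:

```rust
use std::collections::HashMap;

// Hypothetical simplified entity: field name -> optional value.
type Entity = HashMap<String, Option<i64>>;

// Merge an UPDATE into the current entity: fields present in the update
// overwrite the current ones, everything else is kept as-is.
fn merge(current: &Entity, update: &Entity) -> Entity {
    let mut merged = current.clone();
    for (k, v) in update {
        merged.insert(k.clone(), v.clone());
    }
    merged
}

// Validate that every non-nullable field has a value.
fn validate(entity: &Entity, non_nullable: &[&str]) -> Result<(), String> {
    for field in non_nullable {
        match entity.get(*field) {
            Some(Some(_)) => {}
            _ => return Err(format!("missing value for non-nullable field `{}`", field)),
        }
    }
    Ok(())
}

fn main() {
    // The CREATE (which carried periodStartUnix) was lost, so the
    // "current" entity the UPDATE is merged into is empty.
    let current: Entity = HashMap::new();
    let mut update: Entity = HashMap::new();
    update.insert("volumeUSD".to_string(), Some(123));

    let merged = merge(&current, &update);
    // Prints an Err for the missing periodStartUnix field.
    println!("{:?}", validate(&merged, &["periodStartUnix"]));
}
```

With the CREATE present in the DB, the same merge would carry `periodStartUnix` forward and validation would pass, which is why the error only shows up when the CREATE is missing entirely.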

sduchesneau commented 11 months ago

Note: since there were other updates of that entity between its creation at block 18280694 and the error at block 18280709 that did not cause an issue, we can determine that substreams had correctly sent the CREATE to graph-node before the corruption event happened, and that this event deleted some data predating the rewind target.

trader-payne commented 11 months ago

Rewound to block 18270600 and it got me past the error; I'm up to chainhead now. Curious why the error triggered in the first place though, since it's non-deterministic too.

trader-payne commented 11 months ago

From David Lutter:

Hmm. Since TokenHourData gets updated throughout each hour, I wonder if it worked because you had to rewind past the hour where the problem happened. IIRC, substreams only sends the changes and since the timestamp doesn't change, it will only be sent when the hour starts. Hadn't thought of that before.
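For context on why the field only appears once: in the Uniswap V3 subgraph schema, the TokenHourData id is (to my understanding) the token address plus the hour index (timestamp / 3600), and periodStartUnix is that index scaled back to seconds, so it only changes when a new hour starts. A rough sketch of that arithmetic (my reconstruction, not the exact subgraph code):

```rust
// Hour bucket the timestamp falls into.
fn hour_index(timestamp: u64) -> u64 {
    timestamp / 3600
}

// periodStartUnix: start of the hour, in seconds. Constant for the whole
// hour, so a diff-based stream only emits it on the CREATE.
fn period_start_unix(timestamp: u64) -> u64 {
    hour_index(timestamp) * 3600
}

// TokenHourData id: "<token address>-<hour index>".
fn token_hour_id(token: &str, timestamp: u64) -> String {
    format!("{}-{}", token, hour_index(timestamp))
}

fn main() {
    // The failing entity id ends in 471240, i.e. hour index 471240,
    // which maps back to periodStartUnix 471240 * 3600 = 1696464000.
    let ts = 1_696_465_000; // hypothetical timestamp inside that hour
    assert_eq!(hour_index(ts), 471240);
    assert_eq!(period_start_unix(ts), 1_696_464_000);
    println!("{}", token_hour_id("0x488db574c77dd27a07f9c97bac673bc8e9fc6bf3", ts));
}
```

This matches Lutter's point: every update within the hour maps to the same id and the same periodStartUnix, so only the first change of the hour carries that field.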

maoueh commented 11 months ago

> Hmm. Since TokenHourData gets updated throughout each hour, I wonder if it worked because you had to rewind past the hour where the problem happened. IIRC, substreams only sends the changes and since the timestamp doesn't change, it will only be sent when the hour starts. Hadn't thought of that before.

It is true that Substreams, when you reconnect at 18280701, would stream only the UPDATE portion. But if the CREATE was part of the database, no problem would have happened. For this error to show up, it means the CREATE was missed completely.

When you rewind, let's say to 18280700, it should rewind only up to this point, not before, and the CREATE from the Substreams should still be in the database.

trader-payne commented 11 months ago

Interesting. So if the CREATE was missed completely, would that be an issue on the graph-node side, or the substreams side?

maoueh commented 11 months ago

Not clear what happened before the rewind; it could be either one. Streaming the Pinax endpoint seems to emit the correct flow of events from what we see, so it seems unlikely that it was that, plus the fact that you needed to rewind.