tier1 config:

start:
  args:
    - substreams-tier1
  flags:
    log-verbosity: 2
    common-live-blocks-addr: dns:///eth-sf-relayer:9000
    common-merged-blocks-store-url: s3://...
    common-one-block-store-url: s3://...
    common-forked-blocks-store-url: s3://...
    substreams-rpc-endpoints: http://eth-erigon:8545
    substreams-state-store-url: s3://....
    substreams-tier1-grpc-listen-addr: :9000
    substreams-tier1-max-subrequests: 400
    substreams-tier1-subrequests-endpoint: localhost:5000 # this is the envoy proxy to tier2
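For reference, a minimal sketch of what that Envoy proxy in front of tier2 could look like, assuming the tier2 instances resolve via the DNS name substreams-tier2 on port 9000 (both names are assumptions, not taken from the setup above). gRPC requires HTTP/2 on the upstream cluster, and subrequests are long-lived streams, so the route timeout is disabled:

static_resources:
  listeners:
    - name: tier2_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 5000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: tier2
                route_config:
                  virtual_hosts:
                    - name: tier2
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: tier2, timeout: 0s } # no timeout: subrequests stream for a long time
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: tier2
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {} # gRPC needs HTTP/2 upstream
      load_assignment:
        cluster_name: tier2
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: substreams-tier2, port_value: 9000 }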
Hi @matthewdarwin, I cannot reproduce this here with v2.3.5 or with the latest (develop). I ran this command:

substreams -e localhost:9000 run --insecure ipfs://QmRwHWApq6SnvEzy3RUBd5j9WmbRVgFQGTsXwmfhh79uj5 graph_out -s 19359900 -t 19360000 --production-mode
Could you try running both nodes on the same machine with a local state store URL, to see if your S3 server might be getting stuck (preventing tier1 from reading the file written by the other node)? That's one of the very few possible causes I can see for the behavior you described...
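For example, only the state store flag needs to change to take S3 out of the picture (the local path here is just an illustration; if I recall the dstore URL conventions correctly, the store URLs accept a file:// scheme as well as plain paths):

flags:
  substreams-state-store-url: file:///data/substreams/state # local disk instead of s3://...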
I did some more debugging just now on the latest antelope-firehose (with all the latest changes from today). I put tier1 and tier2 on the same node and tested the difference between local storage and S3. The problem happens with S3. I set log verbosity to level 4, but there is still no information to debug it.

Do we need some tracing on all the S3 HTTP requests?
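As an illustration of what that tracing could look like (a generic aws-sdk-go sketch, not the store implementation firehose actually uses; the region, endpoint variable, and bucket name are made up), the AWS SDK for Go can already dump every HTTP request and response it makes:

package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Build an S3 client with full HTTP request/response logging, so every
	// call made against the store is visible on stderr.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:           aws.String("us-east-1"),                // assumption
		Endpoint:         aws.String(os.Getenv("S3_ENDPOINT")),   // hypothetical: your S3-compatible server
		S3ForcePathStyle: aws.Bool(true),                         // usually required for non-AWS S3 servers
		LogLevel:         aws.LogLevel(aws.LogDebugWithHTTPBody), // dump each request and response, with bodies
		Logger:           aws.NewDefaultLogger(),
	}))

	client := s3.New(sess)
	out, err := client.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket: aws.String("my-bucket"), // hypothetical bucket
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("listed %d objects", len(out.Contents))
}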
4cc7f8ed86fc278557ac9efbee5db0ae.log
config:

start:
  args:
    - substreams-tier1
    - substreams-tier2
  flags:
    log-verbosity: 4
    log-to-file: false
    common-auth-plugin: ....
    common-metering-plugin: ...
    common-live-blocks-addr: ...
    common-merged-blocks-store-url: s3://...
    common-one-block-store-url: s3://...
    common-forked-blocks-store-url: s3://...
    common-system-shutdown-signal-delay: 10s
    common-auto-mem-limit-percent: 90
    substreams-state-store-url: s3://...
    substreams-tier1-grpc-listen-addr: :9000
Test substream:
substreams run -e wax.substreams.pinax.network:443 notify-actions-v0.2.0.spkg jsonl_out -s -1000 --params=map_logactions="aw_land_id=1099512960477,1099512961456,1099512959292,1099512960182,1099512958801,1099512961178" --production-mode
(see the eosnation dfuse Telegram for the discussion with the impacted user)
Fixed here: https://github.com/streamingfast/substreams/commit/fa91cf14ac98372f64161bef68829ade2ccaeba9, confirmed with @matthewdarwin. Thanks for your help debugging this!
Substreams are still getting stuck. This is reproducible on substreams with no "store" modules. It seems like the handoff between tier2 finishing its jobs and tier1 taking over has some issues. It happens all the time, with many different subgraphs that only have "map" modules.
The test goes like this: it spins up a job to catch up, then when that is done, it gets stuck at "0 jobs" and never continues or exits. After the next restart, it is fine.
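One generic way to see where it hangs at "0 jobs" (a Go-runtime trick, nothing firehose-specific): send SIGQUIT to the process, which makes the Go runtime print the stacks of all goroutines before exiting, showing exactly what the scheduler is blocked on. Assuming the fireeth binary:

kill -QUIT $(pidof fireeth) # full goroutine dump to stderr, then the process exits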
Note: eth-sfst79.mar.eosn.io is an internal tier1 node, so we bypass the load balancer to rule out any related issues. Turning on debug logging doesn't give any more info.
Using the latest firehose-ethereum code here (v2.3.5) with the RPC cache turned off.