streamingfast / substreams

Powerful Blockchain streaming data engine, based on StreamingFast Firehose technology.
Apache License 2.0

substreams stuck switching from backprocessing to live #421

Closed matthewdarwin closed 5 months ago

matthewdarwin commented 6 months ago

Substreams are still getting stuck. This is reproducible on substreams with no "store" modules.

It seems the handoff between running jobs on tier2 and having tier1 take over has some issues. It happens all the time, with many different subgraphs that only have "map" modules.

The test command is:

substreams run -e eth-sfst79.mar.eosn.io:9000 ipfs://QmRwHWApq6SnvEzy3RUBd5j9WmbRVgFQGTsXwmfhh79uj5 graph_out -s 19359900 -t 19360000 --production-mode --plaintext

It spins up a job to catch up; when that is done, it gets stuck with "0 jobs" and never continues or exits. After the next restart, it is fine.

Note that eth-sfst79.mar.eosn.io is an internal tier1 node, so we bypass the load balancer to rule out any related issues.

Connected (trace ID 7e01eecadd1fd8f3e3127713c1a7cc06)
Progress messages received: 262 (2/sec)
Backprocessing history up to requested target block 19359900:
(hit 'm' to switch mode)

Stage 0: map_deposits,graph_out

stage 0 (0 jobs)           19359000  ::  19359000-19360000

  Slowest modules:
    map_deposits -       43ms per block

Turning on debug logging doesn't give any more info.

$ grep 7e01eecadd1fd8f3e3127713c1a7cc06 /var/log/syslog
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.027Z INFO (substreams-tier1.tier1) incoming Substreams Blocks request {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "start_block": 19359900, "stop_block": 19360000, "cursor": "", "modules": ["map_deposits", "graph_out"], "output_module": "graph_out", "output_module_hash": "ab5820fe6e90fe07d221ce6620219fe0c79544cf", "production_mode": true}
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.033Z DEBG (substreams-tier1.tier1) initializing tier1 pipeline {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "plan": "interval=1000, stores=[nil), map_write=[19359000, 19360000), map_read=[19359900, 19360000), linear=[nil)", "request_start_block": 19359900, "resolved_start_block": 19359900, "request_stop_block": 19360000, "request_start_cursor": "", "resolved_cursor": "", "max_parallel_jobs": 400, "output_module": "graph_out"}
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.210Z DEBG (substreams-tier1.tier1) initializing worker pool {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "worker_count": 400}
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.210Z DEBG (substreams-tier1.tier1) starting parallel processing {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06"}
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.210Z INFO (substreams-tier1.tier1) scheduler: waiting for output stream to complete, stores ready {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "cached_output_start": 19359, "cached_output_current": 19359, "cached_output_end": 19359}
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.210Z INFO (substreams-tier1.tier1) scheduling work {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "unit": {"segment": 19359, "stage": 0}}
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.211Z DEBG (substreams-tier1.tier1) loading execout file {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "module": "graph_out", "file_name": "0019359000-0019360000.output", "block_range": {"start_block": 19359000, "exclusive_end_block": 19360000}}
Mar  7 16:50:10 eth-sfst79 sf[339345]: 2024-03-07T16:50:10.211Z INFO (substreams-tier1.tier1) launching remote worker {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "start_block_num": 19359000, "stop_block_num": 19360000, "output_module": "graph_out"}
Mar  7 16:50:55 eth-sfst79 sf[339345]: 2024-03-07T16:50:55.125Z DEBG (substreams-tier1.tier1) job completed {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "unit": {"segment": 19359, "stage": 0}, "number_of_tries": 0, "module_name": ["graph_out"], "duration": 44.91427046, "processing_time_per_block": 0.044914270459999996}
Mar  7 16:50:55 eth-sfst79 sf[339345]: 2024-03-07T16:50:55.125Z INFO (substreams-tier1.tier1) scheduler: waiting for output stream to complete, stores ready {"trace_id": "7e01eecadd1fd8f3e3127713c1a7cc06", "cached_output_start": 19359, "cached_output_current": 19359, "cached_output_end": 19359}

Nothing further appears in the logs.

Using latest firehose-ethereum code here (v2.3.5) with RPC cache turned off.

matthewdarwin commented 6 months ago

tier1 config:

start:
  args:
    - substreams-tier1
  flags:
    log-verbosity: 2
    common-live-blocks-addr: dns:///eth-sf-relayer:9000
    common-merged-blocks-store-url: s3://...
    common-one-block-store-url: s3://...
    common-forked-blocks-store-url: s3://...
    substreams-rpc-endpoints: http://eth-erigon:8545
    substreams-state-store-url: s3://....
    substreams-tier1-grpc-listen-addr: :9000
    substreams-tier1-max-subrequests: 400
    substreams-tier1-subrequests-endpoint: localhost:5000 # this is the envoy proxy to tier2

sduchesneau commented 6 months ago

Hi @matthewdarwin I cannot reproduce here with v2.3.5 or with the latest (develop).

I ran this command:

substreams -e localhost:9000 run --insecure ipfs://QmRwHWApq6SnvEzy3RUBd5j9WmbRVgFQGTsXwmfhh79uj5 graph_out -s 19359900 -t 19360000 --production-mode

Could you try running both nodes on the same machine with a local state store URL, to see if your S3 server is maybe getting stuck (preventing tier1 from reading the file written by the other node)?

That's one of the very few possible causes I can see for the behavior you described...
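
As a rough illustration of that suggestion (not from the original thread): a minimal config sketch that swaps the S3 state store for a local one, reusing the flag names from the configs in this issue. The /data/substreams-state path is hypothetical, and it assumes the state store URL accepts a file:// scheme.

start:
  args:
    - substreams-tier1
    - substreams-tier2
  flags:
    # hypothetical local path in place of the s3:// store, to rule out S3 stalls
    substreams-state-store-url: file:///data/substreams-state
    substreams-tier1-grpc-listen-addr: :9000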

matthewdarwin commented 5 months ago

I did some more debugging just now on the latest antelope-firehose (with all the latest changes from today). I put tier1 and tier2 on the same node and tested the difference between local storage and S3. The problem happens with S3. I set debug to level 4, but there is still no info to debug it.

Do we need some tracing on all the S3 HTTP requests?
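
Not part of the original report, but as a sketch of the kind of tracing being asked for: a small Go http.RoundTripper wrapper that logs each outgoing request with its status and latency. Everything here (the tracingTransport name, the placeholder URL) is made up for illustration and is not the dstore/firehose API; wiring it into the actual store client would be a separate exercise.

package main

import (
	"log"
	"net/http"
	"time"
)

// tracingTransport logs every outgoing HTTP request (e.g. to an S3-compatible
// store) together with its status code and latency. Purely illustrative.
type tracingTransport struct {
	next http.RoundTripper
}

func (t *tracingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)
	elapsed := time.Since(start)
	if err != nil {
		log.Printf("s3-trace %s %s error=%v elapsed=%s", req.Method, req.URL, err, elapsed)
		return nil, err
	}
	log.Printf("s3-trace %s %s status=%d elapsed=%s", req.Method, req.URL, resp.StatusCode, elapsed)
	return resp, nil
}

func main() {
	// Any HTTP client built with this transport has its requests traced.
	client := &http.Client{Transport: &tracingTransport{next: http.DefaultTransport}}
	resp, err := client.Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("request completed")
}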

4cc7f8ed86fc278557ac9efbee5db0ae.log

config:

start:
  args:
    - substreams-tier1
    - substreams-tier2
  flags:
    log-verbosity: 4
    log-to-file: false
    common-auth-plugin: ....
    common-metering-plugin: ...
    common-live-blocks-addr: ...
    common-merged-blocks-store-url: s3://...
    common-one-block-store-url: s3://...
    common-forked-blocks-store-url: s3://...
    common-system-shutdown-signal-delay: 10s
    common-auto-mem-limit-percent: 90
    substreams-state-store-url: s3://...
    substreams-tier1-grpc-listen-addr: :9000

Test substream: substreams run -e wax.substreams.pinax.network:443 notify-actions-v0.2.0.spkg jsonl_out -s -1000 --params=map_logactions="aw_land_id=1099512960477,1099512961456,1099512959292,1099512960182,1099512958801,1099512961178" --production-mode

(see the eosnation dfuse Telegram for the discussion with the impacted user)

sduchesneau commented 5 months ago

Fixed in https://github.com/streamingfast/substreams/commit/fa91cf14ac98372f64161bef68829ade2ccaeba9 and confirmed with @matthewdarwin. Thanks for your help debugging this!