substreams-tier1 does not handle well the case that tier2 runs out of RAM and crashes

matthewdarwin commented 1 year ago

substreams-tier1 does not handle well the case where tier2 runs out of RAM and crashes.

The tier2 container runs out of RAM and crashes, and then tier1 reports:

INFO (substreams-tier1.tier1) job failed {"trace_id": "de189604f399ffcb71b198ae1ca1a17c", "job": {"module_name": "kv_out_transaction_traces", "start_block": 112163000, "end_block": 112173000}, "error": "receiving stream resp: rpc error: code = Unavailable desc = error reading from server: EOF"}

and then all the tier2 jobs get cancelled.

For this test there are 26 containers. One container running out of RAM causes the jobs on all the other 25 containers to exit as well. This wastes a lot of resources, because the work then has to be re-dispatched.

I'm using substreams-sink-noop on EOS (Antelope) for testing.

maoueh commented 1 year ago

This error is probably not being flagged as retryable here: https://github.com/streamingfast/substreams/blob/develop/orchestrator/work/worker.go#L100-L113. Retryable errors are flagged within https://github.com/streamingfast/substreams/blob/develop/orchestrator/work/worker.go#L139

matthewdarwin commented 1 year ago

The same behaviour happens when I try sending more than 1024 concurrent requests to Envoy via substreams-tier1-max-subrequests: 2000. Envoy by default has max_connections set to 1024, so establishing more connections than that is an error, and tier1 doesn't handle it nicely.

streamingfast / substreams

substreams-tier1 does not handle well the case that tier2 runs out of RAM and crashes #261