graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com
Apache License 2.0

[Bug] Various reports that substream-powered-subgraphs silently stop processing blocks #4863

Closed: chriswessels closed this issue 5 months ago

chriswessels commented 1 year ago

Bug report

Reports from StakeSquid, Pinax, and Data Nexus all confirm that substreams-powered subgraphs periodically stop processing blocks and fail silently. They also report that restarting graph-node fixes it.

Relevant log output

No response

IPFS hash

No response

Subgraph name or link to explorer

No response

Some information to help us out

OS information

None

colindickson commented 1 year ago

In my experience as an indexer operator:

We used to have an issue where substreams-based subgraphs would stop syncing. graph-node would seemingly fail to reconnect after a disconnection and blocks would stop flowing indefinitely. We would have to restart graph-node.

This seems to fit the description of the bug reported here.

I have alerts set up to warn me when this happens, and they used to fire frequently: about 2 or 3 times a week.

However, ever since we upgraded our indexer graph-nodes to v0.32.0-rc.0 and now v0.32.0, we have not had this issue at all.

We are running our indexer with the StreamingFast endpoints for both firehose and substreams.


The StreamingFast team is ready to assist in any debugging for this issue.

If you are encountering this issue, can you please provide the following:

mindstyle85 commented 1 year ago

Hi, I am adding this here although our bug is a bit different. To explain first: we have 2 indexers with the exact same graph-node config on both, and both are on version 0.32.

When we added substreams to the graph-node config on our primary indexer, about 1/3 of our Ethereum-only subgraphs stopped syncing, while the substreams subgraph was syncing normally. A restart of graph-node solved the stalling for a short while (less than 5 min), but then those same subgraphs got stuck again. When we removed the substreams from the graph-node config and just left ethereum there, all the Ethereum subgraphs started to sync normally again. When I switched back again, they stalled again. Note that the config is exactly the same as on our other indexer, where it all works normally. Not one thing is different.

Unfortunately I can't tell you about versions etc. since we just tested it and used a friend's endpoint. I was also scouring the logs but did not find anything at all that would indicate a warning, an error, or anything else that would explain why the Ethereum subgraphs stopped syncing.

maoueh commented 1 year ago

@mindstyle85 Can you describe your friend's infra a bit? For example, is it a single-machine deployment running everything? What are the rough specs of the machine(s) and the disks used to serve Firehose blocks? Also, how many subgraphs were served by the Firehose provider when you performed your tests?

What Firehose provider config in graph-node were you using when you tried it? (you can mask endpoint/keys)

trader-payne commented 1 year ago

This is my experience running substreams based subgraphs:

docker kill index-node-3 && docker restart index-node-3 && docker logs index-node-3 -f --tail 1000000 2>&1 | grep -i QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag
index-node-3
index-node-3
Sep 13 21:17:37.798 INFO Resolve subgraph files using IPFS, n_templates: 0, n_data_sources: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 13 21:17:40.991 INFO Successfully resolved subgraph files using IPFS, features: ø, n_templates: 0, n_data_sources: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 13 21:17:41.002 INFO Starting subgraph writer, queue_size: 5, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 13 21:17:41.088 INFO Data source count at start: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 13 21:17:41.095 INFO Blockstreams disconnected, connecting, provider_err_count: 0, cursor: xeKhcurHMv_Ro6Ttwgb0naWwLpcyB1JuVwLiIxdL1IKn-SfF3sj0AGR1bhvTmarziRHsHVyu2dnOR3tz-pRWutTix7k36iBqFy54movs-LG-fKGjO14cce9gWrjfN9LaUz3Vaw36fLsJtdXnPPTQbkcwZ8R1fzKyiz5WpIEAJvJA6iE3y236cpqHgqrCp9QUrOd0RuSnxH_2UmR9KklSbJqGYPKduTwm, start_block: 18129797, subgraph: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, endpoint_uri: substreams, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream
Sep 13 21:17:41.246 INFO Blockstreams connected, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream

That's it! No more logs, no errors, not advancing blocks, nada!

Tested with my own endpoint and Pinax's endpoint. Restarting the index node a dozen times sometimes moves it forward.

Specs of the machines:

graph-node machine (runs 7 index nodes and 1 query node): 5950X (16c/32t) / 128 GB DDR4 RAM / 2x NVMe RAID0
firehose all-in-one machine (runs the firehose + substreams stack altogether): 5950X (16c/32t) / 128 GB DDR4 RAM / 3x NVMe RAID0
postgres DB machine (runs only postgres, one shard, one instance): EPYC 7502P (32c/64t) / 128 GB DDR4 RAM / 5x NVMe RAID0

Load average on the machines: screenshots attached for the graph-node, firehose and postgres DB machines.

Note that the same behavior occurs whether or not my firehose machine is loaded like that. The load right now comes from me syncing a handful of ETH mainnet subgraphs. And again, the same behaviour is seen on Pinax's endpoints.

Graph-node config toml file:

[store]
[store.primary]
connection = "postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:5432/${GRAPH_NODE_DB_NAME}"
pool_size = 25

[chains]
ingestor = "index_node_0"

[chains.${CHAIN_0_NAME}]
shard = "primary"
protocol = "ethereum"
provider = [
{ label = "substreams", details = { type = "substreams", url = "${CHAIN_0_RPC_SUBSTREAMS}", token = "${STREAMING_FAST_API_TOKEN}", features = ["compression", "filters"], conn_pool_size = 1 } },
{ label = "firehose", details = { type = "firehose", url = "${CHAIN_0_RPC_FIREHOSE}", token = "${STREAMING_FAST_API_TOKEN}", features = ["compression", "filters"], conn_pool_size = 1 } },
{ label = "archive", url = "${CHAIN_0_RPC_ARCHIVE}", features = ["archive", "traces"] }
]

[chains.${CHAIN_1_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_1_NAME}", url = "${CHAIN_1_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_2_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_2_NAME}", url = "${CHAIN_2_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_3_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_3_NAME}", url = "${CHAIN_3_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_4_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_4_NAME}", url = "${CHAIN_4_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_5_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_5_NAME}", url = "${CHAIN_5_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_6_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_6_NAME}", url = "${CHAIN_6_RPC}", features = ["archive", "traces"] } ]

[deployment]
[[deployment.rule]]
shards = "primary"
indexers = [ "index_node_0", "index_node_1", "index_node_2", "index_node_3", "index_node_4", "index_node_5", "index_node_6" ]

[general]
query = "query_node_0"

firehose docker compose yaml:

  fireeth-mainnet:
    build:
      context: ./firehose-mainnet/
    command: -c /etc/firehose/config.yml start
    environment:
      - "ETHEREUM_ARCHIVE_RPC=${FIREHOSE_ETHEREUM_ARCHIVE_RPC:-https://rpc-fi-1.stakesquid.eu/erigon}"
      - "S3_BLOCKS_STORE_URL=${FIREHOSE_S3_BLOCKS_STORE_URL:-s3://162.55.135.148:8333/eth-blocks/}"
      - "AWS_ACCESS_KEY_ID=thegraph"
      - "AWS_SECRET_ACCESS_KEY=stakesquid"
      - "DSTORE_S3_BUFFERED_READ=true"
    expose:
      - 8545 # geth rpc
      - 8551 # consensus client
      - 13042 # firehose grpc
      - 13044 # substreams tier 1
      - 13045 # substreams tier 2
    ports:
      - "24465:24465"
      - "24465:24465/udp"
      - "13042:13042"
      - "13044:13044"
      - "13045:13045"
    labels:
      - "traefik.enable=true"
      - "traefik.http.services.firehose-mainnet.loadbalancer.server.scheme=h2c"
      - "traefik.http.services.firehose-mainnet.loadbalancer.server.port=13044"
      - "traefik.http.routers.firehose-mainnet.entrypoints=websecure"
      - "traefik.http.routers.firehose-mainnet.tls.certresolver=myresolver"
      - "traefik.http.routers.firehose-mainnet.rule=Host(`firehose-mainnet.$DOMAIN`)"
      - "traefik.http.routers.firehose-mainnet.middlewares=ipwhitelist"      
    networks:
      - chains
    volumes:
      - "./firehose-mainnet/config.yml:/etc/firehose/config.yml"
      - "geth-mainnet_data:/var/lib/geth"
      - "firehose-mainnet_data:/var/lib/firehose"
      - ".jwtsecret:/jwtsecret"
    restart: unless-stopped
    stop_grace_period: 1m

firehose config:

start:
  args:
  - merger
  - firehose
  - reader-node
  - relayer
  - combined-index-builder
  - substreams-tier1
  - substreams-tier2
  flags:
    data-dir: /var/lib/firehose
    log-to-file: false
    reader-node-log-to-zap: false
    common-chain-id: "1"
    common-network-id: "1"
    reader-node-path: /usr/bin/geth
    substreams-rpc-endpoints: $ETHEREUM_ARCHIVE_RPC
    common-merged-blocks-store-url: s3://162.55.135.148:8333/eth-blocks/eth-blocks?region=none&insecure=true
    reader-node-arguments: "--mainnet --datadir=/var/lib/geth
             --ipcpath=/var/lib/firehose/reader/ipc
             --http --http.api=eth,net,web3
             --authrpc.jwtsecret=/jwtsecret
             --authrpc.addr=0.0.0.0
             --authrpc.vhosts=*
             --http.port=8545 --http.addr=0.0.0.0 --http.vhosts=*
             --firehose-enabled --port=24465 --cache=2048"

Dockerfile used to build firehose:

FROM debian
# Install wget so the release binaries can be fetched
RUN apt-get update && apt-get install -y wget
# Fetch the fireeth binary from the firehose-ethereum v1.4.14 release
RUN wget -O- https://github.com/streamingfast/firehose-ethereum/releases/download/v1.4.14/firehose-ethereum_linux_x86_64.tar.gz | tar xvz -C /usr/bin
# Fetch the Firehose-instrumented geth build and make it executable
RUN wget https://github.com/streamingfast/go-ethereum/releases/download/geth-v1.12.2-fh2.3-1/geth_linux -O /usr/bin/geth
RUN chmod +x /usr/bin/geth

ENTRYPOINT ["fireeth"]

The whole Firehose stack runs without SSL because I can't make it work with traefik and I really can't be bothered to; it should run just fine without it :)

colindickson commented 1 year ago

@trader-payne if you activate debug logs on your index nodes, can you see any additional information?

can you confirm that you're running graph-node v0.32.0?
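
For reference, a minimal sketch of turning on debug logging, assuming the index nodes pick up graph-node's standard GRAPH_LOG environment variable (levels such as info, debug, trace):

# Sketch only: GRAPH_LOG controls graph-node's log verbosity.
# In a docker compose setup, add it to the index node's environment and recreate the container:
#   environment:
#     - GRAPH_LOG=debug
docker compose up -d --force-recreate index-node-3
docker logs index-node-3 -f --tail 1000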

trader-payne commented 1 year ago

Stacks versions here:

INDEXER_AGENT_VERSION="ghcr.io/graphprotocol/indexer-agent:v0.20.22"
INDEXER_SERVICE_VERSION="ghcr.io/semiotic-ai/autoagora-indexer-service:v0.1.2-v0.20.22"
GRAPH_NODE_VERSION="graphprotocol/graph-node:v0.32.0"

Will activate debug and will post here if I get more logs. Gotta close some allos now that I have those subs up to chainhead for an hour lol

madumas commented 1 year ago

The issue occurred here today for two subgraphs, with slightly different behavior.

1) QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag Graph-node stalled on block 18128999. At least one restart of graph-node wasn't enough to resume indexing, but it resumed immediately after restarting the Substreams tier 1 processes.

Sep 14 23:44:48.015 INFO Blockstreams disconnected, connecting, provider_err_count: 5, cursor: p_z44DJJ5njlKnwsDAxFVaWwLpc_DFtnVAntIxhIj4r0o3rB2JXyVGB1bRnUmaCj1EPiGlypjYqYHX8t-pRZu4e7xrEw7iRtTn19m4Do-bfnLPT2bAwYc-piDO-MYdzRWjTWYALyc7gK4de0MvfcbxZmY8cjKGK6hmxZ9IwFdqZC7CY8wmr1J8bVhqqT-NZC-rctQrWpnC2hBD0pK0sPO8-ANvLMuj4jYQ==, start_block: 18128000, subgraph: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, endpoint_uri: substreams, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 54, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream
Sep 14 23:44:48.047 INFO Blockstreams connected, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 54, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream
Sep 14 23:44:55.454 DEBG Start processing block, triggers: 1, block_hash: 0xeb243b56b0b25a5fe87a80be5af6427834ea42e9b8d6d3705e4540be52662589, block_number: 18129000, sgd: 54, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager

2) Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf Stalled at a graph-node restart, i.e. didn't start indexing. Restarting firehose tier 1 didn't fix it. Afterwards, another graph-node restart got it to catch up to block 18136999 (the latest xxxx999 block), with no errors in the graph-node logs. A third graph-node restart finally reached chain head.

graph-node:v0.32.0, internal Substreams endpoint running firehose-ethereum:v1.4.10.

I don't have logs from the moment the indexing stalled, as these were rotated. This is occurring on a fairly regular basis, every 2-3 days, usually on an xxxxxxx999 block.

This may have to do with the transition between the handling of cached Substreams data from tier 2 and live processing on tier 1.

trader-payne commented 1 year ago

@colindickson activated debug logs, there's nothing more in the logs than I've already provided.

azf20 commented 1 year ago

Thanks for the details here - hopefully #4866 can help with some debugging. Is there anything else that might help the investigation on the Graph Node side here? The xx999 block is an interesting clue.

mindstyle85 commented 1 year ago

@mindstyle85 Can you describe your friend's infra a bit? For example, is it a single-machine deployment running everything? What are the rough specs of the machine(s) and the disks used to serve Firehose blocks? Also, how many subgraphs were served by the Firehose provider when you performed your tests?

What Firehose provider config in graph-node were you using when you tried it? (you can mask endpoint/keys)

Hey, sorry for the late reply. I still don't know the exact specs, but afaik it's all one really strong server (AX102 equivalent or better) with enough disk space.

Our setup is also one server with 32 cores, 128 GB RAM and 12 TB of NVMe disks. CPU and RAM usage is a steady ~10%.

Config is basically the same as payne posted but without the token part. Graph-node is 0.32.0.

I just did another test with yet another endpoint besides the first one, and in both cases the same thing happened. When I switch to a substreams config, it starts syncing the substreams subgraph, but about 70 (out of approx. 200) Ethereum subgraphs get stalled. After restarting, about 20 of those 70 stalled ones pick back up for a minute or so and then stall again, while the other 50 stay stalled. I have done this flow 3 times now and it's the same result each time.

I also don't see anything in the logs about this, and I don't notice any particular block height at which this happens.

Sorry I can't help more, but there really isn't much to go on. I will see if I can get 1 or 2 more endpoints to test, to see if it's really the same with all of them.

jhjhjh94 commented 1 year ago

Currently running on a testnet chain (Eth Goerli).

.env

INDEXER_AGENT_VERSION="ghcr.io/graphprotocol/indexer-agent:v0.20.22"
INDEXER_SERVICE_VERSION="ghcr.io/semiotic-ai/autoagora-indexer-service:v0.1.2-v0.20.22"
GRAPH_NODE_VERSION="graphprotocol/graph-node:v0.32.0"

config.toml

[store]
[store.primary]
connection = "postgresql://${DB_USER}:${DB_PASS}/graph-node-db0?host=/var/run/postgresql"
pool_size = 10

[chains]
ingestor = "index_node_0"

[chains.mainnet]
shard = "primary"
provider = [
             { label = "substreams", details = { type = "substreams", url = "$SUBSTREAMS",  features = ["compression", "filters"], conn_pool_size = 100 } },
             { label = "peering", url = "$RPC", features = ["archive", "traces"] } ]

[deployment]
[[deployment.rule]]
indexers = [ "index_node_0" ]

[general]
query = "query_node_0"

Logs

Sep 15 10:08:14.855 DEBG Start processing block, triggers: 1, block_hash: 0xa791f522bf96ffd256fe17ec432acd9f860237e5910de5ea4d839c16779dd56d, block_number: 18140927, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:14.856 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 3812, evicted_weight: 0, weight: 1863968, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:14.856 INFO Applying 2 entity operation(s), block_hash: 0xa791f522bf96ffd256fe17ec432acd9f860237e5910de5ea4d839c16779dd56d, block_number: 18140927, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:14.860 DEBG Start processing block, triggers: 1, block_hash: 0xb3b3cd1b3a4fbffb386caf8fb2213c4264229ae082b0210eb64c83f9d0aec616, block_number: 18140928, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:14.861 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 3813, evicted_weight: 0, weight: 1864457, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:14.861 INFO Applying 2 entity operation(s), block_hash: 0xb3b3cd1b3a4fbffb386caf8fb2213c4264229ae082b0210eb64c83f9d0aec616, block_number: 18140928, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:25.127 DEBG Start processing block, triggers: 1, block_hash: 0xb828e52d1f89c75094c53c2f5b21891f8d5e320d41b9762826711b2790d0bc8d, block_number: 18140929, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:25.127 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 3814, evicted_weight: 0, weight: 1864947, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:25.127 INFO Applying 2 entity operation(s), block_hash: 0xb828e52d1f89c75094c53c2f5b21891f8d5e320d41b9762826711b2790d0bc8d, block_number: 18140929, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:32.785 INFO Committed write batch, time_ms: 17950, weight: 4535480, entities: 7618, block_count: 3809, block_number: 18140925, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:32.834 INFO Committed write batch, time_ms: 43, weight: 2137, entities: 2, block_count: 1, block_number: 18140926, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:32.849 INFO Committed write batch, time_ms: 15, weight: 2137, entities: 2, block_count: 1, block_number: 18140927, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:32.865 INFO Committed write batch, time_ms: 15, weight: 2137, entities: 2, block_count: 1, block_number: 18140928, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager
Sep 15 10:08:32.877 INFO Committed write batch, time_ms: 12, weight: 2138, entities: 2, block_count: 1, block_number: 18140929, sgd: 1, subgraph_id: QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw, component: SubgraphInstanceManager

Subgraphs

https://testnet.thegraph.com/explorer/subgraph?id=ApnvbDR5gExNyW8LvWopSTkd88DK78njG7CbeYu8pjH3&chain=goerli (Low chances of stalling)
https://testnet.thegraph.com/explorer/subgraph?id=HUZDsRpEVP2AvzDCyzDHtdc64dyDxx8FQjzsmqSg4H3B&chain=goerli (Always stalls)
https://testnet.thegraph.com/explorer/subgraph?id=3e63xu6PWvSGphm2hPbKtvh4rziS2jYSqxhz5rmRmcnx&chain=goerli (Always stalls)

Sometimes a restart works / doesn't work at getting the subgraphs syncing again (non-substreams subgraphs are sometimes affected as well). A few hours later the subgraph totally stops syncing all of a sudden (randomly).
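
For stalls like these, one way to check whether a deployment's latest block is still advancing, independently of the logs, is a minimal sketch against graph-node's indexing status API, assuming the index node exposes it on the default port 8030:

# Sketch only: ask the indexing status API for the latest and chain-head block
# of a specific deployment (hash taken from the logs above).
curl -s http://localhost:8030/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query":"{ indexingStatuses(subgraphs: [\"QmeWyu8pzV3zViL5nz813NxKSab7REh9qJBivjzkCzE9Vw\"]) { subgraph synced health chains { chainHeadBlock { number } latestBlock { number } } } }"}'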

trader-payne commented 1 year ago

^I'm having the exact same behavior described here. With either my own endpoint or theirs.

EDIT: adding a status screenshot; even with regular restarts, the substreams are 11h and 3h behind respectively: image

Restarted the WHOLE stack, all index nodes, and other stuff, and only one stream advanced a bit, now both stuck again: image

mindstyle85 commented 1 year ago

@mindstyle85 Can you describe your friend's infra a bit? For example, is it a single-machine deployment running everything? What are the rough specs of the machine(s) and the disks used to serve Firehose blocks? Also, how many subgraphs were served by the Firehose provider when you performed your tests? What Firehose provider config in graph-node were you using when you tried it? (you can mask endpoint/keys)

Hey, sorry for the late reply. I still don't know the exact specs, but afaik it's all one really strong server (AX102 equivalent or better) with enough disk space.

Our setup is also one server with 32 cores, 128 GB RAM and 12 TB of NVMe disks. CPU and RAM usage is a steady ~10%.

Config is basically the same as payne posted but without the token part. Graph-node is 0.32.0.

I just did another test with yet another endpoint besides the first one, and in both cases the same thing happened. When I switch to a substreams config, it starts syncing the substreams subgraph, but about 70 (out of approx. 200) Ethereum subgraphs get stalled. After restarting, about 20 of those 70 stalled ones pick back up for a minute or so and then stall again, while the other 50 stay stalled. I have done this flow 3 times now and it's the same result each time.

I also don't see anything in the logs about this, and I don't notice any particular block height at which this happens.

Sorry I can't help more, but there really isn't much to go on. I will see if I can get 1 or 2 more endpoints to test, to see if it's really the same with all of them.

OK, I just tested it with another endpoint, but this time the config looks a bit different:

[chains.mainnet]
shard = "primary"
provider = [ { label = "mainnet", url = "eth rpc", features = [ "archive", "traces" ] }, 
{ label = "substreams", details = { type = "substreams", url = "substream rpc" }} ]

And the substreams subgraph is syncing (at least for now) and there is no stalling on the Ethereum subgraphs. Suntzu said he will post his build when he can.

azf20 commented 1 year ago

Thanks for sharing those details. I think you can remove the following from the substreams config (though I would be surprised if that is the root cause) - @trader-payne can you try this too, as I see you have some of that info in your substreams provider as well?

features = ["compression", "filters"], conn_pool_size = 100

I don't know why there would be an impact on non-substreams subgraphs. Does block ingestion via the RPC continue as normal when those RPC subgraphs stop?

azf20 commented 1 year ago

@mindstyle85 @trader-payne can you also confirm whether those Ethereum subgraphs are syncing with a Firehose or RPC (maybe it is both, @trader-payne's config seemed to have a firehose in there but @mindstyle85's did not)

maoueh commented 1 year ago

@trader-payne @mindstyle85 @madumas @jhjhjh94 I would like us to separate the tracking of subgraphs indexed through Firehose from Substreams-powered subgraphs.

Those are two completely different components, both in terms of Firehose code and within graph-node. I read two different problems in this issue; they could be related (a gRPC stack problem within Rust would be a shared bug), but there is a good chance that different things are going on.

can help with some debugging, is there anything else that might help investigation on the Graph Node side here? the xx999 block is an interesting clue.

So we are missing visibility into what is happening on the server side serving Substreams, because some occurrences of long delays could be "normal" behavior (I say normal, but "actual" is more accurate, since improvements could be made here; see below for more details).

You will often see a Substreams stream appear "stuck" on an x999 boundary; let me explain why. In production mode, the processing and the streaming back of data are done in separate processes. The processing is done in chunks of 1000 blocks, at the end of which the final map outputs are saved to disk.

Data is only streamed back once this bundle is finally available. Because of this behavior, a Substreams stream can look stuck while it's still progressing normally; it's just that no visibility was offered into this. If a segment of 1000 blocks takes X minutes to process (because, for example, it makes a lot of eth_call requests, or the CPU is overloaded and has a hard time executing WASM code), then you will not receive blocks during that time.

And because of the dependency graph of modules, you could need to execute 30 such modules before having the data for those 1000 blocks. Substreams would be perceived as stuck while it's actually processing normally.
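
As a small illustration of that segmentation (assuming the 1000-block segment size described above), the boundary a stream appears to stick on can be computed from any block number, and it matches the 18128999 stall reported earlier in this thread:

# Sketch only: print the last block of the 1000-block Substreams segment
# that a given block number falls into (the "x999" boundary described above).
segment_end() { echo $(( ($1 / 1000) * 1000 + 999 )); }
segment_end 18128450   # arbitrary block in that segment; prints 18128999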

Another thing that can happen on the Substreams side: let's say the configured RPC provider (within the Substreams stack) is broken. This makes Substreams retry eth_call until it succeeds, which can also lead to visible stuckness on the graph-node side, and in that case the stream really is stuck. Again, it's a visibility problem here.

Now, to add a further problem: let's say you restart the graph-node process thinking it will help. Due to a missing feature on the Substreams side, this creates even more problems. Indeed, the partial computations within the current 1000-block segment are actually discarded. This means that on restart, we start back from the start block of the 1000-block boundary, leading to the impression that it will never finish.

Long inactivity periods on gRPC also sometimes trigger weird networking issues. Users recently reported the stream restarting more often than it should when using the Rust sink example (see https://github.com/streamingfast/substreams/issues/277). This is again a problem, as work is redone from scratch on stream restart, leading to never-ending work.

In all of the above, I'm talking about Substreams. Firehose is a different beast, and on this particular issue I think the stuckness is more in graph-node than in the Golang side (it's my gut feeling talking here, I could be wrong). Hence my desire to split the issue in two.

maoueh commented 1 year ago

Also, please note that I'm not claiming there are no bugs/problems, just describing the actual behavior so that everyone can better understand what is happening under the hood.

About observability: it will be the first step in debugging Substreams-based issues. In the last month we revamped the progress messages sent back from the endpoint in the stream; this has not been re-surfaced yet within graph-node, and it will be a major help here in identifying whether the endpoint is functioning properly or the stream is stuck.

We now have steady progress messages coming back to the client (2 messages per second, something like that). Those messages contain every detail about what the endpoint is currently doing: its current processing jobs, the last block processed, the last available linear block, and much more. They also contain timing information, for example around the eth_calls performed.

Along with the trace_id, we will need to bring those back into graph-node as formatted log output (we could probably print less often within graph-node, for example presenting the latest received progress every 15s or so).

In the long run, for observability we would also like it to be possible to "attach" to graph-node (or directly to the endpoint, providing a trace_id) and get a nice UI showing what is happening from the Substreams perspective. This could be well integrated within graph-node, but that is not a short-term task; we will see. For this, see issue #4826.
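
Until that lands in graph-node, endpoint-side progress can also be watched directly with the substreams CLI; a minimal sketch, with the endpoint as a placeholder:

# Sketch only: stream the package straight from the endpoint and watch its
# progress (running jobs, stages, last processed block) in the terminal UI.
substreams gui -e <substreams-endpoint>:443 \
  ipfs://QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag graph_out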

trader-payne commented 1 year ago

Thanks for sharing those details. I think you can remove the following from the substreams config (though I would be surprised if that is the root cause) - @trader-payne can you try this too, as I see you have some of that info in your substreams provider as well?

features = ["compression", "filters"], conn_pool_size = 100

I don't know why there would be an impact on non-substreams subgraphs. Does block ingestion via the RPC continue as normal when those RPC subgraphs stop?

My config has that set to 1, but I'll remove them completely to see what happens. I'll be back with another comment.

EDIT: Removed that and I get the same behavior:

Sep 15 14:19:12.021 DEBG Subgraph started, start_ms: 383, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphRegistrar
Sep 15 14:19:12.025 INFO Resolve subgraph files using IPFS, n_templates: 0, n_data_sources: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 14:19:13.251 INFO Successfully resolved subgraph files using IPFS, features: ø, n_templates: 0, n_data_sources: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 14:19:13.260 INFO Starting subgraph writer, queue_size: 5, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 14:19:13.332 INFO Data source count at start: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 14:19:13.337 DEBG Starting or restarting subgraph, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 14:19:13.338 DEBG Starting block stream, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 14:19:13.338 INFO Blockstreams disconnected, connecting, provider_err_count: 0, cursor: bQCVv8uCWmbsfGBkEGuNsKWwLpc_DFtnVAjsIxhIj4ui9nTMjsyiUmggax3YxKv1jhfjGQuu24qcECt88cBY79nolro17ygwQ3wpkYDq-rXsfKD3OAtOee5lCurdZdHRWjTWYQPyc7gK4IHhPPqKNkZga5IlLG7njToDoI0GIaEU7CIxljv-c8eB2PnD89ND9uogQ-GjnC-iBjZ5f0pbPJmKMvXKv28nbA==, start_block: 18140000, subgraph: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, endpoint_uri: substreams, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream
Sep 15 14:19:13.437 INFO Blockstreams connected, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream

with the config toml file looking like:

[store]
[store.primary]
connection = "postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:5432/${GRAPH_NODE_DB_NAME}"
pool_size = 25

[chains]
ingestor = "index_node_0"

[chains.${CHAIN_0_NAME}]
shard = "primary"
protocol = "ethereum"
provider = [
{ label = "substreams", details = { type = "substreams", url = "${CHAIN_0_RPC_SUBSTREAMS}" } },
{ label = "firehose", details = { type = "firehose", url = "${CHAIN_0_RPC_FIREHOSE}", features = ["compression", "filters"], conn_pool_size = 1 } },
{ label = "archive", url = "${CHAIN_0_RPC_ARCHIVE}", features = ["archive", "traces"] }
]

[chains.${CHAIN_1_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_1_NAME}", url = "${CHAIN_1_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_2_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_2_NAME}", url = "${CHAIN_2_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_3_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_3_NAME}", url = "${CHAIN_3_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_4_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_4_NAME}", url = "${CHAIN_4_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_5_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_5_NAME}", url = "${CHAIN_5_RPC}", features = ["archive", "traces"] } ]

[chains.${CHAIN_6_NAME}]
shard = "primary"
provider = [ { label = "${CHAIN_6_NAME}", url = "${CHAIN_6_RPC}", features = ["archive", "traces"] } ]

[deployment]
[[deployment.rule]]
shards = "primary"
indexers = [ "index_node_0", "index_node_1", "index_node_2", "index_node_3", "index_node_4", "index_node_5", "index_node_6" ]

[general]
query = "query_node_0"

trader-payne commented 1 year ago

More details on what I'm seeing on my end (gonna keep adding to this):

Scenario 1 - my erigon + pinax substream, no firehose defined

[chains.${CHAIN_0_NAME}]
shard = "primary"
protocol = "ethereum"
provider = [
{ label = "substreams", details = { type = "substreams", url = "${CHAIN_0_RPC_SUBSTREAMS}", token = "${STREAMING_FAST_API_TOKEN}" } },
{ label = "archive", url = "${CHAIN_0_RPC_ARCHIVE}", features = ["archive", "traces"] }
]

About 50-100 subgraphs stop syncing; substreams work.

Logs of a subgraph of this kind:


index-node-2
WARNING: The NETWORK_SUBGRAPH_DEPLOYMENT variable is not set. Defaulting to a blank string.
Recreating index-node-2 ... done
Sep 15 16:53:22.679 DEBG Subgraph started, start_ms: 435, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphRegistrar
Sep 15 16:53:22.682 INFO Resolve subgraph files using IPFS, n_templates: 0, n_data_sources: 2, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:53:24.972 INFO Data source count at start: 2, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:53:25.151 DEBG Requesting 0 block(s), sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: BlockStream
Sep 15 16:53:25.160 DEBG Starting or restarting subgraph, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:53:25.160 DEBG subscribing to chain head updates, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: BlockStream
Sep 15 16:53:25.162 DEBG Starting block stream, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager

Scenario 2 - adding Firehose in the mix (using Pinax's endpoint still):

[chains.${CHAIN_0_NAME}]
shard = "primary"
protocol = "ethereum"
provider = [
{ label = "substreams", details = { type = "substreams", url = "${CHAIN_0_RPC_SUBSTREAMS}", token = "${STREAMING_FAST_API_TOKEN}" } },
{ label = "firehose", details = { type = "firehose", url = "${CHAIN_0_RPC_FIREHOSE}", token = "${STREAMING_FAST_API_TOKEN}", features = ["compression", "filters"], conn_pool_size = 1 } },
{ label = "archive", url = "${CHAIN_0_RPC_ARCHIVE}", features = ["archive", "traces"] }
]

Immediately gets the stuck subgraphs going (cc @azf20):

root@omega ~/graphprotocol-omega-docker main* 19s ❯ docker kill index-node-2 && bash start.all --force-recreate index-node-2 && docker logs index-node-2 -f --tail 100 2>&1 | grep -i QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr

index-node-2
WARNING: The NETWORK_SUBGRAPH_DEPLOYMENT variable is not set. Defaulting to a blank string.
Recreating index-node-2 ... done
Sep 15 16:54:56.777 DEBG Subgraph started, start_ms: 399, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphRegistrar
Sep 15 16:54:56.797 INFO Resolve subgraph files using IPFS, n_templates: 0, n_data_sources: 2, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:54:59.003 INFO Successfully resolved subgraph files using IPFS, features: ø, n_templates: 0, n_data_sources: 2, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:54:59.213 INFO Starting subgraph writer, queue_size: 5, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:54:59.565 INFO Data source count at start: 2, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:54:59.670 DEBG Starting or restarting subgraph, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:54:59.670 DEBG Starting block stream, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:54:59.670 INFO Blockstream disconnected, connecting, provider_err_count: 0, cursor: 76otKnA0xG-9kZ2kqOrcr6WwLpcyB1JuUQntIhJLgY6n9nLF1ZuiUzd0PE_Tlv_13RfvQg6q3dzFFX55ppYEv9Pqw7pguCNrTn0kktjv8rbtLfr2PgsaIO1mC-OIYdLaUz3TYAL7eLsI49XjOPWNNUZnYZR3LmSx22tX84NVdvISvyBglTX6e87Wgv2f8YYUqbN3F-2mky2jBjZ1Kk0JbMWHZvHMuD0h, subgraph: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, start_block: 18142884, endpoint_uri: firehose, provider: firehose, deployment: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: FirehoseBlockStream
Sep 15 16:55:00.642 DEBG Start processing block, triggers: 0, block_hash: 0x9f7cb3c5bf2185ae52a19796690035c60e36a90be5f8bc14e05c3a2cd859acb2, block_number: 18142898, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.643 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 0, evicted_weight: 0, weight: 0, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.659 DEBG Start processing block, triggers: 0, block_hash: 0x3e47811c1eb6dc0ffff1026427d7e48e4fba251b84b5290a6c00648517a29f94, block_number: 18142899, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.660 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 0, evicted_weight: 0, weight: 0, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.674 DEBG Start processing block, triggers: 0, block_hash: 0x6e9f067a5a8265aa92c0ef4f42fd3b309619ab7bca3d1ccc6dbe823ff090863a, block_number: 18142900, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.674 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 0, evicted_weight: 0, weight: 0, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.677 DEBG Start processing block, triggers: 0, block_hash: 0xbf8baf31242251b5d50364a2468aa94ffb3aa4656b3e71a1c46fced31941ef14, block_number: 18142901, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.678 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 0, evicted_weight: 0, weight: 0, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.681 DEBG Start processing block, triggers: 0, block_hash: 0xc94ebfbaff4bd3e7643075f47138fcd110a83101df518f0410d94dbb0d6da52b, block_number: 18142902, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager
Sep 15 16:55:00.683 DEBG Entity cache statistics, evict_time_ms: 0, accesses: 0, hit_rate: 100%, stale_update: false, evicted_count: 0, count: 0, evicted_weight: 0, weight: 0, sgd: 624, subgraph_id: QmTCxfbri7XRST1bnyKpnZUnLNF2KMvX5sQMb2zux4qKhr, component: SubgraphInstanceManager

Scenario 3 - Pinax's substreams + my Firehose:

Subgraphs work fine, everything is up to chainhead. Substreams also work fine!

Scenario 4 - Pinax's Firehose + my Substreams:

Subgraphs work fine, as expected, Substreams stuck with:


index-node-2
WARNING: The NETWORK_SUBGRAPH_DEPLOYMENT variable is not set. Defaulting to a blank string.
Recreating index-node-2 ... done
Sep 15 17:04:46.072 DEBG Subgraph started, start_ms: 383, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubgraphRegistrar
Sep 15 17:04:46.075 DEBG Subgraph started, start_ms: 386, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphRegistrar
Sep 15 17:04:46.075 INFO Resolve subgraph files using IPFS, n_templates: 0, n_data_sources: 1, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubgraphInstanceManager
Sep 15 17:04:46.077 INFO Resolve subgraph files using IPFS, n_templates: 0, n_data_sources: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 17:04:48.464 INFO Successfully resolved subgraph files using IPFS, features: ø, n_templates: 0, n_data_sources: 1, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubgraphInstanceManager
Sep 15 17:04:48.475 INFO Starting subgraph writer, queue_size: 5, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubgraphInstanceManager
Sep 15 17:04:48.609 INFO Data source count at start: 1, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubgraphInstanceManager
Sep 15 17:04:48.646 DEBG Starting or restarting subgraph, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubgraphInstanceManager
Sep 15 17:04:48.646 DEBG Starting block stream, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubgraphInstanceManager
Sep 15 17:04:48.646 INFO Blockstreams disconnected, connecting, provider_err_count: 0, cursor: L7jAv1K48otl0oMDsX_n26WwLpc_DFtnVA_kIxdBj4r28iDC2pnyAGMgPhXZl_2lixG6TAz-2tzIRXp688QD7dC8wOox63RtFHsulY7m8-bqcPL3PFsbee40V-KHY4rRWjTWZgvyfLEK4dXlaPTeYxYyYJJwJG-022oGptRTJvEVunZkxz38d5yD0a2Vo9dHqrd3ROankiOrVTB1LUpfbMyKMqSXtzUhNw==, start_block: 18141961, subgraph: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, endpoint_uri: substreams, provider: substreams, deployment: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubstreamsBlockStream
Sep 15 17:04:48.693 INFO Successfully resolved subgraph files using IPFS, features: ø, n_templates: 0, n_data_sources: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 17:04:48.714 INFO Starting subgraph writer, queue_size: 5, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 17:04:48.748 INFO Blockstreams connected, provider: substreams, deployment: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, sgd: 1474, subgraph_id: Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf, component: SubstreamsBlockStream
Sep 15 17:04:48.818 INFO Data source count at start: 1, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 17:04:48.848 DEBG Starting or restarting subgraph, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 17:04:48.849 DEBG Starting block stream, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubgraphInstanceManager
Sep 15 17:04:48.850 INFO Blockstreams disconnected, connecting, provider_err_count: 0, cursor: DqMc9yimM5oGO1rF0CP7KKWwLpc_DFtnVA_lIxlDj9ij9nPN1Jn0AGEmaRXXkqGiiEDrQwytjd7IEC16oJZZtIC-ku016XVpFXMtmorr_uDvePCkagxNcLI0CbjdYorRWjTWZgrycrMKs4DhO_vQYxAyYpQnJGGxh20F94VcJqJCuHYxkD2vJcbaga_HpNNFq7N2TOWoli6mUzV9LxkJO5qDbqTJ7W8gNw==, start_block: 18140983, subgraph: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, endpoint_uri: substreams, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream
Sep 15 17:04:48.890 INFO Blockstreams connected, provider: substreams, deployment: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, sgd: 1356, subgraph_id: QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag, component: SubstreamsBlockStream

Scenario 5 - My Firehose + My substreams

As before, substreams are stuck, everything else working.

Conclusion

So there are two conclusions here:

  1. You need the firehose part, otherwise some (not all) subgraphs stop.
  2. My substreams endpoint is having trouble, nobody knows why.

Worth noting that even with Pinax's stack, the Substreams get stuck after an hour or two, requiring graph-node restarts.

trader-payne commented 1 year ago

Now I'm on my own Firehose + Pinax's substreams, and a handful of subgraphs stopped syncing, no logs to share: image

QmeUiEKEDTJ5gPRoN5FcUZn2tp1cBdae5DMbFDiL8UPxrZ
QmTyCFxA6XqBhFSht2H4pgo8Y5TPimuKjkJPuHiVCsQYij
QmSGDNPW2iAwwgh4He5eeBTaSLeTJpWKzuNfP2McAdBVq1
QmeUiEKEDTJ5gPRoN5FcUZn2tp1cBdae5DMbFDiL8UPxrZ
QmPziMSbs4i7oQ6AfxmY5HDXMRVHiBBb8soWqncLpKfQGr
QmcoMfXV54X21zw7TChyBPAKqj2csdEawCdxjsj98FKhsc

azf20 commented 1 year ago

Thanks so much @trader-payne - when you don't have the Firehose configured does the block ingestor seem to be progressing?

Worth noting even with Pinax's stack, the Substreams get stuck after an hour or two, requiring graphnode restarts.

This is in "Scenario 3 - Pinax's substreams + my Firehose" - i.e. there is no persistent "success" configuration for you right now?

maoueh commented 1 year ago

Scenario 1 - my erigon + pinax substream, no firehose defined

Firehose is not involved anymore in this config; each non-Substreams subgraph will be served by a standard web3 provider. The fact that non-Substreams subgraphs fail could maybe be caused by a problem transitioning from Firehose -> web3 provider (I assume that before Scenario 1, those subgraphs were served by Firehose).

The logs show which provider serves a particular subgraph.

trader-payne commented 1 year ago

Thanks so much @trader-payne - when you don't have the Firehose configured does the block ingestor seem to be progressing?

No, it does not, even though you can find the log line saying Sep 20 23:45:09.872 INFO Starting firehose block ingestor for network, network_name: mainnet and no errors nor anything else suggesting it won't work.

EDIT: looking at the Grafana dashboard, it seems the block ingestor works; the last block number in the DB keeps increasing, but the logs don't show this:

docker logs index-node-0 -f --tail 1000000 2>&1 | grep -i mainnet
Sep 20 23:49:03.910 INFO Started block ingestor, network_name: mainnet
Sep 20 23:49:03.910 INFO Starting firehose block ingestor for network, network_name: mainnet

^those are all the logs

This is in "Scenario 3 - Pinax's substreams + my Firehose" - i.e. there is no persistent "success" configuration for you right now?

Correct, there's no working solution on my side.

trader-payne commented 1 year ago

@azf20 @maoueh changing to this graph-node configuration works! 👀

[chains.${CHAIN_0_NAME}]
shard = "primary"
protocol = "ethereum"
provider = [
{ label = "substreams", details = { type = "substreams", url = "${CHAIN_0_RPC_SUBSTREAMS}", token = "${STREAMING_FAST_API_TOKEN}" } },
#{ label = "firehose", details = { type = "firehose", url = "${CHAIN_0_RPC_FIREHOSE}", token = "${STREAMING_FAST_API_TOKEN}", features = ["compression", "filters"], conn_pool_size = 1 } },
{ label = "bootstrap", details = { type = "web3", url = "${CHAIN_0_RPC_ARCHIVE}", features = ["archive", "traces"] } }
]

Notice the different label and type added for the archive part. 👀 The block ingestor works and subgraphs are syncing. I'll leave this overnight; I hope I won't have to eat my words and it will keep going 🤞🏻

PS: this is using Pinax's Substreams endpoint and my archive node. Will try w/ my own substreams endpoint tomorrow and will post an update

image

It's clearly using my archive endpoint for non-substream subgraphs.

azf20 commented 1 year ago

Interesting - did it work overnight?

@mangas @lutter would be interested if you have a view here on some of the different behaviour described above

trader-payne commented 1 year ago

Yeah but I had my eth node crash due to memory constraints - it was running on the same machine as the index nodes 🔪 💀

trader-payne commented 12 months ago

Any luck with the investigation so far @azf20 @maoueh? Is there anything I can do to help here?

I found a potential bottleneck - the IO on the database server is basically pegged at 100% all the time: image

I'm gonna stop the substreams to see how/if that affects anything. EDIT: image

No change after killing the substreams, the IO stays the same, it's mainly read IO that's causing the load - lots of autovacuum processes running.

Even though there's this bottleneck, all the other subgraphs are keeping up to chainhead, and all the subgraphs that I'm syncing from scratch are progressing. The only thing stopping are the substreams. And I doubt @jhjhjh94 is in the same situation with a database containing 800+ subgraphs
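
Not confirmed in this thread, but a quick way to see what autovacuum is doing is to list its workers in pg_stat_activity; a minimal sketch, assuming psql access to the graph-node database (connection URL is a placeholder):

# Sketch only: show running autovacuum workers and how long they have been active.
psql "$GRAPH_NODE_DB_URL" -c \
  "SELECT pid, now() - xact_start AS running_for, query
     FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%';"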

azf20 commented 12 months ago

@trader-payne is this with the config which was initially working?

trader-payne commented 12 months ago

Any config actually, the db is choked. But like I said, subgraphs sync no problem.

madumas commented 12 months ago

Adding a data point to the investigation. I found the Uniswap v2 substreams-powered subgraph stopped today, but too late to capture logs (30k blocks behind).

The command substreams gui ipfs://QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag showed that the substream had also been interrupted around the same block and had now just resumed processing blocks into the cache. Once the substream appeared to have reached chain head (according to the gui), a few graph-node restarts over the next 3 hours were necessary to bring graph-node to chain head, as it would stop on intermediate xx999 blocks as if the substreams cache wasn't as far along as indicated by the gui.

On this setup, firehose is disabled; we're using an internal substreams + RPC combo.

jhjhjh94 commented 12 months ago

Any luck with the investigation so far @azf20 @maoueh? Is there anything I can do to help here?

I found a potential bottleneck - the IO on the database server is basically pegged at 100% all the time: image

I'm gonna stop the substreams to see how/if that affects anything. EDIT: image

No change after killing the substreams, the IO stays the same, it's mainly read IO that's causing the load - lots of autovacuum processes running.

Even though there's this bottleneck, all the other subgraphs are keeping up to chainhead, and all the subgraphs that I'm syncing from scratch are progressing. The only thing stopping are the substreams. And I doubt @jhjhjh94 is in the same situation with a database containing 800+ subgraphs

Confirmed the same from my side, IO bottleneck, with or without substreams still the same

maoueh commented 11 months ago

@jhjhjh94

Confirmed the same from my side, IO bottleneck, with or without substreams still the same

I/O is the same with and without Substreams?

maoueh commented 11 months ago

@trader-payne No work has been done yet, but I would now like to build a version with extra logging to check where the hanging is happening. I will branch it from v0.32.0 for now.

maoueh commented 11 months ago

For the stuck Substreams-powered subgraphs, the version graphprotocol/graph-node:canary-investigate-stuck-substreams-def370693 can be tested.

The logs I added are at INFO level and might be a bit verbose; they are for Substreams only, so you can deploy this build only on the node that is going to index the Substreams-powered subgraph. It prints:

jhjhjh94 commented 11 months ago

@jhjhjh94

Confirmed the same from my side, IO bottleneck, with or without substreams still the same

I/O is the same with and without Substreams?

Yes, that's correct

maoueh commented 11 months ago

From @trader-payne's testing: graph-node.log

The subgraph doesn't move towards chain head at all during the period covered by the logs.

indexer-agent/t6oandDPdf (C) Qmf4VVPWSejxoGLg3nZuoCgPBxuARuiayjsdt6oandDPdf
1474
index_node_3
true
mainnet
275
55.0 min

The logs I shared are with Pinax's Substreams + my Firehose.

maoueh commented 11 months ago

From there, it seems the gRPC stream is not stuck: progress is sent regularly, so the connection is still up. Here is some data extracted from the logs above:

[11:49:01.180] Cursor { 
    value: "c1:1:18248084:ce8a9684a967ade283106876aa1aa0830889a2d32858ab3cadb79676e1866aa4:18248002:200e573a4b69f26eb58734d7f51ad7e350606b223c4a2ad5c19ab99f1cf0503f" 
} 
[11:49:13.910] SessionInit { 
    trace_id: "e0c0afb9363755551c73bf970ec663a4", 
    resolved_start_block: 18248085, 
    linear_handoff_block: 18248100, 
    max_parallel_workers: 10,
}
[11:49:31.538] ModulesProgress { 
    running_jobs: [
        Job { stage: 5, start_block: 18248000, stop_block: 18248100, processed_blocks: 0, duration_ms: 8759 }, 
        Job { stage: 4, start_block: 18248000, stop_block: 18248100, processed_blocks: 0, duration_ms: 8739 }, 
        Job { stage: 3, start_block: 18248000, stop_block: 18248100, processed_blocks: 0, duration_ms: 8719 }, 
        Job { stage: 8, start_block: 18248000, stop_block: 18248100, processed_blocks: 0, duration_ms: 8817 }, 
        Job { stage: 7, start_block: 18248000, stop_block: 18248100, processed_blocks: 0, duration_ms: 8798 }, 
        Job { stage: 6, start_block: 18248000, stop_block: 18248100, processed_blocks: 0, duration_ms: 8778 },
    ], 
    modules_stats: [
        ModuleStats { 
            name: "map_pool_created",
            total_processed_block_count: 300,
            total_processing_time_ms: 7771, 
            external_call_metrics: [ExternalCallMetric { name: "rpc:eth_call", count: 108, time_ms: 408 }],
            total_store_operation_time_ms: 0,
            total_store_read_count: 0,
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_pools",
            total_processed_block_count: 300,
            total_processing_time_ms: 233, 
            external_call_metrics: [],
            total_store_operation_time_ms: 0,
            total_store_read_count: 0,
            total_store_write_count: 6, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 83976637, 
            total_store_merging_time_ms: 603, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_output_token_supply",
            total_processed_block_count: 200,
            total_processing_time_ms: 4859, 
            external_call_metrics: [],
            total_store_operation_time_ms: 0,
            total_store_read_count: 60792,
            total_store_write_count: 48, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 17065215, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: true, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_pool_events",
            total_processed_block_count: 100,
            total_processing_time_ms: 2480, 
            external_call_metrics: [],
            total_store_operation_time_ms: 2,
            total_store_read_count: 123218,
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248000
        }, 
        ModuleStats { 
            name: "store_input_token_balances",
            total_processed_block_count: 100,
            total_processing_time_ms: 220, 
            external_call_metrics: [],
            total_store_operation_time_ms: 0,
            total_store_read_count: 0,
            total_store_write_count: 8310, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 99275, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248000
        }, 
        ModuleStats {
            name: "store_volume_by_token_amount",
            total_processed_block_count: 100,
            total_processing_time_ms: 312, 
            external_call_metrics: [],
            total_store_operation_time_ms: 0,
            total_store_read_count: 0,
            total_store_write_count: 6536, 
            total_store_deleteprefix_count: 6536, 
            store_size_bytes: 106693, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248000
        }
    ],
    stages: [
        Stage { modules: ["map_pool_created", "store_pools"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] }, 
        Stage { modules: ["store_output_token_supply"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] }, 
        Stage { modules: ["map_pool_events", "store_input_token_balances", "store_volume_by_token_amount"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248000 }] }, 
        Stage { modules: ["store_native_prices"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248000 }] }, 
        Stage { modules: ["store_usd_prices"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248000 }] }, 
        Stage { modules: ["map_events_entity", "map_token_entity", "store_volume", "store_pool_tvl"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248000 }] }, 
        Stage { modules: ["store_cumulative_fields", "store_protocol_daily_fields", "store_protocol_tvl", "store_daily_and_hourly_fields"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248000 }] }, 
        Stage { modules: ["map_liquidity_pool_snapshots_entity", "map_liquidity_pool_entity", "store_protocol_cumulative_fields"], completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248000 }] }, 
        Stage { modules: ["map_financial_daily_snapshot_entity", "map_protocol_entity", "graph_out"], completed_ranges: [] }
    ], 
    processed_bytes: Some(ProcessedBytes { total_bytes_read: 173213415, total_bytes_written: 0 })
}

...

[11:54:34.451] ModulesProgress { 
    running_jobs: [],
    modules_stats: [
        ModuleStats { 
            name: "store_output_token_supply", 
            total_processed_block_count: 800, 
            total_processing_time_ms: 27417, 
            external_call_metrics: [], 
            total_store_operation_time_ms: 2, 
            total_store_read_count: 243168, 
            total_store_write_count: 192, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 17065215, 
            total_store_merging_time_ms: 158, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_input_token_balances", 
            total_processed_block_count: 700, 
            total_processing_time_ms: 2379, 
            external_call_metrics: [], 
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 58170, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 277707832,
            total_store_merging_time_ms: 3235, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100 
        }, 
        ModuleStats { 
            name: "store_native_prices", 
            total_processed_block_count: 600, 
            total_processing_time_ms: 4731, 
            external_call_metrics: [], 
            total_store_operation_time_ms: 0, 
            total_store_read_count: 29916, 
            total_store_write_count: 39888, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 150312696,
            total_store_merging_time_ms: 1062,
            store_currently_merging: false, 
            highest_contiguous_block: 18248100 
        }, 
        ModuleStats { 
            name: "map_events_entity", 
            total_processed_block_count: 400, 
            total_processing_time_ms: 2463, 
            external_call_metrics: [], 
            total_store_operation_time_ms: 0, 
            total_store_read_count: 26556,
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_protocol_tvl",
            total_processed_block_count: 300,
            total_processing_time_ms: 1572,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 4986, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 75, 
            total_store_merging_time_ms: 50, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_daily_and_hourly_fields",
            total_processed_block_count: 300,
            total_processing_time_ms: 32174,
            external_call_metrics: [],
            total_store_operation_time_ms: 5, 
            total_store_read_count: 0, 
            total_store_write_count: 58824, 
            total_store_deleteprefix_count: 58824, 
            store_size_bytes: 1790515, 
            total_store_merging_time_ms: 28, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_liquidity_pool_entity",
            total_processed_block_count: 200,
            total_processing_time_ms: 2528,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 29916, 
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_protocol_cumulative_fields",
            total_processed_block_count: 200,
            total_processing_time_ms: 603,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 13072, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 297, 
            total_store_merging_time_ms: 51, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_financial_daily_snapshot_entity",
            total_processed_block_count: 100,
            total_processing_time_ms: 2313,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 16620, 
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_pool_events",
            total_processed_block_count: 700,
            total_processing_time_ms: 23616,
            external_call_metrics: [],
            total_store_operation_time_ms: 24, 
            total_store_read_count: 862526, 
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_token_entity",
            total_processed_block_count: 400,
            total_processing_time_ms: 688,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_cumulative_fields",
            total_processed_block_count: 300,
            total_processing_time_ms: 1490,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 19608, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 94410895, 
            total_store_merging_time_ms: 1097, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "graph_out",
            total_processed_block_count: 100,
            total_processing_time_ms: 433,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_pools",
            total_processed_block_count: 900,
            total_processing_time_ms: 708,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 18, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 83976637, 
            total_store_merging_time_ms: 603, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_usd_prices",
            total_processed_block_count: 500,
            total_processing_time_ms: 3454,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 16745, 
            total_store_write_count: 8520, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 14233271, 
            total_store_merging_time_ms: 184, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_protocol_daily_fields",
            total_processed_block_count: 300,
            total_processing_time_ms: 1697,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 19608, 
            total_store_deleteprefix_count: 19608, 
            store_size_bytes: 451, 
            total_store_merging_time_ms: 50, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_pool_created",
            total_processed_block_count: 900,
            total_processing_time_ms: 23975,
            external_call_metrics: [
                ExternalCallMetric { name: "rpc:eth_call", count: 360, time_ms: 774 }
            ], 
            total_store_operation_time_ms: 0,
            total_store_read_count: 0,
            total_store_write_count: 0,
            total_store_deleteprefix_count: 0,
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_volume_by_token_amount",
            total_processed_block_count: 700,
            total_processing_time_ms: 11667,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 0, 
            total_store_write_count: 45752, 
            total_store_deleteprefix_count: 45752, 
            store_size_bytes: 674079, 
            total_store_merging_time_ms: 15, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_volume",
            total_processed_block_count: 400,
            total_processing_time_ms: 2379,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 13072, 
            total_store_write_count: 19608, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 88619831, 
            total_store_merging_time_ms: 727, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "store_pool_tvl",
            total_processed_block_count: 400,
            total_processing_time_ms: 2118,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 33240, 
            total_store_write_count: 6648, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 50919718, 
            total_store_merging_time_ms: 573, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_liquidity_pool_snapshots_entity",
            total_processed_block_count: 200,
            total_processing_time_ms: 6221,
            external_call_metrics: [],
            total_store_operation_time_ms: 6, 
            total_store_read_count: 119664, 
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }, 
        ModuleStats { 
            name: "map_protocol_entity",
            total_processed_block_count: 100,
            total_processing_time_ms: 438,
            external_call_metrics: [],
            total_store_operation_time_ms: 0, 
            total_store_read_count: 8310, 
            total_store_write_count: 0, 
            total_store_deleteprefix_count: 0, 
            store_size_bytes: 0, 
            total_store_merging_time_ms: 0, 
            store_currently_merging: false, 
            highest_contiguous_block: 18248100
        }
    ], 
    stages: [
        Stage { 
            modules: ["map_pool_created", "store_pools"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] 
        }, 
        Stage { 
            modules: ["store_output_token_supply"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] 
        },
        Stage { 
            modules: ["map_pool_events", "store_input_token_balances", "store_volume_by_token_amount"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }]
        }, 
        Stage { 
            modules: ["store_native_prices"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] 
        }, 
        Stage { 
            modules: ["store_usd_prices"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] 
        }, 
        Stage { 
            modules: ["map_events_entity", "map_token_entity", "store_volume", "store_pool_tvl"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] 
        }, 
        Stage { 
            modules: ["store_cumulative_fields", "store_protocol_daily_fields", "store_protocol_tvl", "store_daily_and_hourly_fields"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] 
        }, 
        Stage { 
            modules: ["map_liquidity_pool_snapshots_entity", "map_liquidity_pool_entity", "store_protocol_cumulative_fields"], 
            completed_ranges: [BlockRange { start_block: 10008300, end_block: 18248100 }] 
        }, 
        Stage { 
            modules: ["map_financial_daily_snapshot_entity", "map_protocol_entity", "graph_out"], 
            completed_ranges: [BlockRange { start_block: 18248000, end_block: 18248100 }] 
        }
    ], 
    processed_bytes: Some(ProcessedBytes { total_bytes_read: 1864069898, total_bytes_written: 0 }) 
}
trader-payne commented 11 months ago

@maoueh not sure what that means - is there anything I can do to make them progress as intended? This has started to affect people querying those subgraphs, and it's only getting worse. Restarting the index nodes only gets them going for a little while until they get stuck again. Super annoying.

maoueh commented 11 months ago

Not clear yet; further investigation will be required. I'll probably need to change the log a bit, but most probably we will need to get server logs to understand what is happening, or add more info to the progress message.

maoueh commented 11 months ago

trace.log

This is the server's trace log. I think we will need to reproduce again with the server in debug mode to log more detail. I will analyze these logs for now.

maoueh commented 11 months ago

@trader-payne Pinax's log for your request ends at 13:00, but yours ends at ~11:54. Did you let it run until 13:00?

trader-payne commented 11 months ago

I don't know how long I let it run, but I can reproduce this again and talk to Pinax to grab logs together. Shall I?

maoueh commented 11 months ago

Yes, we will need more logs to understand what is happening; the server will need to be in debug mode to collect deeper logs that would help the investigation. So if you can talk with Pinax to coordinate that, it would be perfect.

I got my hands on Pinax's log for your query, but it's at INFO level and was not enough to better understand the problem, although we have some hypotheses.

jhjhjh94 commented 11 months ago

Updated our graph-nodes to canary-investigate-stuck-substreams-def370693.

Currently fetching logs for QmQJovmQLigEwkMWGjMT8GbeS2gjDytqWCGL58BEhLu9Ag (Substreams Uniswap v3 Ethereum on Goerli). So far it's synced; I'm waiting for the next stall and will provide the logs then.

jhjhjh94 commented 11 months ago

@maoueh graph.log

It didn't stall on its own for the past 24 hours, so I had to restart the index-node, and then it stalled. Attached above are the logs.

maoueh commented 11 months ago

@jhjhjh94 At what timestamp would you pinpoint the start of the stall? Also, do you by chance have more logs of that, ideally going back to the last "Blockstreams disconnected, connecting" log line?

Do you also have the logs of the backend server by chance?

jhjhjh94 commented 11 months ago

> @jhjhjh94 At what timestamp would you pinpoint the start of the stall? Also, do you by chance have more logs of that, ideally going back to the last "Blockstreams disconnected, connecting" log line?
>
> Do you also have the logs of the backend server by chance?

@maoueh The stall started at the same time the log starts.

I just did another run (I had to manually restart the index node, and then it starts stalling); I was unable to reproduce the "sudden stall".

The logs look almost the same as the ones I submitted above ("Waiting for next response...", "Next response received...", and once in a while a "Received progress update", repeating).

maoueh commented 11 months ago

The next step will be to collect DEBUG logs on the Substreams server when the stall happens; tier1 logs filtered by the trace ID of the request (my version and graph-node 0.33 should both log it on re-connection) will be enough as a starting point.
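
As a rough illustration of the kind of trace-ID filtering meant here (not the actual Substreams log schema; the JSON-object-per-line format, the `trace_id` field name, and the use of `serde_json` are all assumptions for the sketch):

```rust
use std::io::{self, BufRead};

// Reads log lines from stdin and prints only those whose "trace_id" field
// matches the requested ID. Requires the `serde_json` crate; assumes one
// JSON object per line, which may not match the real server log format.
fn main() {
    let wanted = std::env::args()
        .nth(1)
        .expect("usage: filter-trace <trace_id> < server.log");

    for line in io::stdin().lock().lines() {
        let Ok(line) = line else { continue };
        // Skip lines that are not valid JSON (e.g. plain-text log output).
        let Ok(entry) = serde_json::from_str::<serde_json::Value>(&line) else {
            continue;
        };
        if entry.get("trace_id").and_then(|v| v.as_str()) == Some(wanted.as_str()) {
            println!("{line}");
        }
    }
}
```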

From the progress, we see that there are no running jobs anymore, but it seems that no data is then sent. We will need the Substreams server logs to correlate what is happening.

I've also slightly tweaked the "Next response received" log to also print the received message, which will help with stream visibility. The image graphprotocol/graph-node:canary-investigate-stuck-substreams-ccac28fd4 should be available soon (CI is running right now).
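
To make the failure mode concrete: the client ends up holding a stream that stays open but never yields another message. This is not graph-node's actual implementation, just a minimal sketch of a client-side stall watchdog around such a stream, assuming the `tokio` and `futures` crates; the `Resp` placeholder type, the idle limit, and the reconnect handling are all illustrative:

```rust
use std::time::Duration;

use futures::{Stream, StreamExt};
use tokio::time::timeout;

/// Drive a response stream and treat a long silence as a stall.
async fn drive_with_watchdog<S, Resp>(
    mut stream: S,
    idle_limit: Duration,
) -> Result<(), &'static str>
where
    S: Stream<Item = Resp> + Unpin,
{
    loop {
        match timeout(idle_limit, stream.next()).await {
            // A message (data or progress) arrived before the deadline: keep going.
            Ok(Some(_resp)) => continue,
            // The server closed the stream cleanly: let the caller reconnect.
            Ok(None) => return Ok(()),
            // Nothing at all for `idle_limit`: surface the stall instead of waiting forever.
            Err(_elapsed) => return Err("no message within the idle limit; force a reconnect"),
        }
    }
}

#[tokio::main]
async fn main() {
    // Toy stream: yields three messages and then goes silent forever,
    // mimicking a connection that stays open but stops delivering data.
    let live = futures::stream::iter([1u32, 2, 3]);
    let stalled = live.chain(futures::stream::pending::<u32>());

    let outcome = drive_with_watchdog(stalled, Duration::from_secs(2)).await;
    println!("{outcome:?}"); // Err("no message within the idle limit; force a reconnect")
}
```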

azf20 commented 11 months ago

Thanks @maoueh @jhjhjh94 @trader-payne - it seems like we need more info from the Substreams side. Is there anything else we can add to Graph Node to understand what is happening here? cc @mangas

maoueh commented 11 months ago

@azf20 Printing the progress messages at a regular interval is something we will need, IMO. Essentially, it means re-working this commit https://github.com/graphprotocol/graph-node/commit/def370693f66915727028ddeb1695785e9877b78 so that progress messages are printed; we should also re-work the log presentation to be more concise.
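
For illustration only (this is not graph-node code), here is a minimal sketch of what throttled, condensed progress logging could look like. The `ProgressSummary` fields are a hand-picked stand-in for the much richer `ModulesProgress` message, and the 30-second interval is an arbitrary example:

```rust
use std::time::{Duration, Instant};

/// Hypothetical, trimmed-down view of a progress update.
struct ProgressSummary {
    running_jobs: usize,
    total_processed_blocks: u64,
    highest_contiguous_block: u64,
}

/// Logs a one-line summary at most once per interval, instead of dumping
/// every full progress message as it arrives.
struct ProgressLogger {
    interval: Duration,
    last_logged: Option<Instant>,
}

impl ProgressLogger {
    fn new(interval: Duration) -> Self {
        Self { interval, last_logged: None }
    }

    fn maybe_log(&mut self, p: &ProgressSummary) {
        let now = Instant::now();
        let due = self
            .last_logged
            .map_or(true, |t| now.duration_since(t) >= self.interval);
        if due {
            println!(
                "substreams progress: running_jobs={}, processed_blocks={}, highest_contiguous_block={}",
                p.running_jobs, p.total_processed_blocks, p.highest_contiguous_block
            );
            self.last_logged = Some(now);
        }
    }
}

fn main() {
    let mut logger = ProgressLogger::new(Duration::from_secs(30));
    // In graph-node this would be driven by each incoming progress message;
    // here we call it twice to show that only the first call within the window logs.
    let update = ProgressSummary {
        running_jobs: 0,
        total_processed_blocks: 9_000,
        highest_contiguous_block: 18_248_100,
    };
    logger.maybe_log(&update);
    logger.maybe_log(&update); // suppressed: still inside the 30s window
}
```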