indexsupply / shovel

An Ethereum to Postgres indexer
https://indexsupply.com/shovel
MIT License

reorg cleanup query timeout #272

Closed: ryandotsmith closed this issue 1 week ago

ryandotsmith commented 2 months ago

It is possible that the query used to delete data during a reorg is too slow for the default 10s database timeout. During a reorg, Shovel deletes rows from shovel.task_updates with one query and from each integration's table with another.

It is possible that one of these two queries is taking longer than 10s. Since Shovel automatically creates an index on shovel.task_updates, I'm wondering if we are missing an index on the integration tables.

The ideal fix is to make these queries fast. Perhaps with a better index.

If you are experiencing this problem please use the postgres logs to show the exact query that is timing out. Also please share the table DDL so we can look for missing indexes.
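For illustration, an index matching the reorg delete's predicate on an integration table might look like the following. The table name, column names, and the shape of the delete are assumptions for the example, not confirmed against Shovel's actual schema:

```
-- Hypothetical: if the reorg delete on an integration table has the form
--   delete from <table> where src_name = $1 and ig_name = $2 and block_num >= $3
-- then an index with the equality columns first and the range column last
-- lets Postgres find the doomed rows without a sequential scan.
CREATE INDEX IF NOT EXISTS native_token_transfers_reorg_idx
  ON native_token_transfers (src_name, ig_name, block_num);
```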

lexomis commented 2 weeks ago

Hi Ryan, first time caller, but we've been successfully using shovel for a bit now. I'm bumping up against this bug consistently with my poly integration and I'd love to help debug.

Logs =>

2024-11-01 23:55:26 UTC [4086392]: [67256a57.3e5a78-6] 186145559 redacted@redacted,app=redacted [57014] CONTEXT: while deleting tuple (37,36) in relation "task_updates"
2024-11-01 23:55:26 UTC [4086392]: [67256a57.3e5a78-7] 186145559 redacted@redacted,app=redacted [57014] STATEMENT: delete from shovel.task_updates where src_name = $1 and ig_name = $2 and num >= $3
2024-11-01 23:55:26 UTC [4086392]: [67256a57.3e5a78-5] 186145559 redacted@redacted,app=redacted [57014] ERROR: canceling statement due to statement timeout
2024-11-01 23:55:30 UTC [4086394]: [67256a57.3e5a7a-10] 186145284 redacted@redacted,app=shovel-task-polygon-default-nativeTokenTransfers-0648 [25P03] FATAL: terminating connection due to idle-in-transaction timeout

Not sure exactly what you're looking for on DDL but here's the create tx =>


  CREATE TABLE shovel.task_updates (
    num numeric NULL,
    hash bytea NULL,
    insert_at timestamp with time zone NULL DEFAULT now(),
    src_hash bytea NULL,
    src_num numeric NULL,
    nblocks numeric NULL,
    nrows numeric NULL,
    latency interval NULL,
    src_name text NULL,
    stop numeric NULL,
    chain_id integer NULL,
    ig_name text NULL
  );

Things I've tried that haven't changed anything:

* the task_updates table only has about 1400 rows in it, so it's pretty tiny. I've also cleaned out old irrelevant rows (no effect)
* extended the statement_timeout in the connection string (which I suspect is being overridden by the pool connection)
* extended the statement_timeout as the user's default (same result)
* manually ran the delete as the user: it was fast and always deleted 0 rows, and EXPLAIN ANALYZE always reported around 0.5ms execution time, dramatically lower than even the 5s timeout you place on the connection string
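As a sketch of those checks (the role name and the literal parameter values below are made up for the example):

```
-- Raise the role's default statement_timeout, as in the steps above:
ALTER ROLE shovel_user SET statement_timeout = '60s';

-- Time the reorg delete by hand; these values are placeholders:
EXPLAIN ANALYZE
DELETE FROM shovel.task_updates
WHERE src_name = 'polygon-default'
  AND ig_name = 'nativeTokenTransfers'
  AND num >= 63000000;
```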

happy to provide whatever else I can. shovel is a great piece of kit.

ugh. sorry about github's formatting, which now seems to always be horrible.

lexomis commented 2 weeks ago

a few more points: this always happens during a reorg, and when it does (which always seems to be on poly), the entire thing grinds to a halt and none of the integrations continue. It's that last bit that makes this bug a killer, because I'd expect that even if the poly integration falls over, the rest would keep progressing.

for now i have poly turned off and everything is running ok, but happy to turn it back on to help debug if i can.

ryandotsmith commented 2 weeks ago

@lexomis can you share the shovel logs during the time of the reorg / halt?

ryandotsmith commented 2 weeks ago

@lexomis can you give me a minimal config that has a chance of reproducing the issue? I bet if I can hit it locally then I can fix it quickly.

lexomis commented 2 weeks ago

well this is quite annoying. i woke up to make a few changes, grab the logs, and try to build a reproducible pattern for you, and it unblocked itself. now it's running fine against polygon.

not sure if this helps, but to reproduce this I'm pretty sure I'd need to dump a portion of my task_updates table to you. When I cleaned out rows that appeared to be old blocks from hours before the noted reorg blocks, and updated my startAt to the first block remaining in task_updates, that unblocked the system.

this is about the third time I've seen this with poly, so I suspect it won't be terribly long until I can reproduce it again. Will be in touch at that time.

apologies if I missed our one chance to track this phantom down. lol.

ryandotsmith commented 2 weeks ago

Interesting. If you can give me your config.json to run on my local machine then I will have a good chance of reproducing and then fixing. I don't need your data. Just the config.json. Please remove anything that isn't related to the integration that is causing the problems.

Also, per your last comment, I still suspect that the delete on the task_updates table is timing out because of too much data. One way you could test this hypothesis is to:

1) Periodically delete from task_updates where insert_at < now() - '1 day'::interval, or
2) If it gets stuck, psql into the database and run the query from 1)
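Spelled out as SQL, the periodic cleanup in 1) would be roughly:

```
-- Drop task_updates rows older than one day:
DELETE FROM shovel.task_updates
WHERE insert_at < now() - '1 day'::interval;
```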

lexomis commented 2 weeks ago

Yes. I was following your hypothesis, and the first time I performed the delete there was not much data there and execution took less than 10ms, so there couldn't have been a real timeout happening unless the timeout was set to 0.

I was convinced it was a permissions thing, or weird erroring by the Go library, but that didn't seem to be the case either.

I'll keep watch. Config below.

Since I use Timescale, I've now made task_updates a hypertable and turned on data retention for only the last day, so hopefully this makes it all just work™
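A minimal sketch of that Timescale setup, assuming the hypertable is partitioned on insert_at (exact parameters may differ from what I actually ran):

```
-- Convert the existing table to a hypertable, keeping current rows:
SELECT create_hypertable('shovel.task_updates', 'insert_at', migrate_data => true);

-- Automatically drop chunks older than one day:
SELECT add_retention_policy('shovel.task_updates', INTERVAL '1 day');
```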

lexomis commented 2 weeks ago

{
  "pg_url": "$DATABASE_URL",
  "eth_sources": [
    {
      "name": "polygon-default",
      "chain_id": 137,
      "concurrency": 1,
      "poll_duration": "2s",
      "urls": [
        "$PRIMARY_PROVIDER_POLYGON_HTTP"
      ],
      "ws_url": "$PRIMARY_PROVIDER_POLYGON_WS"
    }
  ],
  "integrations": [
    {
      "enabled": true,
      "name": "nativeTokenTransfers",
      "sources": [
        {
          "name": "polygon-default",
          "start": "63771860"
        }
      ],
      "table": {
        "name": "native_token_transfers",
        "columns": [
          {
            "name": "block_unix_timestamp",
            "type": "int"
          },
          {
            "name": "chain_id",
            "type": "int"
          },
          {
            "name": "block_hash",
            "type": "bytea"
          },
          {
            "name": "transaction_hash",
            "type": "bytea"
          },
          {
            "name": "from",
            "type": "bytea"
          },
          {
            "name": "to",
            "type": "bytea"
          },
          {
            "name": "tokens",
            "type": "numeric"
          },
          {
            "name": "call_type",
            "type": "text"
          }
        ]
      },
      "block": [
        {
          "name": "block_time",
          "column": "block_unix_timestamp"
        },
        {
          "name": "chain_id",
          "column": "chain_id"
        },
        {
          "name": "block_hash",
          "column": "block_hash"
        },
        {
          "name": "tx_hash",
          "column": "transaction_hash"
        },
        {
          "name": "tx_signer",
          "column": "from"
        },
        {
          "name": "tx_to",
          "column": "to"
        },
        {
          "name": "tx_value",
          "column": "tokens",
          "filter_op": "gt",
          "filter_arg": [
            "0"
          ]
        }
      ]
    }
  ]
}

ryandotsmith commented 2 weeks ago

Thank you for the config. I'm running it locally. Let's see if I can hit the issue too.

lexomis commented 2 weeks ago

Cool. I'm running too and watching as well.

ryandotsmith commented 2 weeks ago

So far so good on my end. Hit a couple of reorgs and no issues yet.

lexomis commented 2 weeks ago

quite annoying slash working like a charm! lol.

ryandotsmith commented 2 weeks ago

Ok. Let's keep an eye on it. This actually makes me think the bug is related to database size.

ryandotsmith commented 2 weeks ago

@lexomis I was finally able to reproduce the issue. I believe I have a fix that I just landed on main. The description of the bug and its fix are in the commit message. Please try 51f0111d9b54cd64b13db913ba30fbbeacc806f6

lexomis commented 1 week ago

Can confirm everything running well now.

I also had the same reorg hit on Poly but only had a chance to update today.

Your update unstuck everything and shovel is back to purring like a dream. Amazing. Thank you!

ryandotsmith commented 1 week ago

Wonderful. I am glad this bug has been put to rest.