Upstream queries sometimes are not killed when downstream process crashes

notactuallytreyanastasio commented 2 years ago

I noticed this today while I was working on a new query to improve some stuff on an internal application.

A simple version of my query is like so:

WITH padded_upcs AS (
    SELECT
        item.TITLE,
        LPAD(guide_to_things.CODE, 13, '0') AS UPC
    FROM
        NAMESPACE.SUBNAMESPACE.ITEM item
    INNER JOIN
        NAMESPACE.SUBNAMESPACE.GUIDE guide_to_things
    ON
        item.FOO = guide_to_things.FOO
    WHERE
        UPPER(guide_to_things.CODE) = 'UPC'
    QUALIFY
        ROW_NUMBER()
    OVER
        (PARTITION BY item.ITEM ORDER BY item.LOAD_DTS DESC) = 1
),

gtins AS (
    SELECT
        item.TITLE,
        LPAD(guide_to_things.CODE, 13, '0') AS UPC
    FROM
        NAMESPACE.SUBNAMESPACE.ITEM item
    INNER JOIN
        NAMESPACE.SUBNAMESPACE.GUIDE guide_to_things
    ON
        item.FOO = guide_to_things.FOO
    WHERE
        UPPER(guide_to_things.CODE) = 'UPC'
    QUALIFY
        ROW_NUMBER()
    OVER
        (PARTITION BY item.ITEM ORDER BY item.LOAD_DTS DESC) = 1
)

SELECT
    p.TITLE,
    g.GTIN
FROM
    padded_upcs p
JOIN
    gtins g
ON
    charindex(p.upc, g.gtin) > 0

In this case, we are gathering a group of about 200,000 items, and a group of about 90,000 items between the joins.

In the snowflake UI, this query takes about 8 minutes to run. If you have a sufficiently slow-ish query that is similar, anything in that ballpark should suffice. If you are a PepsiCo e-comm engineering employee and want an easy, reusable example, I can provide the full, real version of this query to use for debugging.

The long and short of it is this to demonstrate the problem:

Open 6 iex -S mix shells
In each one, use snowflex to run the above query

You will now see after about 10-15 minutes (accounting for some extra serialization time through ODBC) a couple have results come in, but if you check out the snowflake activity panel, you can eye the queued/running queries.

Our work configuration allows 2 queries/user to be run for developer account concurrently, so with 6 it begins to stack.

So, after about 20 minutes (our configured timeout, use whatever one you have configured for your application to hit this checking point) the stacking and queue will make it so its unfeasible for some to complete in under 20 minutes.

Now, we will see something like this

Note the bottom right, my connection that timed out. The other 5 are complete or have had iex exit and the process is dead.

However, if we go and look at snowflake:

We still see 2 queries running. Both for over 20 minutes, that were certainly sent from Elixir-land.

This leads me to a couple theories:

Somehow :odbc is not correctly handling process checkout from poolboy and sending the appropriate kill signal
We are doing something silly and swallowing the failure yet raising the exception and a process is raising a poolboy exception but not killed (seems less likely since I ctrl-c'd the top left terminal)
Its something else deeper and I haven't the slightest idea

I'm not sure about how to go about fixing this, but I wanted to document it while it was fresh and I had screenshots that verified I am not silly and that I could in fact replicate this after it bit me earlier in the day. I initially thought that there was somehow some serious serialization overhead causing the timeout, but once I realized I had ~20 queries queued but only 1 open terminal process, it was pretty clear something is off in the library itself.

notactuallytreyanastasio commented 2 years ago

Upon digging through some of the source here, my gut thinks the problem could be with send_heartbeat/1 and its SELECT 1. It just seems like if there is a process signaling error to the upstream server (snowflake) it may be because some or possibly all of the time this SELECT 1 keeps it alive.

I dont know though, I'm just poking around a little while I am bored this evening.

dustinfarris commented 2 years ago

check out ABORT_DETACHED_QUERY

pepsico-ecommerce / snowflex

Upstream queries sometimes are not killed when downstream process crashes #44