notactuallytreyanastasio opened 2 years ago
Upon digging through some of the source here, my gut thinks the problem could be with `send_heartbeat/1` and its `SELECT 1`. It just seems like, if there is an error in process signaling to the upstream server (Snowflake), it may be because some or possibly all of the time this `SELECT 1` keeps it alive.
I don't know though; I'm just poking around a little while I am bored this evening.
Check out `ABORT_DETACHED_QUERY`.
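For anyone else poking at this: as I understand it, `ABORT_DETACHED_QUERY` is a Snowflake parameter that, when set to TRUE, makes Snowflake abort an in-progress query a few minutes after the submitting connection disappears (the default is to let it run to completion). A minimal sketch of turning it on for a session from Elixir, assuming a working ODBC DSN (the connection string below is a placeholder):

```elixir
# Sketch only: flip ABORT_DETACHED_QUERY on for this session so that queries
# whose client connection vanishes get aborted instead of running to the end.
# "DSN=snowflake" is a placeholder for whatever connection string you use.
:odbc.start()
{:ok, conn} = :odbc.connect(~c"DSN=snowflake", [])

:odbc.sql_query(conn, ~c"ALTER SESSION SET ABORT_DETACHED_QUERY = TRUE")
```

It can also be set at the account or user level in Snowflake rather than per session.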
I noticed this today while I was working on a new query to improve some stuff on an internal application.
A simple version of my query is like so:
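(Any slow-ish join with a similar shape works as a stand-in; the table and column names below are made up, not the real internal ones.)

```elixir
# Made-up stand-in for the real internal query: a join across two large-ish
# tables that takes on the order of minutes in the warehouse.
slow_query = """
SELECT i.item_id, i.name, o.order_id, o.total
FROM items i
JOIN orders o ON o.item_id = i.item_id
WHERE o.created_at > DATEADD(month, -6, CURRENT_TIMESTAMP())
"""
```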
In this case, we are gathering a group of about 200,000 items and a group of about 90,000 items between the joined tables.
In the snowflake UI, this query takes about 8 minutes to run. If you have a sufficiently slow-ish query that is similar, anything in that ballpark should suffice. If you are a PepsiCo e-comm engineering employee and want an easy, reusable example, I can provide the full, real version of this query to use for debugging.
The long and short of it is this to demonstrate the problem: start `iex -S mix` in 6 separate shells and kick the query off in each.
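Roughly what each shell runs, from memory; `MyApp.Warehouse.query/1` is a stand-in for however your application pushes a query through the pooled snowflex/`:odbc` workers, not a library function:

```elixir
# Sketch of what each of the 6 iex shells kicks off. MyApp.Warehouse.query/1 is
# a placeholder for your app's own call into the pooled snowflex/:odbc workers.
slow_query = "SELECT ..."  # the stand-in join from above

{micros, result} = :timer.tc(fn -> MyApp.Warehouse.query(slow_query) end)
IO.inspect({div(micros, 1_000_000), result}, label: "seconds elapsed / result")
```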
You will now see after about 10-15 minutes (accounting for some extra serialization time through ODBC) that a couple of them have results come in, but if you check out the Snowflake activity panel, you can eye the queued/running queries.
Our work configuration allows 2 concurrent queries per developer account, so with 6 running it begins to stack.
So, after about 20 minutes (our configured timeout; use whatever timeout your application has configured to hit this checkpoint), the stacking and queueing make it infeasible for some of the queries to complete in under 20 minutes.
Now, we will see something like this:
Note the bottom right: my connection that timed out. The other 5 are complete, or `iex` has exited and the process is dead. However, if we go and look at Snowflake:
We still see 2 queries running, both for over 20 minutes, that were certainly sent from Elixir-land.
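If you would rather confirm this from SQL than from the activity panel, something like the sketch below lists what is still executing (run it through any connection that can read `INFORMATION_SCHEMA`; my recollection of the exact column names may be off):

```elixir
# Cross-check from SQL instead of the UI: list queries still marked RUNNING in
# the recent query history, via Snowflake's INFORMATION_SCHEMA.QUERY_HISTORY
# table function. "DSN=snowflake" is a placeholder connection string.
:odbc.start()
{:ok, conn} = :odbc.connect(~c"DSN=snowflake", [])

:odbc.sql_query(conn, ~c"""
SELECT query_id, query_text, start_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE execution_status = 'RUNNING'
ORDER BY start_time
""")
```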
This leads me to a couple theories:

- `:odbc` is not correctly handling process checkout from poolboy and sending the appropriate kill signal

I'm not sure how to go about fixing this, but I wanted to document it while it was fresh and I had screenshots verifying that I am not silly and that I could in fact replicate this after it bit me earlier in the day. I initially thought that there was somehow some serious serialization overhead causing the timeout, but once I realized I had ~20 queries queued but only 1 open terminal process, it was pretty clear something is off in the library itself.
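To make the poolboy/`:odbc` theory above concrete, here is a minimal sketch (my own toy code, not snowflex's actual worker; all names are made up) of the failure mode I suspect: the pooled worker owns the ODBC connection and stays blocked in `:odbc.sql_query/2` after the caller's timeout, so the connection never drops and Snowflake keeps running the statement.

```elixir
defmodule HeartbeatlessWorker do
  # Toy stand-in for a pooled snowflex worker: it owns the ODBC connection and
  # blocks in handle_call/3 for the full duration of the query.
  use GenServer

  def start_link(conn_str), do: GenServer.start_link(__MODULE__, conn_str)

  @impl true
  def init(conn_str) do
    :odbc.start()
    {:ok, conn} = :odbc.connect(to_charlist(conn_str), [])
    {:ok, conn}
  end

  @impl true
  def handle_call({:run, sql}, _from, conn) do
    {:reply, :odbc.sql_query(conn, to_charlist(sql)), conn}
  end
end

{:ok, worker} = HeartbeatlessWorker.start_link("DSN=snowflake")

try do
  # Mirrors the pooled checkout: the caller gives up after 20 minutes...
  GenServer.call(worker, {:run, "SELECT ..."}, :timer.minutes(20))
catch
  :exit, {:timeout, _} ->
    # ...but nothing kills the worker or disconnects. In the real pool, poolboy
    # owns the worker, so the caller dying does not take the connection down
    # either. GenServer.stop(worker) (or :odbc.disconnect/1 inside the worker)
    # would drop the connection and let ABORT_DETACHED_QUERY clean up upstream.
    :timed_out
end
```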