snowflakedb / gosnowflake

Go Snowflake Driver
Apache License 2.0
292 stars 122 forks source link

SNOW-1346547: Error when using async queries: "390112: Your session has expired. Please login again." #1119

Closed williamhbaker closed 1 month ago

williamhbaker commented 4 months ago

Please answer these questions before submitting your issue. In order to accurately debug the issue this information is required. Thanks!

  1. What version of GO driver are you using? 1.9.0

  2. What operating system and processor architecture are you using? darwin/arm64

  3. What version of GO are you using? go version go1.22.0 darwin/arm64

4.Server version:* E.g. 1.90.1 8.16.0

  1. What did you do?
  1. What did you expect to see?

The program continue to run. Instead, it will occasionally crash with an error "390112: Your session has expired. Please login again."

In the driver code, I see here where sessions should be getting renewed. From what I can see, an async query will later end up in this loop. I'm wondering if a session is still valid on that first check but just about to expire, and the session ends up expired in the retry loop without being gracefully handled? This would race with the session renewal heartbeat process that is enabled with client_session_keep_alive.

  1. Can you set logging to DEBUG and collect the logs?

Not easily - this is currently being worked-around by always creating a new client when a batch of async queries need to be run. My guess is that this gets a "fresh" session each time, instead of letting the prior session expire.

  1. What is your Snowflake account identifier, if any? (Optional)
sfc-gh-dszmolka commented 4 months ago

hi and thanks for raising this and the hints too! I'll take a look and see if i can reproduce. speaking about which, do you by any chance have perhaps a reproduction which when run, leads to the issue you're experiencing? probably would speed up debugging. Thank you in advance !

sfc-gh-dszmolka commented 4 months ago

trying to reproduce the issue, so far with no success. Using the following approach with go1.21 and gosnowflake 1.9.0..

  1. since the error message clearly indicates the authentication token expired, focusing on this direction. This token by default has lifetime of 4 hours. With session policies its possible to change that.

So implemented a 5-minute session policy instead of the 4-hour to speed up tests. Now my authentication token is only valid for 5m.

CREATE SESSION POLICY test_db.public.session_policy_5m
  SESSION_IDLE_TIMEOUT_MINS = 5
  SESSION_UI_IDLE_TIMEOUT_MINS = 5
  COMMENT = 'Session policy 5m'
;
ALTER ACCOUNT SET SESSION POLICY test_db.public.session_policy_5m;
  1. using the following simple repro program:
    
    package main

import ( "context" "database/sql" "fmt" "log" "time" "strconv" sf "github.com/snowflakedb/gosnowflake" )

func main() { connectionString := "username:password@myorg-myaccount/test_db/public?warehouse=COMPUTE_WH&role=ACCOUNTADMIN&tracing=TRACE&CLIENT_SESSION_KEEP_ALIVE=true&CLIENT_SESSION_KEEP_ALIVE_HEARTBEAT_FREQUENCY=3600" // default heartbeat is 3600 (1h) anyways noCachedResults := "ALTER SESSION SET USE_CACHED_RESULT = FALSE;" selectQuery := "SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.PARTSUPP;" // 800M rows, 35.1GB data sleepSeconds := 290

db, err := sql.Open("snowflake", connectionString) if err != nil { log.Fatal(err) } defer db.Close()

// enable async mode ctx := sf.WithAsyncMode(context.Background()) conn, err := db.Conn(ctx)

if err != nil { log.Fatalf("Failed to acquire connection. err: %v", err) } defer conn.Close()

_, err = conn.ExecContext(ctx, noCachedResults) if err != nil { log.Fatalf("failed to execute ALTER SESSION. err: %v", err) }

// sleep for a little less than 5 minutes (300s) // Session idle policy is set to 5 minutes in this test fmt.Println("Sleeping " + strconv.Itoa(sleepSeconds) + " seconds.") time.Sleep(time.Duration(sleepSeconds) * time.Second)

// now run the big query in an async manner // this query surely should take more than 5 minutes on a x-small warehouse rows, err := conn.QueryContext(ctx, selectQuery) if err != nil { // 'handle' error fmt.Println(err) } defer rows.Close()

var counter int var v1, v2, v3, v4, v5 sql.NullString

for rows.Next() { err = rows.Scan(&v1, &v2, &v3, &v4, &v5) if err != nil { log.Fatalf("failed to get result. err: %v", err) } // not really interested in query result, but still want it executed fully counter++ } if rows.Err() != nil { fmt.Printf("ERROR: %v\n", rows.Err()) return } fmt.Println(counter) }



The query takes around 8.5 minutes on a x-small warehouse to execute. Here's what happens:
* `ALTER SESSION` executes immediately then we wait 290 seconds
* 10 seconds left from the token lifetime. We issue the big query (= token lifetime is now again 300s)
* since the query is executed asynchronously, the driver automatically polls every 45 seconds to ask Snowflake if the query is done yet and results available. It is not done yet.
* Every 45 seconds the polls happen
* After 8+ minutes, the last poll encounters a SUCCESS response from Snowflake, and starts to download the query results.

Never encountered `Your session has expired. Please login again` error so far, hinting the authentication token expired, which makes sense since the query itself and the polls should renew it. 

So at this point I'm a little baffled how to get to the issue and therefore to troubleshoot further, it would be really appreciated if you could please send a minimal viable reproduction program, which when run, would exhibit the issue you're having.

As a mitigation, you can perhaps consider adding `CLIENT_SESSION_KEEP_ALIVE_HEARTBEAT_FREQUENCY=900` to the [session parameters](https://github.com/snowflakedb/gosnowflake/blob/v1.9.0/driver_test.go#L83-L84) and see if it helps you, regardless of whether i can or cannot reproduce the issue ? (supposed to send a heartbeat every 15 minutes, instead of the default 1 hours)

Thank you in advance for the reproducible example !
williamhbaker commented 4 months ago

Here's a reproduction of the issue, which has up to this point been illusive to me. It has to run for 1 hour in order for the error to occur, seemingly regardless of what the session idle timeout is set to in the session policy. The 2 hour SYSTEM$WAIT just needs to be long enough to be longer than 1 hour apparently, since the program will exit with the "session expired" error after 1 hour has elapsed.

func main() {
    connectionString := "redacted"

    db, err := sql.Open("snowflake", connectionString)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    ctx := context.Background()

    if err := db.PingContext(ctx); err != nil {
        log.Fatal(err)
    }

    res, err := db.ExecContext(sf.WithAsyncMode(ctx), "CALL SYSTEM$WAIT(2, 'HOURS');")
    if err != nil {
        log.Fatal(err)
    }

    if _, err := res.RowsAffected(); err != nil {
        log.Fatal(err)
    }
}
sfc-gh-dszmolka commented 4 months ago

thank you for the hint and the repro - I'm taking a look.

sfc-gh-dszmolka commented 4 months ago

looks like you were right; 1 hour is indeed relevant - because it looks like the heartbeat itself which is breaking things and the heartbeat happens after one hour. We'll fix this and I'll update this thread on the progress as any new information becomes available.

sfc-gh-dszmolka commented 2 months ago

fix PR under review at https://github.com/snowflakedb/gosnowflake/pull/1160

sfc-gh-dszmolka commented 2 months ago

fix PR is merged and will be part of the next release

sfc-gh-dszmolka commented 1 month ago

released with gosnowflake v1.11.0 in July 2024