snowflakedb / snowflake-connector-net

Snowflake Connector for .NET
Apache License 2.0
173 stars 130 forks source link

SNOW-1337012: `ExecuteNonQuery` deadlocks when called from a ThreadPool thread #923

Open DVN237294 opened 2 months ago

DVN237294 commented 2 months ago

Issue description

We are currently using the PUT file:// API to upload files to an internal stage in Snowflake. Due to current limitations, PUTs must be done synchronously. We have however seen that calls to ExecuteNonQuery sometimes deadlocks our application. After some investigation I believe that these deadlocks stems from use of "sync over async" constructs in the RestRequester implementation, here:

https://github.com/snowflakedb/snowflake-connector-net/blob/14cf8a5daaf9611b237eb4b51ccdd61cb44761d0/Snowflake.Data/Core/RestRequester.cs#L51-L55

Post here will execute the PostAsync method on a ThreadPool thread and block the currently executing thread. However, when the currently executing thread is already a ThreadPool thread there is a risk that the work will be scheduled on the same thread, causing a deadlock.

For completeness, here is the code and query I am running:

using var connection = new SnowflakeDbConnection(_snowflakeConnectionString);
using var command = connection.CreateCommand();

connection.Open();
command.CommandTimeout = 0;
command.CommandText = string.Format(_uploadFileToStageScript.Contents, filePath, overwrite);
command.ExecuteNonQuery();

_uploadFileToStageScript.Contents contains PUT file://{{0}} @{Stage} AUTO_COMPRESS=false OVERWRITE={{1}};

Suggested solution

The proper solution would be to avoid any "sync over async" constructs and use entirely synchronous methods in the synchronous APIs. Secondarily, async support for PUT commands would be nice.

sfc-gh-dszmolka commented 2 months ago

hi - thank you for reporting this and providing all the details! i'll try to set up a repro and see

sfc-gh-dszmolka commented 2 months ago

this is how i try to reproduce the issue, and so far, did not see it. Of course not sure about its frequency either (or if I'm headed in the right direction at all.)

Repro attempt: run a big long SELECT query + PUT in the same thread and hope there's a deadlock. PUT has a ~22MB zipped CSV and executes in a couple seconds.

Tried the below too, with following modifications:

class Program { // Connection parameters private const string Account = "myorg-myaccount"; private const string Username = "myuser"; private const string Password = "mypassword"; private const string Warehouse = "mywh";

static void Main(string[] args)
{
    // Initialize Snowflake connection
    string connectionString = $"account={Account};user={Username};password={Password};warehouse={Warehouse};";
    using (var connection = new SnowflakeDbConnection())
    {
        connection.ConnectionString = connectionString;
        connection.Open();

        // Execute both queries in the same ThreadPool thread
        ThreadPool.QueueUserWorkItem(state =>
        {
            PutQuery1(connection);
            //PutQuery2(connection);
            SelectQuery(connection);

        });

        /*
        // Execute two queries in separate ThreadPool threads
        ThreadPool.QueueUserWorkItem(SelectQuery, connection);
        ThreadPool.QueueUserWorkItem(PutQuery, connection);
        */

        Thread.Sleep(5000); // 

        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();
    }
}

static void SelectQuery(object state)
{
    SnowflakeDbConnection connection = (SnowflakeDbConnection)state;
    using (var command = connection.CreateCommand())
    {
        // quite big query, takes some time and returns 15.000.000 rows
        command.CommandText = "SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF100.CUSTOMER;";
        using (var reader = command.ExecuteReader())
        {
            int count = 0;
            while (reader.Read()) { count++; };
            Console.WriteLine($"Thread {Thread.CurrentThread.ManagedThreadId}: Select returned {count} rows.");
        }
    }
}

static void PutQuery1(object state)
{
    SnowflakeDbConnection connection = (SnowflakeDbConnection)state;
    using (var command = connection.CreateCommand())
    {
        command.CommandText = "PUT file://C:\\temp\\building-consents-issued-february-2024.zip @test_db.public.MYSTAGE1 auto_compress=false overwrite=true;";
        command.ExecuteNonQuery();
        Console.WriteLine($"Thread {Thread.CurrentThread.ManagedThreadId}: PUT query executed successfully.");
    }
}

static void PutQuery2(object state)
{
    SnowflakeDbConnection connection = (SnowflakeDbConnection)state;
    using (var command = connection.CreateCommand())
    {
        command.CommandText = "PUT file://C:\\temp\\building-consents-issued-february-2024-2.zip @test_db.public.MYSTAGE1 auto_compress=false overwrite=true;";
        command.ExecuteNonQuery();
        Console.WriteLine($"Thread {Thread.CurrentThread.ManagedThreadId}: PUT query 2 executed successfully.");
    }
}

}



Can you advise is this something representing what you're trying to do and see the issue with ? Or if there's a sample minimal viable repro program, which when run, exhibits the issue you're seeing - that would be fantastic and speed up the debugging a lot. Thank you in advance !
DVN237294 commented 2 months ago

Hey, thanks for taking the time to look into this.

I have been trying to debug this for a while, and I'm now at the point where I'm not entirely convinced the lockup is due to this "sync over async" construct. I have a dotnet-dump memory dump from the application, but my knowledge of TPL and/or default ThreadPool implementation is too limited to narrow the cause of the lockup to prove that this Task.Run(...).Result construct is at the cause of it. All I know for certain is that the threadpool thread is waiting "indefinitely" (or at least 16+ hours) on the UnwrapPromise returned by Task.Run.

I have been trying to get to the bottom of the question "can Task.Run ever deadlock?". Some say yes. Others (Herman Schoenfeld and stephentoub) says no. The only logical argument I can see why it would deadlock is if any continuation is put in the ThreadPool thread's local work queue (instead of the global work queue). I just feel like if that was the cause then it should deadlock much faster. So far, I've only seen one lockup over millions of PUT queries in the previous months. However If that was the cause of the deadlock, then I suppose starting the task with TaskCreationOptions.PreferFairness should alleviate any deadlock.

For now I have decided to ignore this issue until it may appear again, but from my point of view the proposed solutions in this issue still seem relevant: It is generally considered bad practice to do these kinds of "sync over async" workarounds, so it would make sense for this library to work towards avoiding it. If nothing else then just for the increased performance of not having to block threadpool threads.

sfc-gh-dszmolka commented 2 months ago

hi - appreciate all the additional details and digging deeper ! I agree we could look into avoiding the "sync over async" approach in general, and thus this is now an enhancement request for the team to consider (alongside with the 'async support for PUT) and if deemed feasible, then pick up and prioritize.