snowflakedb / snowflake-connector-nodejs

NodeJS driver
Apache License 2.0
121 stars 129 forks source link

SNOW-1737830: Sporadic error seen in AWS lambda after upgrading to 1.14.0 driver #936

Open spdaley opened 1 day ago

spdaley commented 1 day ago

Info

  1. What version of NodeJS driver are you using?
"snowflake-sdk": "1.14.0"
  1. What operating system and processor architecture are you using?
AWS Lambda Linux Instances
  1. What version of NodeJS are you using?
Node 20
  1. What are the component versions in the environment (npm list)?
too many to list?
  1. What did you do?
Unfortuantely, I don't have a complete runnable program for you. This was just something that started happening in our logs after ugprading to 1.14.0. Our application runs in AWS Lambda. We generally try to preserve the snowflake connections across lambda invocations. This typically works just fine. After upgrading to 1.14.0 we started noticing very sporadic errors in the log which say:

Encountered an error when sending the request. Details: {}

There were not a large number of these but enough to set off our error monitoring. Also, the ones we found did not seem to be causing any issues.

From the logs it looks like when some of the lambda invocations start, a snowflake heartbeat event will get started and if the lambda finishes quickly we end up seeing an error in a subsequent invocation. See logs below.

Note, one other thing is these lambdas have multiple data sources that they deal with and sometimes do not actually talk to snowflake during an invocation. I believe all of the incidents where I saw this happen were during executions that never used snowflake in the end.
  1. What did you expect to see?
No error
  1. Can you collect debug logs?
We have some logs now. I've included when the lambdas started and finished in the logs but I've pruned out everything else we were doing. 

"2024-10-10T16:10:09.234Z","REPORT RequestId: 8c70ef49-88c6-4792-ad95-4345f126859d  Duration: 87.92 ms  Billed Duration: 88 ms  Memory Size: 1024 MB    Max Memory Used: 468 MB"
"2024-10-10T16:10:09.234Z","END RequestId: 8c70ef49-88c6-4792-ad95-4345f126859d"
"2024-10-10T16:10:09.209Z","[4:10:09.209 PM]: Encountered an error when sending the request. Details: {}"
"2024-10-10T16:10:09.146Z","START RequestId: 8c70ef49-88c6-4792-ad95-4345f126859d Version: 2842"
"2024-10-10T16:10:07.442Z","REPORT RequestId: 5d5b2ef4-3a97-4334-87bb-a5696f10c641  Duration: 92.46 ms  Billed Duration: 93 ms  Memory Size: 1024 MB    Max Memory Used: 468 MB"
"2024-10-10T16:10:07.442Z","END RequestId: 5d5b2ef4-3a97-4334-87bb-a5696f10c641"
"2024-10-10T16:10:07.436Z","[4:10:07.436 PM]: OCSP validation disabled for cn58973.us-east-2.aws.snowflakecomputing.com"
"2024-10-10T16:10:07.433Z","[4:10:07.433 PM]: CALL POST with timeout 90000: https://CN58973.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=736de361-4d9e-495e-b870-47787f3b0370"
"2024-10-10T16:10:07.433Z","[4:10:07.433 PM]: Get agent with id: https://cn58973.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"
"2024-10-10T16:10:07.398Z","[4:10:07.398 PM]: OCSP validation disabled for cn58973.us-east-2.aws.snowflakecomputing.com"
"2024-10-10T16:10:07.396Z","[4:10:07.396 PM]: CALL POST with timeout 90000: https://CN58973.us-east-2.aws.snowflakecomputing.com/session/token-request"
"2024-10-10T16:10:07.396Z","[4:10:07.395 PM]: Get agent with id: https://cn58973.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"
"2024-10-10T16:10:07.355Z","[4:10:07.355 PM]: OCSP validation disabled for cn58973.us-east-2.aws.snowflakecomputing.com"
"2024-10-10T16:10:07.353Z","[4:10:07.350 PM]: CALL POST with timeout 90000: https://CN58973.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=736de361-4d9e-495e-b870-47787f3b0370"
"2024-10-10T16:10:07.353Z","[4:10:07.350 PM]: Get agent with id: https://cn58973.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"
"2024-10-10T16:10:07.353Z","[4:10:07.349 PM]: Issuing heartbeat call"
"2024-10-10T16:10:07.349Z","START RequestId: 5d5b2ef4-3a97-4334-87bb-a5696f10c641 Version: 2842"

As stated above, this run didn't actually use the snowflake connection at all. It was performing an action against a different data source. We were running the 1.13.0 driver for a few weeks without seeing this. It was only after upgrading to 1.14.0 that we started to seeing.

I'm sorry it's not much to go on. I noticed in the release notes for 1.14.0 there was a fix to the heartbeat callback handler. Not sure if this is related or not but it was the only thing noticeable I saw in this path.

sfc-gh-dszmolka commented 1 day ago

hi and thank you for raising this issue! one notable change in 1.14.0 regarding the heartbeats was https://github.com/snowflakedb/snowflake-connector-nodejs/pull/908. Obviously we could just revert that in a patch to test if that single change fixes the issue for you, but then we would not get any details about the actual error it causes.

Unfortunately it seems during JSON.stringify-ing the error, it gets lost :( So I made a small patch with patch-package which does nothing else, just changes how we log the error thrown.

snowflake-sdk+1.14.0.patch

Would it be possible for you to:

The same way we can revert #908 too in a subsequent step to get proof it caused the error (or the opposite) but if we were doing it right now, there would be no details of the errors it causes, so I'm hoping we can do it in two steps. Thank you in advance !

spdaley commented 1 day ago

hi and thank you for raising this issue! one notable change in 1.14.0 regarding the heartbeats was #908. Obviously we could just revert that in a patch to test if that single change fixes the issue for you, but then we would not get any details about the actual error it causes.

Thanks. I should be able to do that. It may take a day or so but I'll let you know once I get results.

spdaley commented 14 hours ago

We've got this error a few more times now since I deployed the patched snowflake. Here's the additional log we get now:

[12:18:59.899 AM]: TypeError: conn.getId is not a function
    at /var/task/resolvers/node_modules/@oort-dev/cnt-services-infra/dist/src/common/SnowflakeConnection.js:41:63
    at options.callback (/opt/nodejs/node_modules/snowflake-sdk/lib/services/sf.js:1333:28)
    at Object.callback (/opt/nodejs/node_modules/snowflake-sdk/lib/services/sf.js:676:43)
    at /opt/nodejs/node_modules/snowflake-sdk/lib/http/base.js:38:24
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  request:
sfc-gh-dszmolka commented 9 hours ago

thank you so much @spdaley, this helps! before going into disabling configuration which worked perfectly for you prior to installing 1.14.0 with #908 , do you have some time to try one more patch ?

snowflake-sdk+1.14.0.patch

This is the same as before (adding/fixing logging) plus reverting #908 . Nothing else was changed. So with this patch, it should be very easy to get proof if really the aforementioned PR caused the problems for you or not. If the errors cease to happen with 1.14.0 + patch which reverts 908, we should have a good indication where it comes from.

Thank you !

spdaley commented 6 hours ago

I can give that patch a shot too. One thing I didn't notice before is that our code seems to be where that error originates. Line 41 of that file looks like this:

            localConnection.connect((err, conn) => {
                if (err) {
                    reject(err);
                }
                else {
                    logger.debug('Connection Success: ', conn.getId()); <------ this is line 41
                    ...
                }
            });

None of this code has changed between 1.13.1 and 1.14.0 though. We've only been changing the driver. Is this the callback that is getting passed to the heartbeat now? Is the callback being executed without the conn? Or with something else? Or is there a bug in our callback? It pretty much follows your docs other than we are using arrow functions instead of a regular function.

spdaley commented 47 minutes ago

I have a stand-alone script that reproduces it now. Do the following:

  1. npm install snowflake-sdk 1.14.0
  2. Modify your installed environment and apply the first patch which logs the exception
  3. Optional Modify node_modules/snowflake-sdk/lib/connection/connection.js and change line 179 (give or take) to self.keepalive = setInterval(self.heartbeat, 10 * 1000, callback); ... note I did this because it looks like the minimum heartbeat frequency is 900 seconds? I got tired of waiting so just hard-coded it to 10. The same issue happens with 900 seconds though.
  4. Modify the following script and set the SNOWFLAKE_ values to something appropriate for your env
  5. Run it with node
  6. You should get output like this:
    {"level":"TRACE","message":"[10:38:34.003 AM]: CALL POST with timeout 90000: https://SF13288.us-east-2.aws.snowflakecomputing.com/queries/v1/query-request?requestId=8200ecbd-debe-4bca-9d3c-e5f34aa6680a"}
    {"level":"TRACE","message":"[10:38:34.091 AM]: socket reused = true"}
    Failed to execute statement due to the following error: SQL compilation error:
    Object 'YOUR_TABLE' does not exist or not authorized.
    {"level":"DEBUG","message":"[10:38:44.001 AM]: Issuing heartbeat call"}
    {"level":"TRACE","message":"[10:38:44.003 AM]: Get agent with id: https://sf13288.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"}
    {"level":"TRACE","message":"[10:38:44.004 AM]: CALL POST with timeout 90000: https://SF13288.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=9697869b-8e50-41bd-9803-165bb51ec236"}
    {"level":"TRACE","message":"[10:38:44.099 AM]: socket reused = true"}
    {"level":"DEBUG","message":"[10:38:44.170 AM]: Encountered an error when sending the request. Details: "}
    {"level":"DEBUG","message":"[10:38:44.174 AM]: TypeError: conn.getId is not a function\n    at /Users/seadaley/workspace/test2/test-backup.js:52:85\n    at options.callback (/Users/seadaley/workspace/test2/node_modules/snowflake-sdk/lib/services/sf.js:1333:28)\n    at Object.callback (/Users/seadaley/workspace/test2/node_modules/snowflake-sdk/lib/services/sf.js:676:43)\n    at /Users/seadaley/workspace/test2/node_modules/snowflake-sdk/lib/http/base.js:38:24\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {\n  req"}

    For this test I didn't actually set a valid table because I didn't really care. I just wanted something in there to let me know the connection was working.

The script

const snowflake = require('snowflake-sdk');

snowflake.configure({
  logLevel: "DEBUG"
});

// Constants for connection
const SNOWFLAKE_ACCOUNT = 'your_account';
const SNOWFLAKE_USER = 'your_username';
const SNOWFLAKE_PASSWORD = 'your_password';
const SNOWFLAKE_DATABASE = 'your_database';
const SNOWFLAKE_SCHEMA = 'your_schema';
const SNOWFLAKE_WAREHOUSE = 'your_warehouse';
const SNOWFLAKE_ROLE = 'your_role';
const SNOWFLAKE_TABLE = 'your_table';

// Create a Snowflake connection object
const connection = snowflake.createConnection({
  account: SNOWFLAKE_ACCOUNT,
  username: SNOWFLAKE_USER,
  password: SNOWFLAKE_PASSWORD,
  database: SNOWFLAKE_DATABASE,
  schema: SNOWFLAKE_SCHEMA,
  warehouse: SNOWFLAKE_WAREHOUSE,
  role: SNOWFLAKE_ROLE,
  clientSessionKeepAlive: true
});

// Connect to Snowflake and return a promise
function connectToSnowflake() {
  console.log('Connecting to the DB');
  return new Promise((resolve, reject) => {
    connection.connect((err, conn) => {
      if (err) {
        reject('Unable to connect: ' + err.message);
      } else {
        console.log(`Successfully connected to Snowflake with connection ID: ${conn.getId()}`);
        resolve(conn);
      }
    });
  });
}

// Function to execute the query
function executeQuery() {
  return new Promise((resolve, reject) => {
    const query = `SELECT COUNT(*) FROM ${SNOWFLAKE_TABLE}`;

    connection.execute({
      sqlText: query,
      complete: (err, stmt, rows) => {
        if (err) {
          reject('Failed to execute statement due to the following error: ' + err.message);
        } else {
          console.log('Query successfully executed.');
          console.log('Count: ', rows[0]['COUNT(*)']);
          resolve();
        }
      }
    });
  });
}

// Main function to connect and then execute the query
async function main() {
  try {
    await connectToSnowflake();
    await executeQuery();
  } catch (error) {
    console.error(error);
  }
}

// Run the main function
main();

process.stdin.resume();

Other note, if I apply the second patch which undoes the callback/this heartbeat change, and I run this script, I get this output:

{"level":"DEBUG","message":"[10:41:08.364 AM]: Issuing heartbeat call"}
{"level":"TRACE","message":"[10:41:08.366 AM]: Get agent with id: https://sf13288.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"}
{"level":"TRACE","message":"[10:41:08.367 AM]: CALL POST with timeout 90000: https://SF13288.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=72248e8d-6a5e-4e0e-bc66-74cb035173fb"}
{"level":"TRACE","message":"[10:41:08.451 AM]: socket reused = true"}
{"level":"DEBUG","message":"[10:41:08.520 AM]: Heartbeat response {\"data\":null,\"code\":null,\"message\":null,\"success\":true}"}