snowflakedb / gosnowflake

Go Snowflake Driver

SNOW-1313544 Parallel operations on different roles cause program to crash (race condition on temporal credential cache) #1092

Closed. Oracen closed this issue 4 months ago.

Oracen commented 5 months ago

Please answer these questions before submitting your issue. In order to accurately debug the issue this information is required. Thanks!

Linked issue: https://github.com/Snowflake-Labs/terraform-provider-snowflake/issues/2681

  1. What version of GO driver are you using? v1.7.5

  2. What operating system and processor architecture are you using? linux_amd64 (Fedora)

  3. What version of GO are you using?

1.19 (though I'm actually executing against Terraform v1.7.1)

4. Server version: Unable to check at this point in time, though it's a race condition on my local machine, so the server version isn't relevant.

  5. What did you do? See the linked issue here for full details: https://github.com/Snowflake-Labs/terraform-provider-snowflake/issues/2681

In short, the client appears to be writing concurrently to a shared map. This only happens when I am caching credentials for two different roles and there are many concurrent operations. My current guess is a race condition in which the two logins overwrite one another.
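
For context, Go's runtime treats unsynchronized concurrent writes to a map as a fatal, unrecoverable error rather than an ordinary panic. Below is a minimal sketch of that failure mode; the map and role names are purely illustrative and are not the driver's actual credential cache:

```go
// Two goroutines writing to the same unsynchronized map: the Go runtime
// detects this and aborts the whole process with
// "fatal error: concurrent map writes", which cannot be caught with recover().
package main

import "sync"

func main() {
	cache := map[string]string{} // stand-in for a shared credential cache

	var wg sync.WaitGroup
	for _, role := range []string{"SYSADMIN", "ACCOUNTADMIN"} {
		wg.Add(1)
		go func(role string) {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				cache[role] = "token-for-" + role // unsynchronized write
			}
		}(role)
	}
	wg.Wait()
}
```

This matches the `fatal error: concurrent map writes` crash reported in the linked issue.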

  6. What did you expect to see?

    What should have happened and what happened instead? No crash: either store the two sets of temporary credentials separately, or share the session and swap roles dynamically.

  7. Can you set logging to DEBUG and collect the logs?

    https://community.snowflake.com/s/article/How-to-generate-log-file-on-Snowflake-connectors

Will amend the issue with further logging tomorrow

  8. What is your Snowflake account identifier, if any? (Optional)
sfc-gh-dszmolka commented 5 months ago

hi and thank you for raising this issue - also for linking the Terraform Provider one + collecting the stacktrace, very helpful!

We already use an RWMutex to guard read, write, and delete operations on the temporary credential cache file, so since it obviously isn't working as intended here, it would be great to be able to reproduce this locally.
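
For illustration only (this is a minimal sketch, not the driver's actual implementation; the type and field names are hypothetical), the guarding pattern described above looks roughly like the code below. The key point is that the lock only protects the map if every read, write, and delete path acquires it; a single access elsewhere that bypasses the lock can still crash the process:

```go
// Sketch of an RWMutex-guarded credential cache. All access must go through
// the locked methods; an access path that skips the lock can still trigger a
// fatal concurrent map read/write fault at runtime.
package cache

import "sync"

type credCache struct {
	mu    sync.RWMutex
	creds map[string]string
}

func newCredCache() *credCache {
	return &credCache{creds: make(map[string]string)}
}

func (c *credCache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.creds[key]
	return v, ok
}

func (c *credCache) set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.creds[key] = value
}

func (c *credCache) remove(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.creds, key)
}
```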

Can you please share reproduction code, even a TF script, which when run leads to this issue? It would be very helpful for testing this issue and its eventual fix. Thank you in advance!

Oracen commented 5 months ago

I can share it with you privately...this is our main DB architecture that triggered the issue, so messing around with it for debugging purposes is a little bit scary. I'll send you an invite to a private repo, obvious rules about disclosure apply. Apologies to anyone following that we can't just let you into our core infra codebase :laughing:

I've thrown my email into the readme of that repo to establish a backchannel about details/debugging. It should be a "minimal reproduction" of the part that caused us concern.

sfc-gh-dszmolka commented 5 months ago

This complicates the issue a bit and I might need to confer with our Legal team too if there's really no other way.

Is it perhaps possible to put together a minimal, sanitised reproduction that doesn't include your actual infrastructure definition?

I would understand if there is absolutely no other way besides seeing your actual infra definition, but I would like to avoid that as long as possible and really have it as a last resort, just to prevent any possible future complications. Thank you for understanding!

sfc-gh-dszmolka commented 4 months ago

There was briefly a comment here which has since been deleted, although it had very good ideas. Based on it, I tried to reproduce the issue:

When running terraform apply / destroy / plan, the following happens:

Then I tried to run the terraform commands (e.g. plan) in a loop, which triggers the Snowflake connection and thus token creation, but I could not get to the point where the race condition causes the same file to be written and read at the same time and thus causes a panic.

So I'm probably missing something, and it would therefore be really helpful if I could get further details for the repro (or an actual repro that reproduces the issue without requiring access to a production infrastructure definition under an NDA :) ).

Oracen commented 4 months ago

I'll see what I can do, but there's the issue that I only have the designated Snowflake account to play with right now. I'll transition this to my personal machine to get a repro, as this is dicey territory with respect to what I can share.

sfc-gh-dszmolka commented 4 months ago

That would be really helpful, thank you in advance! As for accounts: you can always sign up for a fresh trial account with $400 of free usage (recommended), or if you're ORGADMIN in your organization, you can CREATE and DROP accounts for testing.

Oracen commented 4 months ago

I didn't realise we could drop accounts; I'll create a dummy account and see what I can do.

Oracen commented 4 months ago

I just upgraded our TF provider to 0.88.0 and confirmed the issue is still present there.

Full repro steps:

  1. Create a new account (Standard edition). Name it SANDBOX.
  2. Set the following flags:
    alter account set ALLOW_ID_TOKEN = true;
    alter account set ALLOW_CLIENT_MFA_CACHING = true;
  3. Create a new user to access the system, log in and enable Duo MFA.
  4. Get the Terraform code, discussed in detail below.
  5. cd into src/architecture.
  6. In src/architecture, create a file called secrets.tfvars. Fill in the following details.
    snowflake_organisation="AN_ACCOUNT"
    snowflake_username="test_user"
    snowflake_password="hunter2"
  7. Initialise the provider with `terraform init`.
  8. Create a workspace for your code, named the same as your account name (in this case, sandbox). The command is `terraform workspace new sandbox`; this will automatically select the new workspace.
  9. Run `terraform apply -var-file=secrets.tfvars`. The provider crashes.

With respect to code availability:

I invited you to a private repo containing sanitised Terraform config. It still contains IP in terms of the management structure, but shouldn't pose a legal issue due to our legal teams having signed a mutual NDA. As long as I'm not blasting it all over the internet, my employer won't mind a private repo.

Alternatively, if there really is no other option, I can further sanitise and attach a .zip file. I'd prefer to keep that file private, and I'll reach out via my account rep to get in touch with you directly.

Finally:

As part of the repro I managed to get a run with trace-level logging. The output is still mostly Terraform code, but it might hint at how the locking issue is happening. It's from a second failed run, where I remembered to enable logging; between you and the Terraform team it should give more context. Trying to send it as a gist kept breaking GitHub, so apologies for the zip.

errors.json.zip

sfc-gh-dszmolka commented 4 months ago

Thank you for all the details. I tried to accept the invite, but by now it has expired, so I came up with the following (after enabling MFA for both users in both accounts), inspired by the user who deleted their comment:

```hcl
terraform {
  required_version = "~> 1.3"

  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "0.88.0"
    }
  }
}

provider "snowflake" {
  alias         = "root"
  role          = "SYSADMIN"
  account       = "myorg-myaccount"
  user          = "admin"
  password  = "password"
  authenticator  = "UsernamePasswordMFA"
  client_store_temporary_credential = true
  client_request_mfa_token = true
}

provider "snowflake" {
  alias         = "dev"
  account       = "myorg-myaccount2"
  authenticator = "UsernamePasswordMFA"
  role          = "ACCOUNTADMIN"
  user          = "admin2"
  password  = "password"
  client_store_temporary_credential = true
  client_request_mfa_token = true
}

variable "database_names" {
  type    = list(string)
  # 1000-element array ;)
  default = ["testing10" ,"testing11" ,"testing12" .. ,"testing1009" ,"testing1010"]
}

resource "snowflake_database" "simple_root_loop" {
  for_each                 = toset(var.database_names)
  name                     = each.value
  comment                     = "test comment"
  data_retention_time_in_days = 1

  provider = snowflake.root
}

resource "snowflake_database" "simple_dev_loop" {
  for_each                 = toset(var.database_names)
  name                     = each.value
  comment                     = "test comment"
  data_retention_time_in_days = 1

  provider = snowflake.dev
}
```

This reproduced the problem: gosnowflake crashed with the same `fatal error: concurrent map writes` as in your error report.

We'll investigate, and thank you again for all the details!

Oracen commented 4 months ago

No worries, sorry I couldn't be more helpful! The bug couldn't have come at a worse time, schedule-wise :laughing:

sfc-gh-dszmolka commented 4 months ago

The gosnowflake part is fixed in https://github.com/snowflakedb/gosnowflake/pull/1103, which will be part of the April 2024 release towards the end of the month.

Once it's released, the Snowflake Terraform Provider can be rebased onto it and also needs a new release to carry the fixed gosnowflake.
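
As a hedged aside (the linked PR is the authoritative change, and the sketch below is not taken from it), this class of regression is typically guarded against with a test that exercises the concurrent path under Go's race detector, run with `go test -race`:

```go
// Hypothetical regression-style test: cache tokens for two roles from
// concurrent goroutines. With the mutex in place it passes; if the locking
// were removed, `go test -race` would flag the data race, and the unguarded
// map writes could crash the test binary outright.
package cache_test

import (
	"sync"
	"testing"
)

func TestConcurrentTokenWrites(t *testing.T) {
	var mu sync.Mutex
	tokens := make(map[string]string)

	var wg sync.WaitGroup
	for _, role := range []string{"SYSADMIN", "ACCOUNTADMIN"} {
		wg.Add(1)
		go func(role string) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				mu.Lock()
				tokens[role] = "cached-token"
				mu.Unlock()
			}
		}(role)
	}
	wg.Wait()

	if len(tokens) != 2 {
		t.Fatalf("expected 2 cached entries, got %d", len(tokens))
	}
}
```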

sfc-gh-dszmolka commented 4 months ago

Released with the April 2024 release cycle, in version 1.10.0.