Open heliobmartins opened 1 year ago
I wonder if this is actually a continuation from the problem reported on: https://github.com/hashicorp/vault/issues/18323#issue-1492928626
If we compare the issue we were having before (the core
was panicking) right after the log 2023-02-15T03:26:22.003Z [DEBUG] core: successfully dispensed v5 backend plugin: name=slauth
.
Now, instead of panicking, it looks like the core
is killing the external plugin (see: 2023-02-15T03:26:26.148Z [DEBUG] core: killed external plugin process: path=/etc/vault/plugins/slauth pid=161
).
Please, let me know if you need any extra information.
I recall reading somewhere that there were changes in Vault 1.12 which mean plugins may now be started briefly just so Vault can retrieve some metadata about them. Therefore, I wouldn't take the fact it is being started and then killed soon after as necessarily an indication that there is any problem.
Although... maybe this is itself the problem. It may be that your plugin is attempting initialization that writes to storage, when Vault only wants to run it briefly to query metadata, and storage writing is not enabled.
I am seeing in your logs that it appears to be attempting an AppRole login, and then exactly 60 seconds later that is timing out:
2023-02-14T23:27:54.462Z [DEBUG] auth.slauth.auth_slauth_8b16d672.slauth.slauth: trying to auth with app role parameters: app_role_id=a05d3ed0-136f-16a2-985a-6ce0122397e5 app_role_secret_id=dff88c9c-805b-8927-3318-f1389816e5dc timestamp=2023-02-14T23:27:54.461Z
2023-02-14T23:28:54.171Z [TRACE] auth.slauth.auth_slauth_8b16d672.slauth: setup: transport=gRPC status=finished err="rpc error: code = Canceled desc = context canceled" took=59.736892012s
I believe you might find that you now need to move some initialization logic to not run when the plugin is initialized, but when it first receives a request, to remedy this.
Hello @maxb , thanks for your response.
Yes, as pointed out, we are indeed injecting the role-id
and secret-id
when enabling the plugin, but the root cause for the timeout doesn't seem to be at the AppRole Login (I think). Previously, because of this 60s timeout (we were indeed getting an invalid response saying that the secret-id
was invalid, so we tried to overcome this problem by increasing the timeout from 1m
to 3m
. You should see in the logs that the token creation did succeed.
Throughout our analyse, we identified that the context is hanging after invoking, vault auth enable ...
. The plugin sends a request to res, err := vaultClient.Write("auth/approle/login", Data{"role_id": appRoleId, "secret_id": appRoleSecretId})
and it hangs for 60s here.
For reference, vaultClient.write
is:
func (c *defaultVaultClient) Write(path string, data Data) (*api.Secret, error) {
return c.client.Logical().Write(path, data) // Here is where I suspect it hangs for 60s.
}
and we configured our client as:
func NewClient(vaultAddr string, token string) (VaultClient, error) {
vaultUrl, err := url.Parse(vaultAddr)
if err != nil {
return nil, errors.Wrapf(err, "incorrect vault address provided")
}
clientCfg := api.DefaultConfig()
clientCfg.Address = vaultAddr
clientCfg.Timeout = time.Second * 200 // This was added to attempt overcome the 60s issue.
client, err := api.NewClient(clientCfg)
if err != nil {
return nil, errors.Wrapf(err, "failed to init vault client")
}
client.SetToken(token)
return &defaultVaultClient{
addrUrl: vaultUrl,
client: client,
}, nil
}
You should note that even after increasing the Client timeout to 200s it still failed after 1m30s.
2023-02-15T05:14:00.742Z [DEBUG] auth.slauth.auth_slauth_c56f54c5.slauth.slauth: trying to auth with app role parameters: app_role_id=46316cf6-3404-31ce-8cdb-80e43497341a app_role_secret_id=3f775f54-ae93-3a27-41c4-18236aa45dbc timestamp=2023-02-15T05:14:00.742Z
2023-02-15T05:15:30.490Z [TRACE] auth.slauth.auth_slauth_c56f54c5.slauth: setup: transport=gRPC status=finished err="rpc error: code = Canceled desc = context canceled" took=1m29.772511352s
2023-02-15T05:15:30.490Z [ERROR] secrets.system.system_27514380: error occurred during enable credential: path=slauth/ error="rpc error: code = Canceled desc = context canceled"
2023-02-15T05:15:30.492Z [DEBUG] identity: creating a new entity: alias="id:\"0f81f18e-1a55-0340-01a4-c63091414fb2\" canonical_id:\"bdea316e-3fe9-31f7-3796-49b2ceb0a093\" mount_type:\"approle\" mount_accessor:\"auth_approle_ee789955\" mount_path:\"auth/approle/\" metadata:{key:\"role_name\" value:\"slauth\"} name:\"46316cf6-3404-31ce-8cdb-80e43497341a\" creation_time:{seconds:1676438130 nanos:492014691} last_update_time:{seconds:1676438130 nanos:492014691} namespace_id:\"root\" local_bucket_key:\"packer/local-aliases/buckets/148\""
I hope this complementary information is useful.
Ah, OK. So in your previous logs, there were multiple attempts with different iterations of the code, and I just stopped at the first that seemed to correlate with the symptoms you mentioned. EDIT: No, I see now that the logs in your most recent comment supersede the attached file.
So, the real issue here, is why is your request for an AppRole login taking an incredibly long time.
Fundamentally, an AppRole login is just a few reads/writes to Vault's internal storage. If that's taking anywhere close to a minute to complete (EDIT: especially given you're using inmem
storage!), that's where debug/fix efforts need to be focussed.
Are you able to log in using the AppRole RoleID and SecretID that the plugin is trying to use, if you manually make the login request yourself, outside of the plugin?
I believe the 1m30s timeout you are now observing, is probably the default timeout applied to all requests by the Vault server - i.e. since you have increased the client timeout beyond this, you now see the Vault server deciding to time out your requests.
Just out of curiosity... how are you getting the RoleID and SecretID into the plugin? AFAIK you usually have to enable a plugin, and then configure it in a separate request to a config endpoint?
Are you able to log in using the AppRole RoleID and SecretID that the plugin is trying to use, if you manually make the login request yourself, outside of the plugin?
Yes! In fact, that's how I figured that the secret-id was expiring. I was able to successfully create the login token using the role-id
and secret-id
(and it was pretty quick tbh, less than a second actually).
Just out of curiosity... how are you getting the RoleID and SecretID into the plugin? AFAIK you usually have to enable a plugin, and then configure it in a separate request to a config endpoint?
For our local environment we are running this in a docker container, right. There is a script that is invoked with the docker-compose
in this case.
The TL;DR is:
vault plugin register -sha256=$SHA_256 -command=slauth auth slauth
b. We can confirm that it is registered successfully due to: Success! Registered plugin: slauth
vault policy write slauth-plugin /vault/slauth-plugin.hcl
b. Which also works due to: Success! Uploaded policy: slauth-plugin
approle
and we extract the role-id
and secret-id
, which will later be used to issue Vault tokens with the policy slauth-plugin
, allowing it to perform operations in the backend.vault auth enable -path=slauth \
-options=app_role_role_id=${ROLE_ID} \
-options=app_role_secret_id=${SECRET_ID} slauth
This is the operation that actually hangs in the backend. Which I described to you before.
Once again @maxb , thanks so much for trying to assist.
OK... I think the key question remains "Why is the AppRole login hanging?"... unfortunately I'm running out of ideas on how to investigate that further.
One thing that might be interesting, just to eliminate things, would be take Docker and the associated docker-compose environment, and your local Vault configuration out of the picture.
A Vault server can be started in dev mode with plugins enabled and full logging just by running
VAULT_LOG_LEVEL=trace vault server -dev -dev-plugin-dir=/path/to/dir
When you do this, it automatically scans the directory and registers plugins in it so you skip the requirement to use vault plugin register ...
, so you can go straight to running:
vault auth enable approle
vault write auth/approle/role/slauth token_ttl=24h
ROLE_ID=$(vault read auth/approle/role/agent/role-id -format=json | jq -r .data.role_id)
SECRET_ID=$(vault write -f auth/approle/role/agent/secret-id -format=json | jq -r .data.secret_id)
vault auth enable -options=app_role_role_id=${ROLE_ID} -options=app_role_secret_id=${SECRET_ID} slauth
I've skipped assigning any policy to the test approle here, since all we really care about is whether the login works.
I'm not sure if this will be fruitful or not, but hopefully it should be fairly easy to try out, just in case.
Beyond that, I think your next step should be to remove your AppRole login from executing during the plugin InitializeFunc
, and instead log in as needed when handling user requests. Even if we don't know exactly why it is hanging
The rest of my reply is not directly related to the issue at hand, it's just reactions to a couple of the background details:
On the use of vault auth enable -options=
for auth method config: The majority of the Vault plugins that I have worked with, tend to use a separate config
endpoint for specifying configuration after a plugin has been enabled in a separate step. I would suggest migrating your plugin to the same approach, as the options map is not intended for storing confidential information. With this current architecture, you can't give your users the ability to do something as usually low-privileged as check what auth methods exist - vault auth list
without revealing the approle secret_id.
On the need for a Vault plugin to itself log in to Vault: I too have encountered this need, but it's pretty unusual, as it is only needed when one Vault plugin wants to consume other Vault APIs as part of its operation. It's particularly unusual for an auth method, as usually they just need to confirm the user's credentials and return a confirmed username and metadata to Vault. Are you able to indulge my curiousity on what this plugin needs to call back in to Vault for?
Hello again @maxb, (sorry for the late reply)
OK... I think the key question remains "Why is the AppRole login hanging?"... unfortunately I'm running out of ideas on how to investigate that further.
Yes, I reckon that's what we should probably be looking at. I'm also start to run out of ideas. :(
After adding a little bit more of logs, I noticed vault auth enable
seems to be returning a 400 response.
Error enabling slauth auth: Error making API request.
URL: POST $VAULT_ADDR/v1/sys/auth/slauth
Code: 400. Errors:
* rpc error: code = Canceled desc = context canceled
In regards to your suggestion of testing the plugin out of the docker context, I reckon it would be a bit harder to do, because the plugin does expect the role-id
and secret-id
to be provided, otherwise, it will fail the initialisation. Is there any way we can lazy load the plugin on vault server -dev
? I tried registering but then it complained with could not set plugin, plugin directory is not configured
.
Are you able to indulge my curiousity on what this plugin needs to call back in to Vault for?
Hmmmm, I'm not sure I follow the question here... Based on the policy attached to slauth-plugin
I can see that the main responsibilities for this plugin is to manage the ACL Policies, Entities, Groups, Secrets and also renew, revoke, and lookup-self tokens.
I'll continue trying to dig on my end on this issue, but I'm also running out of ideas at this stage. :(
After adding a little bit more of logs, I noticed
vault auth enable
seems to be returning a 400 response.
Yes, that would be expected given the previously shared logs.
In regards to your suggestion of testing the plugin out of the docker context, I reckon it would be a bit harder to do, because the plugin does expect the
role-id
andsecret-id
to be provided, otherwise, it will fail the initialisation.
Um, yes, but that's why I wrote out commands to create and supply those in my previous comment?
Yep, but what I mean is that the server itself won't even be up when I run: VAULT_LOG_LEVEL=trace vault server -dev -dev-plugin-dir=/path/to/dir
. It crashes with:
The server won't even be started successfully. See logs attached. logs.txt
I have also tried increasing max_request_duration
(default 90s) to 600s. It still fails. Also, during these latests tests I've been running, I'm no longer using approles
, instead, I'm using the Root Token.
I noticed that even methods the methods logical.read they are hanging. Very confusing.
Example:
//vault_client.go (used by our custom plugin)
func (c *defaultVaultClient) Read(path string) (*api.Secret, error) {
log.Println("I'm calling read method.")
l := c.client.Logical()
log.Println("I've passed the logical stuff.")
log.Println(path)
log.Println("Submitting the request.")
result, err := l.Read(path)
log.Println("Request completed, printing result and err and returning.")
log.Println(result)
log.Println(err)
return result, err
}
// vault api logical.go
func (c *Logical) Read(path string) (*Secret, error) {
return c.ReadWithDataWithContext(context.Background(), path, nil)
}
The final logs were printed as follow:
2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 I'm calling read method.
2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 I've passed the logical stuff.
2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 auth/token/lookup-self
2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 Submitting the request.
2023-02-16T00:53:51.828Z [TRACE] activity: writing segment on timer expiration
2023-02-16T00:53:57.227Z [TRACE] auth.slauth.auth_slauth_aa04ea8d.slauth: setup: transport=gRPC status=finished err="rpc error: code = Canceled desc = context canceled" took=9m59.769782398s
2023-02-16T00:53:57.228Z [ERROR] secrets.system.system_34fc058a: error occurred during enable credential: path=slauth/ error="rpc error: code = Canceled desc = context canceled"
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing accessors: prefix=accessor/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing role HMACs: prefix=secret_id/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing secret ID HMACs: role_hmac=f3ebf621f450a93b0d8c82e3b1dccc533f7750c8a266827584c9a0b78ee4cc3c/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: found expired secret ID
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing accessors: prefix=accessor_local/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing role HMACs: prefix=secret_id_local/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: done checking entries: num_entries=1
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:53:57 Request completed, printing result and err and returning.
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:53:57 &{eda1fd67-f03c-ebb5-114f-1b7d4de18144 0 false map[accessor:3O3NNBkBDsA2xUV1fQUQ6tBk creation_time:1676508231 creation_ttl:0 display_name:root entity_id: expire_time:<nil> explicit_max_ttl:0 id:hvs.7YRkxKx5tpPriVlPJ6OxJRln meta:<nil> num_uses:0 orphan:true path:auth/token/root policies:[root] ttl:0 type:service] [] <nil> <nil>}
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:53:57 <nil>
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: trying to store token in the internal storage: timestamp=2023-02-16T00:53:57.231Z
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: error was: : EXTRA_VALUE_AT_END="failed to store vault token: rpc error: code = Canceled desc = context canceled" timestamp=2023-02-16T00:53:57.231Z
@maxb, a quick update.
I ended up cloning vault
repository and started adding some logs for vault/api/logical
and also vault/api/client
. I noticed that the requests are being hanged at: https://github.com/hashicorp/vault/blame/main/api/client.go#L1391, which is due to the call at https://github.com/hashicorp/go-retryablehttp/blob/master/client.go#L628.
If I figured anything else I will update here, but that's what I have got so far. :(
Yep, but what I mean is that the server itself won't even be up when I run:
VAULT_LOG_LEVEL=trace vault server -dev -dev-plugin-dir=/path/to/dir
.
I do not understand... of course the server itself won't be up until you run the command to start the server... that is what the vault server -dev
command does.
It crashes with:
Did you mean to paste something here?
See logs attached.
From the posted logs, I am more convinced than ever that your plugin is performing initialization actions too soon.
Do what I said in an earlier comment:
Beyond that, I think your next step should be to remove your AppRole login from executing during the plugin
InitializeFunc
, and instead log in as needed when handling user requests.
Howdy @maxb , 👋
Did you mean to paste something here?
I ended up attaching the logs and forgot to remove it, sorry about that 😭 .
From the posted logs, I am more convinced than ever that your plugin is performing initialization actions too soon.
I'm still trying to get my head around this to be honest. Was there any change surround plugin initialisations that we should be aware of and I might be missing it? I don't think it will be easy for us to refactor it, I'd be happy to have a look at it, but I'm quite not 100% sure if this will solve the issue.
Do what I said in an earlier comment:
In regards to your previous comment, I'm sort of confused if removing the App Role login would help with anything. One of my previous comments, I mentioned that I left the App Role Login mechanism aside and replaced it with the Root Token. Which means that we create vault client
with the root token
and even for operations such as _, err = vaultClient.Read("auth/token/lookup-self")
is still being stuck.
Also, I was not able to find any InitializeFunc
being called.... The initialisation of the plugin seems to be being done as recommended?
factory := slauth.Factory{Version: Version}
if err := plugin.Serve(&plugin.ServeOpts{
BackendFactoryFunc: factory.Factory,
TLSProviderFunc: tlsProviderFunc,
}); err != nil {
logger := hclog.New(&hclog.LoggerOptions{})
logger.Error("plugin shutting down", "error", err)
os.Exit(1)
}
There is something super odd happening when invoking the http request. For some reason it just gets stuck. I'll try to revisit the initialisation of the plugin, but at this stage, I'm no longer sure what I should try.
Once again, thanks for trying to assist.
Can you share more of the source code? Ideally as much as you can, and just redact any particularly internal bits, but definitely slauth.Factory.Factory
and anything it calls.
@heliomartins Hello!
What version of the Vault sdk and api are you using?
Vault 1.12 did introduce a change to how plugins are loaded. Prior to Vault 1.12 plugins are lazy loaded on startup. This means that plugin processes are killed after a successful mount and then respawned when a request is routed to them. Vault 1.12 introduced AutoMTLS for secrets/auth plugins so we do not lazy load them on startup anymore.
I will get this added to our 1.12 upgrade guide ASAP.
@maxb , unfortunately I won't be able to share all the contents of the plugin, but I extracted as much as I could related to plugin initialisation. I hope this helps. snippet-plugin-init.txt
@fairclothjm , thanks for jumping in: I reckon it would be interesting to understand a bit more about this. =) If you have any draft of that it would be amazing to give it a shot.
We are trying to update to the latest version 1.12.3
.
github.com/hashicorp/vault/api v1.9.0
github.com/hashicorp/vault/sdk v0.8.1
Ah, I see you're already trying to detect the plugin being launched in metadata mode and disable certain initialization in that case.
I suspect the issue is that this is no longer enough of a workaround.
I believe there's kind of an unwritten rule that Vault plugins should only be filling in the fields of the logical.Backend
in the Factory
func, and definitely not doing any initialization with external dependencies.
A relatively simple refactor that would allow you to prove this, would be to move your clientFactory/SysLink initialization to a helper method of your AuthBackend
, and call that method, protected by a sync.Once
that you also add to the AuthBackend
, from each request callback.
That would ensure it only gets triggered once Vault routes a request which needs it to your plugin.
I don't suppose you would have any example for this scenario would you? Or even pointing to any documentation that perhaps could help me attempt this migration?
I did some other tests Today, and I noticed that if we invoke those vault
requests using the CLI version of 1.11.7
for example, the initialisation of the plugin succeeds.
I'm attaching the logs with the difference between the initialisation using vault cli 1.11.7
and vault cli 1.12.3
. Note that we have just changed the CLI version, we are still trying to build Vault with 1.12.3
version.
logs-vault-cli-1-11-7.txt
logs-vault-cli-1-12-3.txt
The following logs does seem interesting:
2023-02-17T02:35:45.964Z [INFO] auth.slauth.auth_slauth_ec4c2744.slauth: configuring client automatic mTLS
2023-02-17T02:35:46.006Z [INFO] auth.slauth.auth_slauth_ec4c2744.slauth.slauth: configuring server automatic mTLS: timestamp=2023-02-17T02:35:46.006Z
I wonder if there is anything we need to do to handle this auto mTLS thing. @fairclothjm, is there any documentation guiding on how to do that?
One more update:
Max, I was able to perform the test (disregarding the docker compose) by doing:
/tmp/vault/plugins
plugin_directory = "/private/tmp/vault/plugins"
(private seems to be required for MacOS)/tmp/vault/plugins
VAULT_LOG_LEVEL=trace vault server -dev -dev-root-token-id="root" -config=/tmp/vault.hcl
SHASUM=$(shasum -a 256 "/tmp/vault/plugins/slauth" | cut -d " " -f1)
vault plugin register -sha256="${SHASUM}" -command="slauth" auth slauth
vault auth enable approle
vault write auth/approle/role/slauth token_ttl=24h
ROLE_ID=$(vault read auth/approle/role/slauth/role-id -format=json | jq -r .data.role_id)
SECRET_ID=$(vault write -f auth/approle/role/slauth/secret-id -format=json | jq -r .data.secret_id)
vault auth enable -options=app_role_role_id=${ROLE_ID} -options=app_role_secret_id=${SECRET_ID} -options='vault_addr=http://127.0.0.1:8200' slauth
.The problem still persists. :(
One of my teammates suggested trying to get a crash dump where it is hanging. See the coroutines crash logs. Not sure if this will help though. goroutines-log.txt
At this stage, I'm not sure what else I should try.
Here is a toy example of what I'm suggesting you do: https://github.com/maxb/simple-vault-plugin/blob/main/main.go
@heliomartins
I wonder if there is anything we need to do to handle this auto mTLS thing
I think @maxb is right on the money here. I would suggest trying out his proposed changes to your plugin. Maybe you could move your initialization code to the InitializeFunc which you would pass to the framework.Backend
in your Factory function. That might remove the need for the sync.Once protection and simplify things for your refactor.
Max is correct that plugins should not be doing any initialization with external dependencies in the Factory function. I will also add that having a dependency on MetadataMode in your plugin is not recommended. If Vault ever removes the concept of MetadataMode then you will again run into issues.
@maxb Thank you for all the help here!
I updated my example to use InitializeFunc
rather than sync.Once
(https://github.com/maxb/simple-vault-plugin/commit/0712bbc2035a33411fcd56612089962c61244705).
I went with sync.Once
because I was paralleling the way that framework.Backend
handles initialization of its path regexes, but it appears that was initially written long ago before InitializeFunc
was added, so I guess that's why it doesn't rely on that functionality.
Hello folks, I hope you all had a good weekend.
@maxb, thank you very much for sharing this snippet. I'm going to have a look into this. In the meanwhile, when you or @fairclothjm get the chance, do you mind elaborating a little bit more why we should not be using Metadata mode? I'm not 100% sure I understand the impact of using it. I'm afraid we don't have much context why it was created like that. Maybe an old issue on Vault that was later resolved?
Thanks once again for assisting with this.
Hello @maxb and @fairclothjm, unfortunately we were still not able to successfully enable the auth
plugin, even after trying to refactor the application to use InitializeFunc
.
I've just pushed a simplified version of the plugin to GitHub (https://github.com/heliobmartins/vault-context-deadline) that should allow you to reproduce the problem. The README.md contains the steps to reproduce the issue.
Also, you should be able to see the Pull Request I raised to use the InitializeFunc (https://github.com/heliobmartins/vault-context-deadline/pull/1) which fails with the same error.
I hope the repository help you to reproduce the error. Helio
do you mind elaborating a little bit more why we should not be using Metadata mode?
I believe the point is that it is unclear whether the environment variable to signify metadata mode can be considered a public API or not, and thus, whether there is a commitment to it continuing to exist and/or behave the same way in the future.
I've just pushed a simplified version of the plugin to GitHub (https://github.com/heliobmartins/vault-context-deadline) that should allow you to reproduce the problem.
Ah, now it is really getting interesting. Experimenting with this, a few things become clear:
These observations strongly hint at lock contention within Vault. I suspect some change in Vault 1.12 has introduced additional lock contention so that it is now impossible to make other requests to Vault whilst an auth backend is being enabled.
A bit of git bisect idenfies #16157 as the culprit - which adds taking a read-lock on Core.authLock
to pretty much every Vault request!
Nice sleuthing @maxb! I agree with your analysis, though I'm a bit hesitant to vilify #16157. As you say now pretty much all requests need to grab that lock, but even before that change many requests would, e.g. pathInternalUIMountsRead is invoked for most kv
cli operations to determine whether the mount is kvv1 or kvv2. I'm hoping that maybe we can change the mounting functions to release the lock before they call Initialize, rather than revisit DetermineRoleFromLoginRequest.
No vilification intended - just the observation that that's the commit at which the behaviour changed.
Right, I just meant that while that PR is the culprit in a sense, insofar as it's made an existing problem worse, I'd prefer to focus on reducing how long we hold the lock than on trying to change what that PR is doing.
@heliobmartins Now that we understand the underlying issue, you might want to give the sync.Once
way of initializing that I previously suggested before the discussion of InitializeFunc
(https://github.com/maxb/simple-vault-plugin/blob/7593415fe3f85c23da009afa75055c1479dff697/main.go) a go instead. It does have the disadvantage that you have to remember to insert
b.initOnce.Do(b.init)
at the top of all your request handling methods, but it should be enough to defer initialization to first use of the auth method, working around this locking issue.
Hello @maxb, 👋
🤦 I probably should've pushed a reproducible version of the plugin earlier. My bad.
I will try to understand a bit more how I'm going to refactor the plugin to use the sync.Once
suggestion and I will get back to you once I get the result.
Now that you better understand the issue, do you think that this is a bug that was introduced in Vault core? Or do you still think that this is related to our plugin initialisation? Because if that's the latter, I reckon I probably need to raise a ticket in our backlog to improve it as soon as possible, because ideally, we should be following the best practices for Plugin Development anyway.
Thanks once again for all your help. I will get back to you soon.
Hello @maxb, just a quick follow-up here, we've also tried to implement the sync.Once
approach and unfortunately it didn't work either. 😞
I tried with 2 different approaches and none of them worked. :(
Adding the reference for the changes here. https://github.com/heliobmartins/vault-context-deadline/pull/2 https://github.com/heliobmartins/vault-context-deadline/pull/3
What are the potential consequences if we actually try to add a new context prior invoking the initialisation? Example: https://github.com/heliobmartins/vault-context-deadline/pull/4/files
/\ With the above, we were actually able to load the plugin at least:
~ ❯ vault auth enable -options=app_role_role_id=${ROLE_ID} -options=app_role_secret_id=${SECRET_ID} -options='vault_addr=http://127.0.0.1:8200' slauth
Success! Enabled slauth auth method at: slauth/
Please refer back to https://github.com/maxb/simple-vault-plugin/blob/main/main.go - the whole point here is that you do not trigger the initialization anywhere that gets called during Factory
, and only trigger it later when the plugin receives its first request. Both your 2 and 3 end up triggering the initialization during Factory
.
In your 4 you have created a race condition, as you now have no guarantee initialization has actually completed before the first actual request to your plugin occurs.
Now that you better understand the issue, do you think that this is a bug that was introduced in Vault core? Or do you still think that this is related to our plugin initialisation? Because if that's the latter, I reckon I probably need to raise a ticket in our backlog to improve it as soon as possible, because ideally, we should be following the best practices for Plugin Development anyway.
Vault never documented either way, whether it was allowed to depend upon the Vault API during enabling a backend. You relied on this, but Vault has subsequently, accidentally, introduced lock contention which makes things break. Is this is a bug in Vault? I think it is unclear whether it is or not, as the plugin development docs do not rigorously document what a plugin author may or may not rely upon.
I do think there are some issues in your plugin that would ideally addressed:
1) Even without a rigorous specification, it is heavily implied by the design of existing plugins, that the Factory
function should avoid any initialization that can fail. It should just be wiring up objects.
2) Your plugin is currently not calling b.Backend.Setup
to pass the logical.BackendConfig
on to your framework.Backend
. This may result in some features of the framework.Backend
not working.
3) You have login/slauthtoken
declared as an unauthenticated path, despite just plain login
being the only path that is implemented.
4) The assumption that the plugin only ever needs to log in to Vault once ever is not very safe. Ideally it would be resilient to various kinds of potential failure: The initial login attempt could fail for some reason. The token could expire, or be revoked. In either of these cases, logging in again would make sense. (Note that implementing this means moving away from sync.Once
to a more flexible synchronization primitive, probably sync.RWMutex
. I only suggested sync.Once
as a least-effort path to overcoming the immediate problem.)
5) Storing the Vault token in Vault storage does not make sense to me. It means more read/write to Vault storage, which means more gRPC calls back and forth between the main Vault server and your plugin. I'd just keep it in (properly atomically accessed) memory. A new token can be acquired by logging in again after the plugin is restarted.
6) I previously remarked in this issue thread about storing the approle credentials in plugin options being undesirable. They should be moved to a dedicated config
endpoint within the plugin, mirroring how many existing auth methods (e.g. LDAP and Kubernetes) handle their configuration.
Hey @maxb, 👋
Thanks for all your input.
- Even without a rigorous specification, it is heavily implied by the design of existing plugins, that the
Factory
function should avoid any initialization that can fail. It should just be wiring up objects.- Your plugin is currently not calling
b.Backend.Setup
to pass thelogical.BackendConfig
on to yourframework.Backend
. This may result in some features of theframework.Backend
not working.
Does Hashicorp provide any documentation to follow this best practices? I mean, I couldn't even find any reference for logical.BackendConfig
and how to use what you're suggesting here.
- You have
login/slauthtoken
declared as an unauthenticated path, despite just plain login being the only path that is implemented.
As I mentioned in my comment https://github.com/hashicorp/vault/issues/19192#issuecomment-1436280530 this is not the real plugin, it is actually missing a lot of things there... What I pushed is just a reproducible version of the problem we are trying to investigate. The Paths are not meant to be relevant, since the whole problem we are trying to identify is the auth enable
. If it can't be enabled, it doesn't even make sense to have the paths.
- The assumption that the plugin only ever needs to log in to Vault once ever is not very safe. Ideally it would be resilient to various kinds of potential failure: The initial login attempt could fail for some reason. The token could expire, or be revoked. In either of these cases, logging in again would make sense. (Note that implementing this means moving away from sync.Once to a more flexible synchronization primitive, probably sync.RWMutex. I only suggested sync.Once as a least-effort path to overcoming the immediate problem.)
We are making usage of sync.RWMutex
for some of these paths. Example for the real login/slauthToken
const PathLoginSlauthToken = "login/slauthtoken" //#nosec
const ParamSlauthToken = "slauthToken"
func (b *AuthBackend) backendPaths() []*framework.Path {
authLoginUserOperation := framework.PathOperation{
Callback: b.withMetadataModeCheck(b.pathAuthLoginUser),
}
return []*framework.Path{
{
Pattern: PathLoginSlauthToken,
Fields: map[string]*framework.FieldSchema{
ParamSlauthToken: {
Type: framework.TypeString,
Description: "SLAuth token to authenticate",
},
},
Operations: map[logical.Operation]framework.OperationHandler{
logical.CreateOperation: &authLoginUserOperation,
},
}
}
func (b *AuthBackend) pathAuthLoginUser(ctx context.Context, req *logical.Request, d *framework.FieldData) (*logical.Response, error) {
defer b.tokenRenewLock.BeginRead()()
slauthToken := d.Get(ParamSlauthToken).(string)
//Handles the LoginUser Login (not added to the snippet)
return &logical.Response{
Data: Data{"identity": entity.Id},
Auth: &logical.Auth{
EntityID: entity.Id,
Period: userAuth.ttl,
TokenType: logical.TokenTypeBatch,
DisplayName: identityData.Name,
},
}, nil
}
func (b *ReadWriteLock) BeginRead() func() {
// begin reading
b.rwMutex.RLock()
return func() {
// end reading
b.rwMutex.RUnlock()
}
}
In regards to token expiration, etc... this isbeing handled by our plugin, through a set of methods. (Maybe this is not ideal, but I'm not 100% sure what was the motivation on doing like that. Unfortunately, I didn't belong to the team responsible for the plugin when it was implemented.
func (b *AuthBackend) doTokenRenew() error {
b.Logger().Debug("checking if token renewal needed")
return b.tokenRenewLock.ExecWrite(func() error {
needsRenewal, err := b.checkIfVaultLinkTokenNeedsRenewal()
if err != nil {
return errors.Wrapf(err, "failed to check if token needs renewal")
}
b.Logger().Debug("token needs renewal", "needsRenewal", needsRenewal)
if needsRenewal {
return b.vaultSysLink.RenewToken()
}
return nil
})
}
- Storing the Vault token in Vault storage does not make sense to me. It means more read/write to Vault storage, which means more gRPC calls back and forth between the main Vault server and your plugin. I'd just keep it in (properly atomically accessed) memory. A new token can be acquired by logging in again after the plugin is restarted.
Maybe this is something our team needs to revisit... We probably need to understand more what's being done here.
- I previously remarked in this issue thread about storing the approle credentials in plugin options being undesirable. They should be moved to a dedicated config endpoint within the plugin, mirroring how many existing auth methods (e.g. LDAP and Kubernetes) handle their configuration.
This is also probably we might need to understand better. I might not have all the context here.
I'm going to re-try the sync.Once suggestion, but I'm not 100% sure if this is going to work as expected due to the complexity of our plugin. It would be nice to understand if this extra lock
that was added to vault/core
will remain there, or if it will be removed in upcoming versions. Otherwise, I'm afraid that we are probably gonna need to do a lot of refactor in our plugin to overcome this problem and to be able to bump to any version greater than 1.11.x
.
Once again, thanks for your support in trying to assist here.
Does Hashicorp provide any documentation to follow this best practices?
I don't think so... everything I know about this, I have learnt by reading the source code - both that of the plugin SDK, and all the built in plugins within the Vault source tree.
It would be nice to understand if this extra
lock
that was added tovault/core
will remain there, or if it will be removed in upcoming versions.
@ncabatoff has indicated in earlier comments in this issue that the currently preferred direction is to mostly leave it as is, but try to reduce the length of time it is being held during the auth enable operation, so that it is no longer being held during the InitializeFunc
.
Regardless, moving your plugin's initialization to happen on first request via sync.Once
should enable you to sidestep the issue anyway, and not be dependent on this change.
Hello @maxb 👋
I ended up spending some time trying to refactor the plugin to use sync.Once
, and from what I can tell, it seems that the plugin was able to load successfully at least. Our integration and unit tests are working as expected after the refactor.
I just have a couple of follow up questions after making the change...
func (b *AuthBackend) init() {
err := b.Init()
if err != nil {
log.Panic("Failed plugin initialisation: ", err) //should we be panicking here? Is this the recommend/best approach?
}
}
func (b *AuthBackend) Init() error {
// handle all initialisation
ctx := context.Background()
if b.VaultSysLink == nil {
sysLink, err := util.NewVaultSysLink(ctx, b.conf, clientFactory)
if err != nil {
return errors.Wrapf(err, "failed to obtain sys link to vault backend")
}
b.VaultSysLink = sysLink
}
if b.SlauthTokenAuth == nil {
authenticator, err := initSlauthTokenAuth(b.conf)
if err != nil {
return errors.Wrapf(err, "failed to init slauthtoken authenticator component")
}
b.SlauthTokenAuth = authenticator
}
//
}
context canceled
response. I noticed later that the problem was because I was passing the ctx
parameter from Factory
to AuthBackend
. To overcome the issue, we created a new context in the init
function, like shown in previous snippet. I don't think we are gonna face any race condition in that case, right? Would that be the recommend approach in that case, or should we be using the context from anywhere else?@ncabatoff has indicated in earlier comments in this issue that the currently preferred direction is to mostly leave it as is, but try to reduce the length of time it is being held during the auth enable operation, so that it is no longer being held during the InitializeFunc.
I'm not sure I understand this part. Does this means we (plugin developers) can actually control the timeout of something through the InitializeFunc?
conf.StorageView
. Would you be able to share why? Or, at least provide some guidance for us to understand why this is wrong?plugin
and vault
, my understanding is that we are using a client https://github.com/heliobmartins/vault-context-deadline/blob/main/vault_client.go#L31-L49 that forwards HTTP requests. Is that the recommended approach? Or should we be looking into something else?Once again, thanks so much for your help.
- what would you do if during the initialisation of your plugin it returns an error?
The sync.Once
method means you have no option to implement any error recovery. If there is an error, your plugin will be permanently broken until a human administrator reloads it. Or, panicking might result in your plugin exiting (though I am not sure if that applies during a gRPC request handler... would need to test that) in which case Vault would try to restart it.
That's clearly not great, but it's the error handling situation you already had with your previous implementation, so I proposed the smallest possible change to fix the issue without further refactoring.
Longer term, you would replace sync.Once
with sync.RWMutex
, and implement more sophisticated code to retry initialization later when in an error state.
2. To overcome the issue, we created a new context
Contexts are being used here mostly to control the scope of cancellation. Reverting to context.Background()
(which is not a new context, it's an ever-present singleton) is the right thing to do when performing background operations which should not be cancelled even if their parent request is.
3. Would you be able share some points on the source code that it would be relevant for us to start reading more about plugin usage/best practices?
Um, not really. Or, all of it? You can't really summarize general familiarization with a few specific points.
4. I'm not sure I understand this part.
If Vault itself only held the relevant lock for certain critical sections of enabling a new auth method, and not whilst its InitializeFunc
was running, then the issue would be avoided.
5. You raised a point before that we should not be storing the the token in
conf.StorageView
.
The token is a short-lived credential which can (presumably?) be replaced at any time by doing another AppRole login, yet you are writing it to clustered network storage (the Vault store). It's not exactly wrong, but it's an inefficient choice, for which I can't see an obvious reason.
6. When it comes to the communication between the
plugin
andvault
, my understanding is that we are using a client https://github.com/heliobmartins/vault-context-deadline/blob/main/vault_client.go#L31-L49 that forwards HTTP requests.
When it comes to communication between Vault and any Vault plugin, the communication necessary to actually integrate with Vault as a plugin is via gRPC. But your plugin is a bit unusual, as it interacts with Vault both as a plugin (via gRPC), and as a regular client of the Vault API (via HTTP). That's fine, and there are no alternatives - Vault does not provide any special way for plugins to interact with the regular Vault API, so the rare plugins that need that, have to interact with it in two ways.
But, you're not forwarding HTTP requests from another source - you're just making them.
Hello again Max,
Once again, thank you very much for providing such assistance on this issue.
"Contexts are being used here mostly to control the scope of cancellation. Reverting to context.Background() (which is not a new context, it's an ever-present singleton) is the right thing to do when performing background operations which should not be cancelled even if their parent request is."
✅ In that case I assume we are handling it correct. 🙌🏽
"Um, not really. Or, all of it? You can't really summarize general familiarization with a few specific points."
Alright, no worries! I understand it can be hard. What I was mainly asking is if you would be able to point in Vault repository, where should we start reading to get to know more about plugin implementations. But don't worry about this.
"The token is a short-lived credential which can (presumably?) be replaced at any time by doing another AppRole login, yet you are writing it to clustered network storage (the Vault store). It's not exactly wrong, but it's an inefficient choice, for which I can't see an obvious reason."
❓ In that case, it would be more efficient to just issue Vault Tokens using the App Roles based on requests? Maybe we could reduce the TTL of these batch tokens, instead of have to handle operations such as Token Renew, etc? Is this how we should be handling it?
"But, you're not forwarding HTTP requests from another source - you're just making them."
✅ Alright! So it is fine and it is not wrong what we are doing. Thanks
We'll hopefully be rolling out the changes to Production Today. I will follow-up with the final result. 🙏🏾
Hello again @maxb ,
As promised, we released the changes to Prod and it looks good so far. Thank you once again for all your support throughout this issue.
In regards to the Github issue, I will let you guys decide if this extra lock
that was added should be considered as an issue or not. =)
Hello again team, 👋
Describe the bug Vault is returning
error: code = Canceled desc = context canceled" took=59.750225246s
when trying to register an externalauth
plugin. Thedebug
is followed by the error2023-02-15T00:58:57.417Z [ERROR] secrets.system.system_7f9e5cfe: error occurred during enable credential: path=$PLUGIN_PATH/ error="rpc error: code = Canceled desc = context canceled"
To Reproduce Steps to reproduce the behavior:
vault plugin register -sha256=${PLUGIN_SHA256} -command=${PLUGIN_COMMAND} auth ${PLUGIN_MOUNT}
eval vault auth enable -path=${PLUGIN_MOUNT} ${PLUGIN_MOUNT}
Expected behavior The registration of the plugin should and it should not be returning
error occurred during enable credential
Environment:
Vault Server Version (retrieve with
vault status
):Vault CLI Version (retrieve with
vault version
):Server Operating System/Architecture:
Other Dependencies
Additional context We are currently running a Vault Cluster with version 1.11.4 in our
production
andstaging
environment and we are trying to upgrade toVault 1.12.3
following the upgrade guide https://developer.hashicorp.com/vault/docs/upgrading/plugins.The error we are currently facing is on our local environment (we haven't yet tried to update the cluster since our Integration tests that depends on the external plugin is failing because of this issue).
After enabling
LOG_LEVEL=trace
we noticed that during the initialisation it looks like Vault core is successfully loading the plugin and for some reasoncore
kills it. (not sure if this expected)During the initialisation, we can confirm through the logs that the registration of the plugin was successful:
Running
vault plugin info auth $PLUGIN_NAME
does return the proper information of the plugin._Note1: We are using App Roles to enable the plugin _Note2: I'm not sure if this would help, but I extracted
vault-debug-logs
and you can find it attached. _Note3: Increasing the timeout of Vault Client didn't help either. We still got the same error.vault-debug-2023-02-14T23-26-56Z.tar.gz
Once again, thanks for your help into this. Helio