hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/
Other
31.29k stars 4.23k forks source link

Enabling Auth Plugin failing with context deadline exceeded #19192

Open heliobmartins opened 1 year ago

heliobmartins commented 1 year ago

Hello again team, 👋

Describe the bug Vault is returning error: code = Canceled desc = context canceled" took=59.750225246s when trying to register an external auth plugin. The debug is followed by the error 2023-02-15T00:58:57.417Z [ERROR] secrets.system.system_7f9e5cfe: error occurred during enable credential: path=$PLUGIN_PATH/ error="rpc error: code = Canceled desc = context canceled"

To Reproduce Steps to reproduce the behavior:

  1. Run vault plugin register -sha256=${PLUGIN_SHA256} -command=${PLUGIN_COMMAND} auth ${PLUGIN_MOUNT}
  2. Run eval vault auth enable -path=${PLUGIN_MOUNT} ${PLUGIN_MOUNT}

Expected behavior The registration of the plugin should and it should not be returning error occurred during enable credential

Environment:

Other Dependencies

github.com/hashicorp/vault/api v1.9.0
github.com/hashicorp/vault/sdk v0.8.1

Additional context We are currently running a Vault Cluster with version 1.11.4 in our production and staging environment and we are trying to upgrade to Vault 1.12.3 following the upgrade guide https://developer.hashicorp.com/vault/docs/upgrading/plugins.

The error we are currently facing is on our local environment (we haven't yet tried to update the cluster since our Integration tests that depends on the external plugin is failing because of this issue).

After enabling LOG_LEVEL=trace we noticed that during the initialisation it looks like Vault core is successfully loading the plugin and for some reason core kills it. (not sure if this expected)

2023-02-15T00:57:52.865Z [DEBUG] core: attempting to load backend plugin: name=slauth
2023-02-15T00:57:52.865Z [DEBUG] core: spawning a new plugin process: plugin_name=slauth id=U5n7XbLOxe
2023-02-15T00:57:53.101Z [DEBUG] core: successfully dispensed v5 backend plugin: name=slauth
2023-02-15T00:57:56.864Z [DEBUG] core: removed plugin client connection: id=U5n7XbLOxe
2023-02-15T00:57:56.869Z [DEBUG] core: killed external plugin process: path=/etc/vault/plugins/slauth pid=42
2023-02-15T00:57:57.148Z [INFO]  core: enabled credential backend: path=approle/ type=approle version=""
2023-02-15T00:57:57.417Z [DEBUG] system: pinning plugin version: plugin type=auth plugin name=slauth plugin version=v1.0.0
2023-02-15T00:57:57.417Z [DEBUG] core: spawning a new plugin process: plugin_name=slauth id=pPdTMTWzqp

During the initialisation, we can confirm through the logs that the registration of the plugin was successful:

vault plugin register -sha256=32143a4d4c732f760d1ea75ba56fff6311ec07f02d5764abdec68606c5f12cd7 -command=slauth auth slauth
Initializing slauth of type auth with checksum of 32143a4d4c732f760d1ea75ba56fff6311ec07f02d5764abdec68606c5f12cd7...
Success! Registered plugin: slauth

Running vault plugin info auth $PLUGIN_NAME does return the proper information of the plugin.

vault plugin info auth slauth
Key                   Value
---                   -----
args                  []
builtin               false
command               slauth
deprecation_status    n/a
name                  slauth
sha256                32143a4d4c732f760d1ea75ba56fff6311ec07f02d5764abdec68606c5f12cd7
version               n/a

_Note1: We are using App Roles to enable the plugin _Note2: I'm not sure if this would help, but I extracted vault-debug-logs and you can find it attached. _Note3: Increasing the timeout of Vault Client didn't help either. We still got the same error.

2023-02-15T04:57:20.303Z [TRACE] auth.slauth.auth_slauth_aca6e826.slauth: setup: transport=gRPC status=finished err="rpc error: code = Canceled desc = context canceled" took=1m29.760696907s
2023-02-15T04:57:20.303Z [ERROR] secrets.system.system_c46ab77e: error occurred during enable credential: path=slauth/ error="rpc error: code = Canceled desc = context canceled"

vault-debug-2023-02-14T23-26-56Z.tar.gz

Once again, thanks for your help into this. Helio

heliobmartins commented 1 year ago

I wonder if this is actually a continuation from the problem reported on: https://github.com/hashicorp/vault/issues/18323#issue-1492928626

If we compare the issue we were having before (the core was panicking) right after the log 2023-02-15T03:26:22.003Z [DEBUG] core: successfully dispensed v5 backend plugin: name=slauth.

Now, instead of panicking, it looks like the core is killing the external plugin (see: 2023-02-15T03:26:26.148Z [DEBUG] core: killed external plugin process: path=/etc/vault/plugins/slauth pid=161).

Please, let me know if you need any extra information.

maxb commented 1 year ago

I recall reading somewhere that there were changes in Vault 1.12 which mean plugins may now be started briefly just so Vault can retrieve some metadata about them. Therefore, I wouldn't take the fact it is being started and then killed soon after as necessarily an indication that there is any problem.

Although... maybe this is itself the problem. It may be that your plugin is attempting initialization that writes to storage, when Vault only wants to run it briefly to query metadata, and storage writing is not enabled.

I am seeing in your logs that it appears to be attempting an AppRole login, and then exactly 60 seconds later that is timing out:

2023-02-14T23:27:54.462Z [DEBUG] auth.slauth.auth_slauth_8b16d672.slauth.slauth: trying to auth with app role parameters: app_role_id=a05d3ed0-136f-16a2-985a-6ce0122397e5 app_role_secret_id=dff88c9c-805b-8927-3318-f1389816e5dc timestamp=2023-02-14T23:27:54.461Z

2023-02-14T23:28:54.171Z [TRACE] auth.slauth.auth_slauth_8b16d672.slauth: setup: transport=gRPC status=finished err="rpc error: code = Canceled desc = context canceled" took=59.736892012s

I believe you might find that you now need to move some initialization logic to not run when the plugin is initialized, but when it first receives a request, to remedy this.

heliobmartins commented 1 year ago

Hello @maxb , thanks for your response.

Yes, as pointed out, we are indeed injecting the role-id and secret-id when enabling the plugin, but the root cause for the timeout doesn't seem to be at the AppRole Login (I think). Previously, because of this 60s timeout (we were indeed getting an invalid response saying that the secret-id was invalid, so we tried to overcome this problem by increasing the timeout from 1m to 3m. You should see in the logs that the token creation did succeed.

Throughout our analyse, we identified that the context is hanging after invoking, vault auth enable .... The plugin sends a request to res, err := vaultClient.Write("auth/approle/login", Data{"role_id": appRoleId, "secret_id": appRoleSecretId}) and it hangs for 60s here.

For reference, vaultClient.write is:

func (c *defaultVaultClient) Write(path string, data Data) (*api.Secret, error) {
    return c.client.Logical().Write(path, data) // Here is where I suspect it hangs for 60s.
}

and we configured our client as:

func NewClient(vaultAddr string, token string) (VaultClient, error) {
    vaultUrl, err := url.Parse(vaultAddr)
    if err != nil {
        return nil, errors.Wrapf(err, "incorrect vault address provided")
    }

    clientCfg := api.DefaultConfig()
    clientCfg.Address = vaultAddr
    clientCfg.Timeout = time.Second * 200 // This was added to attempt overcome the 60s issue.
    client, err := api.NewClient(clientCfg)
    if err != nil {
        return nil, errors.Wrapf(err, "failed to init vault client")
    }
    client.SetToken(token)

    return &defaultVaultClient{
        addrUrl: vaultUrl,
        client:  client,
    }, nil
}

You should note that even after increasing the Client timeout to 200s it still failed after 1m30s.

2023-02-15T05:14:00.742Z [DEBUG] auth.slauth.auth_slauth_c56f54c5.slauth.slauth: trying to auth with app role parameters: app_role_id=46316cf6-3404-31ce-8cdb-80e43497341a app_role_secret_id=3f775f54-ae93-3a27-41c4-18236aa45dbc timestamp=2023-02-15T05:14:00.742Z
2023-02-15T05:15:30.490Z [TRACE] auth.slauth.auth_slauth_c56f54c5.slauth: setup: transport=gRPC status=finished err="rpc error: code = Canceled desc = context canceled" took=1m29.772511352s
2023-02-15T05:15:30.490Z [ERROR] secrets.system.system_27514380: error occurred during enable credential: path=slauth/ error="rpc error: code = Canceled desc = context canceled"
2023-02-15T05:15:30.492Z [DEBUG] identity: creating a new entity: alias="id:\"0f81f18e-1a55-0340-01a4-c63091414fb2\" canonical_id:\"bdea316e-3fe9-31f7-3796-49b2ceb0a093\" mount_type:\"approle\" mount_accessor:\"auth_approle_ee789955\" mount_path:\"auth/approle/\" metadata:{key:\"role_name\" value:\"slauth\"} name:\"46316cf6-3404-31ce-8cdb-80e43497341a\" creation_time:{seconds:1676438130 nanos:492014691} last_update_time:{seconds:1676438130 nanos:492014691} namespace_id:\"root\" local_bucket_key:\"packer/local-aliases/buckets/148\""

I hope this complementary information is useful.

maxb commented 1 year ago

Ah, OK. So in your previous logs, there were multiple attempts with different iterations of the code, and I just stopped at the first that seemed to correlate with the symptoms you mentioned. EDIT: No, I see now that the logs in your most recent comment supersede the attached file.

So, the real issue here, is why is your request for an AppRole login taking an incredibly long time.

Fundamentally, an AppRole login is just a few reads/writes to Vault's internal storage. If that's taking anywhere close to a minute to complete (EDIT: especially given you're using inmem storage!), that's where debug/fix efforts need to be focussed.

Are you able to log in using the AppRole RoleID and SecretID that the plugin is trying to use, if you manually make the login request yourself, outside of the plugin?

I believe the 1m30s timeout you are now observing, is probably the default timeout applied to all requests by the Vault server - i.e. since you have increased the client timeout beyond this, you now see the Vault server deciding to time out your requests.

Just out of curiosity... how are you getting the RoleID and SecretID into the plugin? AFAIK you usually have to enable a plugin, and then configure it in a separate request to a config endpoint?

heliobmartins commented 1 year ago

Are you able to log in using the AppRole RoleID and SecretID that the plugin is trying to use, if you manually make the login request yourself, outside of the plugin?

Yes! In fact, that's how I figured that the secret-id was expiring. I was able to successfully create the login token using the role-id and secret-id (and it was pretty quick tbh, less than a second actually).

Just out of curiosity... how are you getting the RoleID and SecretID into the plugin? AFAIK you usually have to enable a plugin, and then configure it in a separate request to a config endpoint?

For our local environment we are running this in a docker container, right. There is a script that is invoked with the docker-compose in this case.

The TL;DR is:

  1. We need to register the auth plugin, and we do that by running: a. vault plugin register -sha256=$SHA_256 -command=slauth auth slauth b. We can confirm that it is registered successfully due to: Success! Registered plugin: slauth
  2. Next, the policy that will be attached to slauth approle is created a. vault policy write slauth-plugin /vault/slauth-plugin.hcl b. Which also works due to: Success! Uploaded policy: slauth-plugin
  3. We create a new approle and we extract the role-id and secret-id, which will later be used to issue Vault tokens with the policy slauth-plugin , allowing it to perform operations in the backend.
  4. Final step (where it fails), we need to enable the plugin, and we do that by doing:
    vault auth enable -path=slauth \
    -options=app_role_role_id=${ROLE_ID} \
    -options=app_role_secret_id=${SECRET_ID} slauth

    This is the operation that actually hangs in the backend. Which I described to you before.

Once again @maxb , thanks so much for trying to assist.

maxb commented 1 year ago

OK... I think the key question remains "Why is the AppRole login hanging?"... unfortunately I'm running out of ideas on how to investigate that further.

One thing that might be interesting, just to eliminate things, would be take Docker and the associated docker-compose environment, and your local Vault configuration out of the picture.

A Vault server can be started in dev mode with plugins enabled and full logging just by running

VAULT_LOG_LEVEL=trace vault server -dev -dev-plugin-dir=/path/to/dir

When you do this, it automatically scans the directory and registers plugins in it so you skip the requirement to use vault plugin register ..., so you can go straight to running:

vault auth enable approle
vault write auth/approle/role/slauth token_ttl=24h
ROLE_ID=$(vault read auth/approle/role/agent/role-id -format=json | jq -r .data.role_id)
SECRET_ID=$(vault write -f auth/approle/role/agent/secret-id -format=json | jq -r .data.secret_id)
vault auth enable -options=app_role_role_id=${ROLE_ID} -options=app_role_secret_id=${SECRET_ID} slauth

I've skipped assigning any policy to the test approle here, since all we really care about is whether the login works.

I'm not sure if this will be fruitful or not, but hopefully it should be fairly easy to try out, just in case.

Beyond that, I think your next step should be to remove your AppRole login from executing during the plugin InitializeFunc, and instead log in as needed when handling user requests. Even if we don't know exactly why it is hanging


The rest of my reply is not directly related to the issue at hand, it's just reactions to a couple of the background details:

On the use of vault auth enable -options= for auth method config: The majority of the Vault plugins that I have worked with, tend to use a separate config endpoint for specifying configuration after a plugin has been enabled in a separate step. I would suggest migrating your plugin to the same approach, as the options map is not intended for storing confidential information. With this current architecture, you can't give your users the ability to do something as usually low-privileged as check what auth methods exist - vault auth list without revealing the approle secret_id.

On the need for a Vault plugin to itself log in to Vault: I too have encountered this need, but it's pretty unusual, as it is only needed when one Vault plugin wants to consume other Vault APIs as part of its operation. It's particularly unusual for an auth method, as usually they just need to confirm the user's credentials and return a confirmed username and metadata to Vault. Are you able to indulge my curiousity on what this plugin needs to call back in to Vault for?

heliobmartins commented 1 year ago

Hello again @maxb, (sorry for the late reply)

OK... I think the key question remains "Why is the AppRole login hanging?"... unfortunately I'm running out of ideas on how to investigate that further.

Yes, I reckon that's what we should probably be looking at. I'm also start to run out of ideas. :(

After adding a little bit more of logs, I noticed vault auth enable seems to be returning a 400 response.

Error enabling slauth auth: Error making API request.

URL: POST $VAULT_ADDR/v1/sys/auth/slauth
Code: 400. Errors:

* rpc error: code = Canceled desc = context canceled

In regards to your suggestion of testing the plugin out of the docker context, I reckon it would be a bit harder to do, because the plugin does expect the role-id and secret-id to be provided, otherwise, it will fail the initialisation. Is there any way we can lazy load the plugin on vault server -dev? I tried registering but then it complained with could not set plugin, plugin directory is not configured.


Are you able to indulge my curiousity on what this plugin needs to call back in to Vault for?

Hmmmm, I'm not sure I follow the question here... Based on the policy attached to slauth-plugin I can see that the main responsibilities for this plugin is to manage the ACL Policies, Entities, Groups, Secrets and also renew, revoke, and lookup-self tokens.

I'll continue trying to dig on my end on this issue, but I'm also running out of ideas at this stage. :(

maxb commented 1 year ago

After adding a little bit more of logs, I noticed vault auth enable seems to be returning a 400 response.

Yes, that would be expected given the previously shared logs.

In regards to your suggestion of testing the plugin out of the docker context, I reckon it would be a bit harder to do, because the plugin does expect the role-id and secret-id to be provided, otherwise, it will fail the initialisation.

Um, yes, but that's why I wrote out commands to create and supply those in my previous comment?

heliobmartins commented 1 year ago

Yep, but what I mean is that the server itself won't even be up when I run: VAULT_LOG_LEVEL=trace vault server -dev -dev-plugin-dir=/path/to/dir. It crashes with:

The server won't even be started successfully. See logs attached. logs.txt


I have also tried increasing max_request_duration (default 90s) to 600s. It still fails. Also, during these latests tests I've been running, I'm no longer using approles, instead, I'm using the Root Token.

I noticed that even methods the methods logical.read they are hanging. Very confusing.

Example:

//vault_client.go (used by our custom plugin)
func (c *defaultVaultClient) Read(path string) (*api.Secret, error) {
    log.Println("I'm calling read method.")
    l := c.client.Logical()
    log.Println("I've passed the logical stuff.")
    log.Println(path)
    log.Println("Submitting the request.")
    result, err := l.Read(path)
    log.Println("Request completed, printing result and err and returning.")
    log.Println(result)
    log.Println(err)
    return result, err
}
// vault api logical.go
func (c *Logical) Read(path string) (*Secret, error) {
    return c.ReadWithDataWithContext(context.Background(), path, nil)
}

The final logs were printed as follow:

2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 I'm calling read method.
2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 I've passed the logical stuff.
2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 auth/token/lookup-self
2023-02-16T00:43:57.359Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:43:57 Submitting the request.
2023-02-16T00:53:51.828Z [TRACE] activity: writing segment on timer expiration
2023-02-16T00:53:57.227Z [TRACE] auth.slauth.auth_slauth_aa04ea8d.slauth: setup: transport=gRPC status=finished err="rpc error: code = Canceled desc = context canceled" took=9m59.769782398s
2023-02-16T00:53:57.228Z [ERROR] secrets.system.system_34fc058a: error occurred during enable credential: path=slauth/ error="rpc error: code = Canceled desc = context canceled"
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing accessors: prefix=accessor/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing role HMACs: prefix=secret_id/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing secret ID HMACs: role_hmac=f3ebf621f450a93b0d8c82e3b1dccc533f7750c8a266827584c9a0b78ee4cc3c/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: found expired secret ID
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing accessors: prefix=accessor_local/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: listing role HMACs: prefix=secret_id_local/
2023-02-16T00:53:57.229Z [TRACE] auth.approle.auth_approle_5fdaacc6.tidy: done checking entries: num_entries=1
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:53:57 Request completed, printing result and err and returning.
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:53:57 &{eda1fd67-f03c-ebb5-114f-1b7d4de18144  0 false map[accessor:3O3NNBkBDsA2xUV1fQUQ6tBk creation_time:1676508231 creation_ttl:0 display_name:root entity_id: expire_time:<nil> explicit_max_ttl:0 id:hvs.7YRkxKx5tpPriVlPJ6OxJRln meta:<nil> num_uses:0 orphan:true path:auth/token/root policies:[root] ttl:0 type:service] [] <nil> <nil>}
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: 2023/02/16 00:53:57 <nil>
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: trying to store token in the internal storage: timestamp=2023-02-16T00:53:57.231Z
2023-02-16T00:53:57.231Z [DEBUG] auth.slauth.auth_slauth_aa04ea8d.slauth.slauth: error was: : EXTRA_VALUE_AT_END="failed to store vault token: rpc error: code = Canceled desc = context canceled" timestamp=2023-02-16T00:53:57.231Z
heliobmartins commented 1 year ago

@maxb, a quick update.

I ended up cloning vault repository and started adding some logs for vault/api/logical and also vault/api/client. I noticed that the requests are being hanged at: https://github.com/hashicorp/vault/blame/main/api/client.go#L1391, which is due to the call at https://github.com/hashicorp/go-retryablehttp/blob/master/client.go#L628.

If I figured anything else I will update here, but that's what I have got so far. :(

maxb commented 1 year ago

Yep, but what I mean is that the server itself won't even be up when I run: VAULT_LOG_LEVEL=trace vault server -dev -dev-plugin-dir=/path/to/dir.

I do not understand... of course the server itself won't be up until you run the command to start the server... that is what the vault server -dev command does.

It crashes with:

Did you mean to paste something here?

See logs attached.

From the posted logs, I am more convinced than ever that your plugin is performing initialization actions too soon.

Do what I said in an earlier comment:

Beyond that, I think your next step should be to remove your AppRole login from executing during the plugin InitializeFunc, and instead log in as needed when handling user requests.

heliobmartins commented 1 year ago

Howdy @maxb , 👋

Did you mean to paste something here?

I ended up attaching the logs and forgot to remove it, sorry about that 😭 .

From the posted logs, I am more convinced than ever that your plugin is performing initialization actions too soon.

I'm still trying to get my head around this to be honest. Was there any change surround plugin initialisations that we should be aware of and I might be missing it? I don't think it will be easy for us to refactor it, I'd be happy to have a look at it, but I'm quite not 100% sure if this will solve the issue.

Do what I said in an earlier comment:

In regards to your previous comment, I'm sort of confused if removing the App Role login would help with anything. One of my previous comments, I mentioned that I left the App Role Login mechanism aside and replaced it with the Root Token. Which means that we create vault client with the root token and even for operations such as _, err = vaultClient.Read("auth/token/lookup-self") is still being stuck.

Also, I was not able to find any InitializeFunc being called.... The initialisation of the plugin seems to be being done as recommended?

    factory := slauth.Factory{Version: Version}
    if err := plugin.Serve(&plugin.ServeOpts{
        BackendFactoryFunc: factory.Factory,
        TLSProviderFunc:    tlsProviderFunc,
    }); err != nil {
        logger := hclog.New(&hclog.LoggerOptions{})
        logger.Error("plugin shutting down", "error", err)
        os.Exit(1)
    }

There is something super odd happening when invoking the http request. For some reason it just gets stuck. I'll try to revisit the initialisation of the plugin, but at this stage, I'm no longer sure what I should try.

Once again, thanks for trying to assist.

maxb commented 1 year ago

Can you share more of the source code? Ideally as much as you can, and just redact any particularly internal bits, but definitely slauth.Factory.Factory and anything it calls.

fairclothjm commented 1 year ago

@heliomartins Hello!

What version of the Vault sdk and api are you using?

Vault 1.12 did introduce a change to how plugins are loaded. Prior to Vault 1.12 plugins are lazy loaded on startup. This means that plugin processes are killed after a successful mount and then respawned when a request is routed to them. Vault 1.12 introduced AutoMTLS for secrets/auth plugins so we do not lazy load them on startup anymore.

I will get this added to our 1.12 upgrade guide ASAP.

heliobmartins commented 1 year ago

@maxb , unfortunately I won't be able to share all the contents of the plugin, but I extracted as much as I could related to plugin initialisation. I hope this helps. snippet-plugin-init.txt

@fairclothjm , thanks for jumping in: I reckon it would be interesting to understand a bit more about this. =) If you have any draft of that it would be amazing to give it a shot.

We are trying to update to the latest version 1.12.3.

github.com/hashicorp/vault/api v1.9.0
github.com/hashicorp/vault/sdk v0.8.1
maxb commented 1 year ago

Ah, I see you're already trying to detect the plugin being launched in metadata mode and disable certain initialization in that case.

I suspect the issue is that this is no longer enough of a workaround.

I believe there's kind of an unwritten rule that Vault plugins should only be filling in the fields of the logical.Backend in the Factory func, and definitely not doing any initialization with external dependencies.

A relatively simple refactor that would allow you to prove this, would be to move your clientFactory/SysLink initialization to a helper method of your AuthBackend, and call that method, protected by a sync.Once that you also add to the AuthBackend, from each request callback.

That would ensure it only gets triggered once Vault routes a request which needs it to your plugin.

heliobmartins commented 1 year ago

I don't suppose you would have any example for this scenario would you? Or even pointing to any documentation that perhaps could help me attempt this migration?

heliobmartins commented 1 year ago

I did some other tests Today, and I noticed that if we invoke those vault requests using the CLI version of 1.11.7 for example, the initialisation of the plugin succeeds.

I'm attaching the logs with the difference between the initialisation using vault cli 1.11.7 and vault cli 1.12.3. Note that we have just changed the CLI version, we are still trying to build Vault with 1.12.3 version. logs-vault-cli-1-11-7.txt logs-vault-cli-1-12-3.txt

The following logs does seem interesting:

2023-02-17T02:35:45.964Z [INFO]  auth.slauth.auth_slauth_ec4c2744.slauth: configuring client automatic mTLS
2023-02-17T02:35:46.006Z [INFO]  auth.slauth.auth_slauth_ec4c2744.slauth.slauth: configuring server automatic mTLS: timestamp=2023-02-17T02:35:46.006Z

I wonder if there is anything we need to do to handle this auto mTLS thing. @fairclothjm, is there any documentation guiding on how to do that?

heliobmartins commented 1 year ago

One more update:

Max, I was able to perform the test (disregarding the docker compose) by doing:

  1. Created folder /tmp/vault/plugins
  2. Created vault.hcl with plugin_directory = "/private/tmp/vault/plugins" (private seems to be required for MacOS)
  3. Copied the plugin binary to /tmp/vault/plugins
  4. Started Vault with: VAULT_LOG_LEVEL=trace vault server -dev -dev-root-token-id="root" -config=/tmp/vault.hcl
  5. Extracted the SHASUM with SHASUM=$(shasum -a 256 "/tmp/vault/plugins/slauth" | cut -d " " -f1)
  6. Registered the plugin with: vault plugin register -sha256="${SHASUM}" -command="slauth" auth slauth
  7. Enabled AppRole vault auth enable approle
  8. Created Slauth AppRole role vault write auth/approle/role/slauth token_ttl=24h
  9. Extracted ROLE_ID with: ROLE_ID=$(vault read auth/approle/role/slauth/role-id -format=json | jq -r .data.role_id)
  10. Extracted SECRET_ID with: SECRET_ID=$(vault write -f auth/approle/role/slauth/secret-id -format=json | jq -r .data.secret_id)
  11. Tried to enable the plugin vault auth enable -options=app_role_role_id=${ROLE_ID} -options=app_role_secret_id=${SECRET_ID} -options='vault_addr=http://127.0.0.1:8200' slauth.

The problem still persists. :(

One of my teammates suggested trying to get a crash dump where it is hanging. See the coroutines crash logs. Not sure if this will help though. goroutines-log.txt

At this stage, I'm not sure what else I should try.

maxb commented 1 year ago

Here is a toy example of what I'm suggesting you do: https://github.com/maxb/simple-vault-plugin/blob/main/main.go

fairclothjm commented 1 year ago

@heliomartins

I wonder if there is anything we need to do to handle this auto mTLS thing

I think @maxb is right on the money here. I would suggest trying out his proposed changes to your plugin. Maybe you could move your initialization code to the InitializeFunc which you would pass to the framework.Backend in your Factory function. That might remove the need for the sync.Once protection and simplify things for your refactor.

Max is correct that plugins should not be doing any initialization with external dependencies in the Factory function. I will also add that having a dependency on MetadataMode in your plugin is not recommended. If Vault ever removes the concept of MetadataMode then you will again run into issues.

@maxb Thank you for all the help here!

maxb commented 1 year ago

I updated my example to use InitializeFunc rather than sync.Once (https://github.com/maxb/simple-vault-plugin/commit/0712bbc2035a33411fcd56612089962c61244705).

I went with sync.Once because I was paralleling the way that framework.Backend handles initialization of its path regexes, but it appears that was initially written long ago before InitializeFunc was added, so I guess that's why it doesn't rely on that functionality.

heliobmartins commented 1 year ago

Hello folks, I hope you all had a good weekend.

@maxb, thank you very much for sharing this snippet. I'm going to have a look into this. In the meanwhile, when you or @fairclothjm get the chance, do you mind elaborating a little bit more why we should not be using Metadata mode? I'm not 100% sure I understand the impact of using it. I'm afraid we don't have much context why it was created like that. Maybe an old issue on Vault that was later resolved?

Thanks once again for assisting with this.

heliobmartins commented 1 year ago

Hello @maxb and @fairclothjm, unfortunately we were still not able to successfully enable the auth plugin, even after trying to refactor the application to use InitializeFunc.

I've just pushed a simplified version of the plugin to GitHub (https://github.com/heliobmartins/vault-context-deadline) that should allow you to reproduce the problem. The README.md contains the steps to reproduce the issue.

Also, you should be able to see the Pull Request I raised to use the InitializeFunc (https://github.com/heliobmartins/vault-context-deadline/pull/1) which fails with the same error.

I hope the repository help you to reproduce the error. Helio

maxb commented 1 year ago

do you mind elaborating a little bit more why we should not be using Metadata mode?

I believe the point is that it is unclear whether the environment variable to signify metadata mode can be considered a public API or not, and thus, whether there is a commitment to it continuing to exist and/or behave the same way in the future.

I've just pushed a simplified version of the plugin to GitHub (https://github.com/heliobmartins/vault-context-deadline) that should allow you to reproduce the problem.

Ah, now it is really getting interesting. Experimenting with this, a few things become clear:

These observations strongly hint at lock contention within Vault. I suspect some change in Vault 1.12 has introduced additional lock contention so that it is now impossible to make other requests to Vault whilst an auth backend is being enabled.

maxb commented 1 year ago

A bit of git bisect idenfies #16157 as the culprit - which adds taking a read-lock on Core.authLock to pretty much every Vault request!

ncabatoff commented 1 year ago

Nice sleuthing @maxb! I agree with your analysis, though I'm a bit hesitant to vilify #16157. As you say now pretty much all requests need to grab that lock, but even before that change many requests would, e.g. pathInternalUIMountsRead is invoked for most kv cli operations to determine whether the mount is kvv1 or kvv2. I'm hoping that maybe we can change the mounting functions to release the lock before they call Initialize, rather than revisit DetermineRoleFromLoginRequest.

maxb commented 1 year ago

No vilification intended - just the observation that that's the commit at which the behaviour changed.

ncabatoff commented 1 year ago

Right, I just meant that while that PR is the culprit in a sense, insofar as it's made an existing problem worse, I'd prefer to focus on reducing how long we hold the lock than on trying to change what that PR is doing.

maxb commented 1 year ago

@heliobmartins Now that we understand the underlying issue, you might want to give the sync.Once way of initializing that I previously suggested before the discussion of InitializeFunc (https://github.com/maxb/simple-vault-plugin/blob/7593415fe3f85c23da009afa75055c1479dff697/main.go) a go instead. It does have the disadvantage that you have to remember to insert

b.initOnce.Do(b.init)

at the top of all your request handling methods, but it should be enough to defer initialization to first use of the auth method, working around this locking issue.

heliobmartins commented 1 year ago

Hello @maxb, 👋

🤦 I probably should've pushed a reproducible version of the plugin earlier. My bad.

I will try to understand a bit more how I'm going to refactor the plugin to use the sync.Once suggestion and I will get back to you once I get the result.

Now that you better understand the issue, do you think that this is a bug that was introduced in Vault core? Or do you still think that this is related to our plugin initialisation? Because if that's the latter, I reckon I probably need to raise a ticket in our backlog to improve it as soon as possible, because ideally, we should be following the best practices for Plugin Development anyway.

Thanks once again for all your help. I will get back to you soon.

heliobmartins commented 1 year ago

Hello @maxb, just a quick follow-up here, we've also tried to implement the sync.Once approach and unfortunately it didn't work either. 😞 I tried with 2 different approaches and none of them worked. :(

Adding the reference for the changes here. https://github.com/heliobmartins/vault-context-deadline/pull/2 https://github.com/heliobmartins/vault-context-deadline/pull/3

What are the potential consequences if we actually try to add a new context prior invoking the initialisation? Example: https://github.com/heliobmartins/vault-context-deadline/pull/4/files

/\ With the above, we were actually able to load the plugin at least:

~ ❯ vault auth enable -options=app_role_role_id=${ROLE_ID} -options=app_role_secret_id=${SECRET_ID} -options='vault_addr=http://127.0.0.1:8200' slauth
Success! Enabled slauth auth method at: slauth/
maxb commented 1 year ago

Please refer back to https://github.com/maxb/simple-vault-plugin/blob/main/main.go - the whole point here is that you do not trigger the initialization anywhere that gets called during Factory, and only trigger it later when the plugin receives its first request. Both your 2 and 3 end up triggering the initialization during Factory.

In your 4 you have created a race condition, as you now have no guarantee initialization has actually completed before the first actual request to your plugin occurs.

Now that you better understand the issue, do you think that this is a bug that was introduced in Vault core? Or do you still think that this is related to our plugin initialisation? Because if that's the latter, I reckon I probably need to raise a ticket in our backlog to improve it as soon as possible, because ideally, we should be following the best practices for Plugin Development anyway.

Vault never documented either way, whether it was allowed to depend upon the Vault API during enabling a backend. You relied on this, but Vault has subsequently, accidentally, introduced lock contention which makes things break. Is this is a bug in Vault? I think it is unclear whether it is or not, as the plugin development docs do not rigorously document what a plugin author may or may not rely upon.

I do think there are some issues in your plugin that would ideally addressed:

1) Even without a rigorous specification, it is heavily implied by the design of existing plugins, that the Factory function should avoid any initialization that can fail. It should just be wiring up objects.

2) Your plugin is currently not calling b.Backend.Setup to pass the logical.BackendConfig on to your framework.Backend. This may result in some features of the framework.Backend not working.

3) You have login/slauthtoken declared as an unauthenticated path, despite just plain login being the only path that is implemented.

4) The assumption that the plugin only ever needs to log in to Vault once ever is not very safe. Ideally it would be resilient to various kinds of potential failure: The initial login attempt could fail for some reason. The token could expire, or be revoked. In either of these cases, logging in again would make sense. (Note that implementing this means moving away from sync.Once to a more flexible synchronization primitive, probably sync.RWMutex. I only suggested sync.Once as a least-effort path to overcoming the immediate problem.)

5) Storing the Vault token in Vault storage does not make sense to me. It means more read/write to Vault storage, which means more gRPC calls back and forth between the main Vault server and your plugin. I'd just keep it in (properly atomically accessed) memory. A new token can be acquired by logging in again after the plugin is restarted.

6) I previously remarked in this issue thread about storing the approle credentials in plugin options being undesirable. They should be moved to a dedicated config endpoint within the plugin, mirroring how many existing auth methods (e.g. LDAP and Kubernetes) handle their configuration.

heliobmartins commented 1 year ago

Hey @maxb, 👋

Thanks for all your input.

  1. Even without a rigorous specification, it is heavily implied by the design of existing plugins, that the Factory function should avoid any initialization that can fail. It should just be wiring up objects.
  2. Your plugin is currently not calling b.Backend.Setup to pass the logical.BackendConfig on to your framework.Backend. This may result in some features of the framework.Backend not working.

Does Hashicorp provide any documentation to follow this best practices? I mean, I couldn't even find any reference for logical.BackendConfig and how to use what you're suggesting here.

  1. You have login/slauthtoken declared as an unauthenticated path, despite just plain login being the only path that is implemented.

As I mentioned in my comment https://github.com/hashicorp/vault/issues/19192#issuecomment-1436280530 this is not the real plugin, it is actually missing a lot of things there... What I pushed is just a reproducible version of the problem we are trying to investigate. The Paths are not meant to be relevant, since the whole problem we are trying to identify is the auth enable. If it can't be enabled, it doesn't even make sense to have the paths.

  1. The assumption that the plugin only ever needs to log in to Vault once ever is not very safe. Ideally it would be resilient to various kinds of potential failure: The initial login attempt could fail for some reason. The token could expire, or be revoked. In either of these cases, logging in again would make sense. (Note that implementing this means moving away from sync.Once to a more flexible synchronization primitive, probably sync.RWMutex. I only suggested sync.Once as a least-effort path to overcoming the immediate problem.)

We are making usage of sync.RWMutex for some of these paths. Example for the real login/slauthToken

const PathLoginSlauthToken = "login/slauthtoken" //#nosec
const ParamSlauthToken     = "slauthToken"

func (b *AuthBackend) backendPaths() []*framework.Path {
    authLoginUserOperation := framework.PathOperation{
        Callback: b.withMetadataModeCheck(b.pathAuthLoginUser),
    }

    return []*framework.Path{
        {
            Pattern: PathLoginSlauthToken,
            Fields: map[string]*framework.FieldSchema{
                ParamSlauthToken: {
                    Type:        framework.TypeString,
                    Description: "SLAuth token to authenticate",
                },
            },
            Operations: map[logical.Operation]framework.OperationHandler{
                logical.CreateOperation: &authLoginUserOperation,
            },
        }
}

func (b *AuthBackend) pathAuthLoginUser(ctx context.Context, req *logical.Request, d *framework.FieldData) (*logical.Response, error) {
    defer b.tokenRenewLock.BeginRead()()
    slauthToken := d.Get(ParamSlauthToken).(string)
    //Handles the LoginUser Login (not added to the snippet)

    return &logical.Response{
        Data: Data{"identity": entity.Id},
        Auth: &logical.Auth{
            EntityID:    entity.Id,
            Period:      userAuth.ttl,
            TokenType:   logical.TokenTypeBatch,
            DisplayName: identityData.Name,
        },
    }, nil
}

func (b *ReadWriteLock) BeginRead() func() {
    // begin reading
    b.rwMutex.RLock()
    return func() {
        // end reading
        b.rwMutex.RUnlock()
    }
}

In regards to token expiration, etc... this isbeing handled by our plugin, through a set of methods. (Maybe this is not ideal, but I'm not 100% sure what was the motivation on doing like that. Unfortunately, I didn't belong to the team responsible for the plugin when it was implemented.

func (b *AuthBackend) doTokenRenew() error {
    b.Logger().Debug("checking if token renewal needed")
    return b.tokenRenewLock.ExecWrite(func() error {
        needsRenewal, err := b.checkIfVaultLinkTokenNeedsRenewal()
        if err != nil {
            return errors.Wrapf(err, "failed to check if token needs renewal")
        }
        b.Logger().Debug("token needs renewal", "needsRenewal", needsRenewal)
        if needsRenewal {
            return b.vaultSysLink.RenewToken()
        }
        return nil
    })
}
  1. Storing the Vault token in Vault storage does not make sense to me. It means more read/write to Vault storage, which means more gRPC calls back and forth between the main Vault server and your plugin. I'd just keep it in (properly atomically accessed) memory. A new token can be acquired by logging in again after the plugin is restarted.

Maybe this is something our team needs to revisit... We probably need to understand more what's being done here.

  1. I previously remarked in this issue thread about storing the approle credentials in plugin options being undesirable. They should be moved to a dedicated config endpoint within the plugin, mirroring how many existing auth methods (e.g. LDAP and Kubernetes) handle their configuration.

This is also probably we might need to understand better. I might not have all the context here.

I'm going to re-try the sync.Once suggestion, but I'm not 100% sure if this is going to work as expected due to the complexity of our plugin. It would be nice to understand if this extra lock that was added to vault/core will remain there, or if it will be removed in upcoming versions. Otherwise, I'm afraid that we are probably gonna need to do a lot of refactor in our plugin to overcome this problem and to be able to bump to any version greater than 1.11.x.

Once again, thanks for your support in trying to assist here.

maxb commented 1 year ago

Does Hashicorp provide any documentation to follow this best practices?

I don't think so... everything I know about this, I have learnt by reading the source code - both that of the plugin SDK, and all the built in plugins within the Vault source tree.

It would be nice to understand if this extra lock that was added to vault/core will remain there, or if it will be removed in upcoming versions.

@ncabatoff has indicated in earlier comments in this issue that the currently preferred direction is to mostly leave it as is, but try to reduce the length of time it is being held during the auth enable operation, so that it is no longer being held during the InitializeFunc.

Regardless, moving your plugin's initialization to happen on first request via sync.Once should enable you to sidestep the issue anyway, and not be dependent on this change.

heliobmartins commented 1 year ago

Hello @maxb 👋

I ended up spending some time trying to refactor the plugin to use sync.Once, and from what I can tell, it seems that the plugin was able to load successfully at least. Our integration and unit tests are working as expected after the refactor.

I just have a couple of follow up questions after making the change...

  1. Based on your example, what would you do if during the initialisation of your plugin it returns an error? https://github.com/maxb/simple-vault-plugin/blob/main/main.go#L20-L23. As you're aware our plugin does perform a lot of initialisation which could go wrong. In that case, should we just panic? Example:
func (b *AuthBackend) init() {
    err := b.Init()
    if err != nil {
        log.Panic("Failed plugin initialisation: ", err) //should we be panicking here? Is this the recommend/best approach?
    }
}

func (b *AuthBackend) Init() error {

// handle all initialisation
        ctx := context.Background()
    if b.VaultSysLink == nil {
        sysLink, err := util.NewVaultSysLink(ctx, b.conf, clientFactory)
        if err != nil {
            return errors.Wrapf(err, "failed to obtain sys link to vault backend")
        }
        b.VaultSysLink = sysLink
    }
    if b.SlauthTokenAuth == nil {
        authenticator, err := initSlauthTokenAuth(b.conf)
        if err != nil {
            return errors.Wrapf(err, "failed to init slauthtoken authenticator component")
        }
        b.SlauthTokenAuth = authenticator
    }
//
}
  1. During the refactor, I was receiving a context canceled response. I noticed later that the problem was because I was passing the ctx parameter from Factory to AuthBackend. To overcome the issue, we created a new context in the init function, like shown in previous snippet. I don't think we are gonna face any race condition in that case, right? Would that be the recommend approach in that case, or should we be using the context from anywhere else?
  2. When you say: "I don't think so... everything I know about this, I have learnt by reading the source code - both that of the plugin SDK, and all the built in plugins within the Vault source tree." Would you be able share some points on the source code that it would be relevant for us to start reading more about plugin usage/best practices?
  3. @ncabatoff has indicated in earlier comments in this issue that the currently preferred direction is to mostly leave it as is, but try to reduce the length of time it is being held during the auth enable operation, so that it is no longer being held during the InitializeFunc.

I'm not sure I understand this part. Does this means we (plugin developers) can actually control the timeout of something through the InitializeFunc?

  1. You raised a point before that we should not be storing the the token in conf.StorageView. Would you be able to share why? Or, at least provide some guidance for us to understand why this is wrong?
  2. When it comes to the communication between the plugin and vault, my understanding is that we are using a client https://github.com/heliobmartins/vault-context-deadline/blob/main/vault_client.go#L31-L49 that forwards HTTP requests. Is that the recommended approach? Or should we be looking into something else?

Once again, thanks so much for your help.

maxb commented 1 year ago
  1. what would you do if during the initialisation of your plugin it returns an error?

The sync.Once method means you have no option to implement any error recovery. If there is an error, your plugin will be permanently broken until a human administrator reloads it. Or, panicking might result in your plugin exiting (though I am not sure if that applies during a gRPC request handler... would need to test that) in which case Vault would try to restart it.

That's clearly not great, but it's the error handling situation you already had with your previous implementation, so I proposed the smallest possible change to fix the issue without further refactoring.

Longer term, you would replace sync.Once with sync.RWMutex, and implement more sophisticated code to retry initialization later when in an error state.

2. To overcome the issue, we created a new context

Contexts are being used here mostly to control the scope of cancellation. Reverting to context.Background() (which is not a new context, it's an ever-present singleton) is the right thing to do when performing background operations which should not be cancelled even if their parent request is.

3. Would you be able share some points on the source code that it would be relevant for us to start reading more about plugin usage/best practices?

Um, not really. Or, all of it? You can't really summarize general familiarization with a few specific points.

4. I'm not sure I understand this part.

If Vault itself only held the relevant lock for certain critical sections of enabling a new auth method, and not whilst its InitializeFunc was running, then the issue would be avoided.

5. You raised a point before that we should not be storing the the token in conf.StorageView.

The token is a short-lived credential which can (presumably?) be replaced at any time by doing another AppRole login, yet you are writing it to clustered network storage (the Vault store). It's not exactly wrong, but it's an inefficient choice, for which I can't see an obvious reason.

6. When it comes to the communication between the plugin and vault, my understanding is that we are using a client https://github.com/heliobmartins/vault-context-deadline/blob/main/vault_client.go#L31-L49 that forwards HTTP requests.

When it comes to communication between Vault and any Vault plugin, the communication necessary to actually integrate with Vault as a plugin is via gRPC. But your plugin is a bit unusual, as it interacts with Vault both as a plugin (via gRPC), and as a regular client of the Vault API (via HTTP). That's fine, and there are no alternatives - Vault does not provide any special way for plugins to interact with the regular Vault API, so the rare plugins that need that, have to interact with it in two ways.

But, you're not forwarding HTTP requests from another source - you're just making them.

heliobmartins commented 1 year ago

Hello again Max,

Once again, thank you very much for providing such assistance on this issue.

"Contexts are being used here mostly to control the scope of cancellation. Reverting to context.Background() (which is not a new context, it's an ever-present singleton) is the right thing to do when performing background operations which should not be cancelled even if their parent request is."

✅ In that case I assume we are handling it correct. 🙌🏽

"Um, not really. Or, all of it? You can't really summarize general familiarization with a few specific points."

Alright, no worries! I understand it can be hard. What I was mainly asking is if you would be able to point in Vault repository, where should we start reading to get to know more about plugin implementations. But don't worry about this.

"The token is a short-lived credential which can (presumably?) be replaced at any time by doing another AppRole login, yet you are writing it to clustered network storage (the Vault store). It's not exactly wrong, but it's an inefficient choice, for which I can't see an obvious reason."

❓ In that case, it would be more efficient to just issue Vault Tokens using the App Roles based on requests? Maybe we could reduce the TTL of these batch tokens, instead of have to handle operations such as Token Renew, etc? Is this how we should be handling it?

"But, you're not forwarding HTTP requests from another source - you're just making them."

✅ Alright! So it is fine and it is not wrong what we are doing. Thanks

We'll hopefully be rolling out the changes to Production Today. I will follow-up with the final result. 🙏🏾

heliobmartins commented 1 year ago

Hello again @maxb ,

As promised, we released the changes to Prod and it looks good so far. Thank you once again for all your support throughout this issue.

In regards to the Github issue, I will let you guys decide if this extra lock that was added should be considered as an issue or not. =)