hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

context canceled Error when using cassandra database plugin #20169

Open Albert-W opened 1 year ago

Albert-W commented 1 year ago

Describe the bug

[ERROR] UnexpectedError: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
, on post https://{baseurl}/v1/database/config/hydro-cassandra-db

Vault is logging the below error:

[ERROR] core: forward request error: error="error during forwarding RPC request"
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Canceled desc = context canceled"

Vault configuration

  - type: database
    path: database
    description: General Vault Database secrets engine
    options:
      default_lease_ttl: 720h
      max_lease_ttl: 720h
    config:
      cassandra_auth: {{CASSANDRA_AUTH}}
      engine_options:
        name: {{VAULT_CASSANDRA_DB_PLUGIN_NAME}}
        plugin_name: cassandra-database-plugin
        hosts: {{CASSANDRA_HOST}}
        username: ***
        password: ***
        protocol_version: 4
        allowed_roles:
          - "*"
        connect_timeout: 120s
        skip_verification: true
        consistency: LOCAL_QUORUM
        local_datacenter: {{CASSANDRA_DATACENTER}}
        tls: {{CASSANDRA_TLS}}
        insecure_tls: false
        tls_server_name: {{CASSANDRA_TLS_SERVER_NAME_TO_VERIFY}}
        pem_json: "{\"ca_chain\":{{CA_CHAIN}}}"

Expected behavior: the action should succeed.

Environment: Vault 1.13.0 failed. Vault 1.12.2 failed. Vault 1.6.0 succeeded.

hghaf099 commented 1 year ago

Welcome to HashiCorp Vault, and thanks for filing this issue. Would you please provide us with the Vault configuration and steps to reproduce the issue? Also, could you verify that Cassandra was reachable at the time of the issue?

Albert-W commented 1 year ago

Hi @hghaf099, here is my use case. I have a Cassandra cluster running in an AWS account behind a public load balancer, and a Vault running in a different AWS account behind an API gateway. We get this error when we try to execute a command like:

vault write database/config/my-cassandra-database \
    plugin_name="cassandra-database-plugin" \
    hosts=<host> \
    port=<port> \
    protocol_version=4 \
    username=<username> \
    password=<password> \
    allowed_roles="*" \
    consistency=LOCAL_QUORUM \
    connect_timeout=30s

It works well with Vault 1.6.0, but fails with Vault 1.12.2 and 1.13.0.

heatherezell commented 1 year ago

Can you change the connect timeout to 60 or 90s to see if that helps at all?

Albert-W commented 1 year ago

Sure, I updated it to 3000s, but it still failed: Error writing data to database/config/yichang-cassandra-database: context deadline exceeded

Using vault monitor -log-level=debug, I get the following message:

2023-05-10T15:34:45.327Z [DEBUG] secrets.database.database_f4c1337b: got database plugin instance: type=cassandra

Only one line is related.

maxb commented 1 year ago

Try adding verify_connection=false to your vault write database/config/my-cassandra-database command.

If that makes it "work", you will have proved that the problem is that Vault is unable to successfully connect to your Cassandra database, and you will have to debug your network setup and Cassandra to determine why that is the case.
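
As a rough illustration of the suggestion above, the write could look like the sketch below. The host name, port, and credential values are placeholders (assumptions, not taken from this thread), and `write_config` is a hypothetical helper that exists only so the command can be previewed with `echo` before it is run. `verify_connection` is a parameter of the database secrets engine's config endpoint; setting it to false skips the connection check when the configuration is written.

```shell
#!/bin/sh
# Sketch only: placeholder connection values, override them via the environment.
CASSANDRA_HOST="${CASSANDRA_HOST:-cassandra.example.com}"
CASSANDRA_PORT="${CASSANDRA_PORT:-9042}"
CASSANDRA_USER="${CASSANDRA_USER:-vault}"
CASSANDRA_PASSWORD="${CASSANDRA_PASSWORD:-changeme}"

# write_config is a hypothetical helper. Its arguments are used as a command
# prefix, so `write_config echo` prints the command instead of running it.
write_config() {
  "$@" vault write database/config/my-cassandra-database \
    plugin_name="cassandra-database-plugin" \
    hosts="$CASSANDRA_HOST" \
    port="$CASSANDRA_PORT" \
    protocol_version=4 \
    username="$CASSANDRA_USER" \
    password="$CASSANDRA_PASSWORD" \
    allowed_roles="*" \
    consistency=LOCAL_QUORUM \
    connect_timeout=30s \
    verify_connection=false
}

# Dry run: show what would be sent to Vault (note that this prints the password).
write_config echo
```

Calling `write_config` with no arguments runs the real `vault write`. If the write succeeds only with `verify_connection=false`, that points at the connection check rather than at storing the configuration.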

Albert-W commented 1 year ago

The connection is definitely working, because I can read the creds.

Key                Value
---                -----
lease_id           database/creds/my-role/GeSBplzOxjuknJzvNo3J0srr
lease_duration     8m
lease_renewable    true
password           oWToydMwaZc1qD-9rrTn
username           v_root_my_role_8oxsqvg6ytjlknq1khrq_xxxxx

But the error persists when I update the configuration.

Adding "verify_connection=false" will improve the configuration, but I will get an error when reading roles.

sh-4.2$ vault read database/creds/my-role
Error reading database/creds/my-role: context deadline exceeded

maxb commented 1 year ago

OK, so, the connection was working using the configuration that had previously been set in Vault:

> The connection is definitely working, because I can read the creds.
>
> Key                Value
> ---                -----
> lease_id           database/creds/my-role/GeSBplzOxjuknJzvNo3J0srr
> lease_duration     8m
> lease_renewable    true
> password           oWToydMwaZc1qD-9rrTn
> username           v_root_my_role_8oxsqvg6ytjlknq1khrq_xxxxx
>
> But the error persists when I update the configuration.

But the new configuration you were trying to apply is in some way broken, such that Vault times out connecting to Cassandra when using it.

> Adding "verify_connection=false" will improve the configuration, but I will get an error when reading roles.

By adding verify_connection=false you were able to update the configuration...

> sh-4.2$ vault read database/creds/my-role
> Error reading database/creds/my-role: context deadline exceeded

... and so now, generating new credentials no longer works, because Vault times out connecting to Cassandra.

Everything you've shared is pointing at the configuration you're trying to set in Vault being incorrect. You need to investigate that. It doesn't look like an issue with Vault itself.

heatherezell commented 1 year ago

Thank you, @maxb - I agree completely. This may be a question better suited to our Discuss forum. @Albert-W, please consider closing this issue and posting it there. Thanks! :)

Albert-W commented 1 year ago

Let me put it in a sequence, so that it's easy to understand.

  1. I create a configuration:
vault write database/config/yichang-cassandra-database \
plugin_name="cassandra-database-plugin" \
hosts="cassandra-us-east-1.cerberus.io" \
port=9042 \
protocol_version=4 \
username="vault" \
password="ioWN9mdYcDbh9y39EviEriX/JmO7rOAqxxxxxxxxxx" \
allowed_roles=my-role \
consistency="LOCAL_QUORUM" \
connect_timeout="60s" \
tls=true \
pem_json="{\"ca_chain\":$ca_chain}"
  2. It failed with the error "Error writing data to database/config/yichang-cassandra-database: context deadline exceeded", but the configuration was created.
  3. I created a role.
  4. I can read creds from the role: vault read database/roles/my-role
  5. I updated connect_timeout to "70s":
    vault write database/config/yichang-cassandra-database \
    plugin_name="cassandra-database-plugin" \
    hosts="cassandra-us-east-1.cerberus.io" \
    port=9042 \
    protocol_version=4 \
    username="vault" \
    password="ioWN9mdYcDbh9y39EviEriX/JmO7rOAqxxxxxxxxxx" \
    allowed_roles=my-role \
    consistency="LOCAL_QUORUM" \
    connect_timeout="70s" \
    tls=true \
    pem_json="{\"ca_chain\":$ca_chain}"

    The same thing happens.

The point is that the configuration is working, but the command returns an error.

heatherezell commented 1 year ago

The best guess I can make with this information is that the connection to cassandra is working, but the return trip to report that to Vault is not. I would start by doing some packet tracing and other network troubleshooting for connections to and from your Vault to your cassandra instance.
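
One possible starting point for that troubleshooting, sketched here under assumptions: run a TLS handshake against the Cassandra native port from the Vault host and see whether it completes. The host and port reuse the values from the earlier comment, `ca.pem` is a hypothetical path to the CA chain, and `check_tls` is a hypothetical helper whose arguments act as a command prefix for a dry run. This exercises only reachability and the certificate chain, not Cassandra authentication or anything Vault-specific.

```shell
#!/bin/sh
# Sketch only: values taken from the earlier comment, override via the environment.
HOST="${CASSANDRA_HOST:-cassandra-us-east-1.cerberus.io}"
PORT="${CASSANDRA_PORT:-9042}"
CA_FILE="${CA_FILE:-ca.pem}"   # hypothetical path to the CA chain file

# check_tls is a hypothetical helper. Its arguments are used as a command
# prefix, so `check_tls echo` prints the command instead of running it.
check_tls() {
  # stdin comes from /dev/null so s_client exits once the handshake is done.
  "$@" openssl s_client \
    -connect "$HOST:$PORT" \
    -servername "$HOST" \
    -CAfile "$CA_FILE" </dev/null
}

# Dry run: show the handshake command that would be executed.
check_tls echo
```

Running `check_tls` with no arguments on the Vault host performs the actual handshake. A hang or a verification error there would suggest a network or certificate problem between Vault and Cassandra, while a clean handshake would shift attention elsewhere.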

Albert-W commented 1 year ago

The same code has been running for a while (more than a year). With Vault 1.6.0, all is fine; when I upgrade Vault to 1.12.2 or 1.13.0, it begins to fail. I tested this a couple of times.

heatherezell commented 1 year ago

Okay, thanks for that info. I'll check with some folks who are more experienced with cassandra. :)

maxb commented 1 year ago

There's something really odd going on here... If you're still seeing the error when writing the config with verify_connection=false, then that operation isn't connecting to Cassandra at all. It should be just a simple write to Vault storage.

With that extra information, my understanding of the problem completely changes... The problem is, it's now firmly into "this shouldn't be possible" territory.

The only things I can think of to suggest are general "something weird is happening" debugging options:

Albert-W commented 1 year ago

Hi @maxb, thanks. Setting verify_connection=false succeeds. Without that setting, I dumped the stacks and got these traces:

2023-05-11T08:08:07.383Z [DEBUG] secrets.database.database_0e42fb14: created database object: name=yichang-cassandra-database plugin_name=cassandra-database-plugin
2023-05-11T08:08:42.321Z [DEBUG] secrets.database.database_0e42fb14: got database plugin instance: type=cassandra
2023-05-11T08:09:00.055Z [DEBUG] core.cluster-listener: performing server cert lookup
2023-05-11T08:09:00.122Z [DEBUG] core.request-forward: got request forwarding connection

The creds generated by it are working.

Albert-W commented 1 year ago

When reading the creds, it will first fail with "context deadline exceeded", but it can be fixed by running vault write -force /sys/leases/revoke-force/database/creds/my-role. Here are the full commands.

sh-4.2$ vault read database/creds/my-role
Error reading database/creds/my-role: context deadline exceeded
sh-4.2$ vault write -force /sys/leases/revoke-force/database/creds/my-role
Success! Data written to: sys/leases/revoke-force/database/creds/my-role
sh-4.2$ vault read database/creds/my-role
Key                Value
---                -----
lease_id           database/creds/my-role/TBYEhJ0Wm6O20jXbCxTksik8
lease_duration     8h
lease_renewable    true
password           PkZjRaHRylIB47-UCtWL
username           v_root_my_role_tocakq3ip9ris7kzlbbs_1683794198

maxb commented 1 year ago

> Hi @maxb, thanks. Setting verify_connection=false succeeds. Without that setting, I dumped the stacks and got these traces:

This seems to contradict what you said earlier.

I am sorry, but due to too much conflicting information given, I no longer have any idea what the actual problem is, and don't expect to be able to help further.

Albert-W commented 1 year ago

Sorry for the confusion. When I pasted the commands, I accidentally pasted the wrong one. Here is the updated version: https://github.com/hashicorp/vault/issues/20169#issuecomment-1542706527