vernemq / vernemq

A distributed MQTT message broker based on Erlang/OTP. Built for high quality & Industrial use cases. The VerneMQ mission is active & the project maintained. Thank you for your support!
https://vernemq.com
Apache License 2.0

[Bug]: SSL Validation Error with Diversity - Postgresql starting 2.0.0 #2282

Closed cambrosch closed 3 months ago

cambrosch commented 4 months ago

Environment

Current Behavior

Running with exactly the same Docker parameters as under 1.13.0, after upgrading to 2.0.0 vmq_diversity cannot connect to our PostgreSQL server via SSL (hosted in Azure); see the error in the log below.

A downgrade back to 1.13.0 with the same parameters fixed the issue. Validating the certificate chain using pgadmin (mode: verify-full) showed no issues with SSL.

Expected behaviour

Connecting to this sql server should not result in a validation error.

Configuration, logs, error output, etc.

Error in Console:

2024-05-02T08:58:15.702041604Z 2024-05-02T08:58:15.699735+00:00 [error] <0.628.0> gen_server:error_info/8:1391: Generic server <0.628.0> terminating. Reason: {ssl_negotiation_failed,{options,incompatible,[{verify,verify_peer},{cacerts,undefined}]}}. Last message: {command,epgsql_cmd_connect,#{port => 5432,ssl => true,host => "servername-removed.postgres.database.azure.com",password => #Fun<epgsql_cmd_connect.0.87005817>,database => "emili",username => "psql",ssl_opts => []}}. State: {state,undefined,undefined,<<>>,undefined,on_message,undefined,{[],[]},undefined,undefined,undefined,undefined,[],information_redacted,[],undefined,undefined,undefined,undefined,undefined}. Client <0.618.0> stacktrace: [{gen,do_call,4,[{file,"gen.erl"},{line,240}]},{gen_server,call,3,[{file,"gen_server.erl"},{line,415}]},{epgsql,call_connect,2,[{file,"/opt/vernemq/_build/default/lib/epgsql/src/epgsql.erl"},{line,207}]},{vmq_diversity_worker_wrapper,handle_info,2,[{file,"/opt/vernemq/apps/vmq_diversity/src/vmq_diversity_worker_wrapper.erl"},{line,176}]}].
2024-05-02T08:58:15.703000270Z 2024-05-02T08:58:15.699712+00:00 [error] <0.626.0> gen_server:error_info/8:1391: Generic server <0.626.0> terminating. Reason: {ssl_negotiation_failed,{options,incompatible,[{verify,verify_peer},{cacerts,undefined}]}}. Last message: {command,epgsql_cmd_connect,#{port => 5432,ssl => true,host => "servername-removed.postgres.database.azure.com",password => #Fun<epgsql_cmd_connect.0.87005817>,database => "emili",username => "psql",ssl_opts => []}}. State: {state,undefined,undefined,<<>>,undefined,on_message,undefined,{[],[]},undefined,undefined,undefined,undefined,[],information_redacted,[],undefined,undefined,undefined,undefined,undefined}. Client <0.616.0> stacktrace: [{logger_config,allow,2,[{file,"logger_config.erl"},{line,64}]},{vmq_diversity_worker_wrapper,handle_info,2,[{file,"/opt/vernemq/apps/vmq_diversity/src/vmq_diversity_worker_wrapper.erl"},{line,181}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]}].
2024-05-02T08:58:15.703784054Z 2024-05-02T08:58:15.700569+00:00 [warning] <0.616.0> vmq_diversity_worker_wrapper:handle_info/2:181: Could not connect to postgresql due to {ssl_negotiation_failed,{options,incompatible,[{verify,verify_peer},{cacerts,undefined}]}}

Postgres-related Docker environment parameters:

DOCKER_VERNEMQ_VMQ_DIVERSITY__AUTH_POSTGRES__ENABLED = on
DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__HOST = servername-removed.postgres.database.azure.com
DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__PORT = 5432
DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__USER = psql
DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__SSL = on
DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__PASSWORD = removed
DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__DATABASE = removed
DOCKER_VERNEMQ_PLUGINS__VMQ_DIVERSITY = on
DOCKER_VERNEMQ_LISTENER__SSL__DEFAULT = 0.0.0.0:8883
DOCKER_VERNEMQ_LISTENER__SSL__CAFILE = /etc/ssl/ca.pem
DOCKER_VERNEMQ_LISTENER__SSL__CERTFILE = /etc/ssl/cert.pem
DOCKER_VERNEMQ_LISTENER__SSL__KEYFILE = /etc/ssl/key.pem
DOCKER_VERNEMQ_LISTENER__SSL__TLS_VERSION = tlsv1.3
DOCKER_VERNEMQ_LISTENER__SSL__REQUIRE_CERTIFICATE = off

The PostgreSQL server is set to min SSL version TLS 1.2, max SSL version TLS 1.3.

ioolkos commented 4 months ago

@cambrosch I think this comes from new SSL requirements in OTP 26 (which is used for 2.0.0, while 1.13.0 is based on OTP 25). But it's a good catch. It seems that in OTP 26, SSL wants to know about the CA chain explicitly.

Can you try setting vmq_diversity.postgres.cafile in vernemq.conf? Maybe pointing to the system CA certs is enough (/etc/ssl/certs/ca-certificates.crt), maybe not...


👉 Thank you for supporting VerneMQ: https://github.com/sponsors/vernemq 👉 Using the binary VerneMQ packages commercially (.deb/.rpm/Docker) requires a paid subscription.

cambrosch commented 4 months ago

That sadly doesn't work. I can't add it via DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__CAFILE, as that throws an "Error generating config with cuttlefish". I also can't manually override the config file in the Docker container: I tried several configurations, but if I change it manually it gets overwritten as soon as I restart VerneMQ, and if I mount a drive to persist the config file, it wipes the Docker container and refuses to work for one reason or another. That's a separate issue, but probably not one I can quickly fix :/
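
For context, the VerneMQ Docker image builds vernemq.conf from environment variables: a setting name is upper-cased, its dots are replaced by double underscores, and DOCKER_VERNEMQ_ is prefixed. The Python sketch below illustrates only that naming rule; the helper name is hypothetical, and the image's actual start script may apply further rules.

```python
def docker_env_name(setting: str) -> str:
    """Map a vernemq.conf setting name to its Docker environment variable.

    Assumed convention: upper-case the name, replace "." with "__",
    and prefix the result with DOCKER_VERNEMQ_.
    """
    return "DOCKER_VERNEMQ_" + setting.upper().replace(".", "__")


if __name__ == "__main__":
    # The setting suggested earlier in this thread.
    print(docker_env_name("vmq_diversity.postgres.cafile"))
```

Applied to vmq_diversity.postgres.host, the same rule yields the DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__HOST variable listed in the original report.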

ioolkos commented 4 months ago

I think you can mount a conf.local file at /etc/vernemq/vernemq.conf.local; when the Docker image finds it, it uses that file as a full replacement for the generated config. But this will not solve the issue here. An error generating the config is usually caused by a wrong setting name, but yours looks correct :(


ioolkos commented 4 months ago

@cambrosch do you see the Cuttlefish config error printed to your console when you run the Docker image in the foreground?


cambrosch commented 4 months ago

2024-05-02T10:15:08.68071  Connecting to the container 'vernemq'...
2024-05-02T10:15:08.70573  Successfully Connected to container: 'vernemq' [Revision: 'vernemq--jamr5ej-5dfd4d78dc-4khc5', Replica: 'vernemq--jamr5ej']
2024-05-02T10:15:10.703798696Z Error generating config with cuttlefish
2024-05-02T10:15:10.703850738Z   run `vernemq config generate -l debug` for more information.

ioolkos commented 4 months ago

@cambrosch are you able to attach to the container and run vernemq config generate -l debug? This should print out the actual config problem.


cambrosch commented 4 months ago

Sadly, the container immediately crashes upon getting this message, so I cannot attach a console :/

ioolkos commented 4 months ago

I just tested this with a docker run, feeding it an example.env file with Postgres configs similar to yours. This initially complained about whitespace around the = signs in the env file, but other than that it seems to work; at least there were no complaints when generating the config. I'm not sure how you run the Docker image, though.
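
The whitespace complaint mentioned above can be checked before starting the container. The following Python sketch scans the text of an env file and reports assignments that have whitespace around the = sign; the function name and the sample content are hypothetical, and the check is a simple heuristic rather than Docker's own parser.

```python
import re
from typing import List

# KEY=value, capturing any whitespace directly before or after the "=".
_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(?P<pre>\s*)=(?P<post>\s*)")


def lines_with_spaced_equals(text: str) -> List[int]:
    """Return 1-based line numbers whose assignment has spaces around '='."""
    offending = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comments are not assignments.
        if not stripped or stripped.startswith("#"):
            continue
        match = _ASSIGNMENT.match(stripped)
        if match and (match.group("pre") or match.group("post")):
            offending.append(number)
    return offending


if __name__ == "__main__":
    sample = (
        "DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__SSL = on\n"
        "DOCKER_VERNEMQ_VMQ_DIVERSITY__POSTGRES__PORT=5432\n"
    )
    print(lines_with_spaced_equals(sample))
```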


cambrosch commented 4 months ago

Ah, I messed that up. I had /etc/ssl mounted for the MQTT TLS certs, so /etc/ssl/certs/ca-certificates.crt didn't even exist. I re-created that now, and now the config at least boots again. Alas, now I'm on to a new error:

2024-05-02T14:49:19.698008699Z 2024-05-02T14:49:19.697685+00:00 [notice] <0.3694.0> ssl_handshake:path_validation_alert/1:2127: TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure, - {bad_cert,hostname_check_failed}
2024-05-02T14:49:19.698101996Z 2024-05-02T14:49:19.697896+00:00 [warning] <0.616.0> vmq_diversity_worker_wrapper:handle_info/2:181: Could not connect to postgresql due to {ssl_negotiation_failed,{tls_alert,{handshake_failure,"TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure\n {bad_cert,hostname_check_failed}"}}}
2024-05-02T14:49:19.699001414Z 2024-05-02T14:49:19.697927+00:00 [error] <0.3689.0> gen_server:error_info/8:1391: Generic server <0.3689.0> terminating. Reason: {ssl_negotiation_failed,{tls_alert,{handshake_failure,"TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure\n {bad_cert,hostname_check_failed}"}}}. Last message: {command,epgsql_cmd_connect,#{port => 5432,ssl => true,host => "hostname-removed.postgres.database.azure.com",password => #Fun<epgsql_cmd_connect.0.87005817>,database => "removed",username => "psql",ssl_opts => [{cacertfile,"/etc/ssl/certs/ca-certificates.crt"}]}}. State: {state,undefined,undefined,<<>>,undefined,on_message,undefined,{[],[]},undefined,undefined,undefined,undefined,[],information_redacted,[],undefined,undefined,undefined,undefined,undefined}. Client <0.616.0> stacktrace: [{gen,do_call,4,[{file,"gen.erl"},{line,240}]},{gen_server,call,3,[{file,"gen_server.erl"},{line,415}]},{epgsql,call_connect,2,[{file,"/opt/vernemq/_build/default/lib/epgsql/src/epgsql.erl"},{line,207}]},{vmq_diversity_worker_wrapper,handle_info,2,[{file,"/opt/vernemq/apps/vmq_diversity/src/vmq_diversity_worker_wrapper.erl"},{line,176}]}].
2024-05-02T14:49:19.699475025Z 2024-05-02T14:49:19.698600+00:00 [error] <0.3689.0> proc_lib:crash_report/4:584: crasher: initial call: epgsql_sock:init/1, pid: <0.3689.0>, registered_name: [], exit: {{ssl_negotiation_failed,{tls_alert,{handshake_failure,"TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure\n {bad_cert,hostname_check_failed}"}}},[{gen_server,handle_common_reply,8,[{file,"gen_server.erl"},{line,1226}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [<0.616.0>,<0.615.0>,auth_postgres,vmq_diversity_sup,<0.595.0>], message_queue_len: 0, messages: [], links: [<0.616.0>], dictionary: [], trap_exit: false, status: running, heap_size: 10958, stack_size: 28, reductions: 33545; neighbours:
2024-05-02T14:49:19.713975185Z 2024-05-02T14:49:19.713598+00:00 [notice] <0.3698.0> ssl_handshake:path_validation_alert/1:2127: TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure, - {bad_cert,hostname_check_failed}
2024-05-02T14:49:19.714218930Z 2024-05-02T14:49:19.713791+00:00 [warning] <0.620.0> vmq_diversity_worker_wrapper:handle_info/2:181: Could not connect to postgresql due to {ssl_negotiation_failed,{tls_alert,{handshake_failure,"TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure\n {bad_cert,hostname_check_failed}"}}}
2024-05-02T14:49:19.714381329Z 2024-05-02T14:49:19.713796+00:00 [error] <0.3690.0> gen_server:error_info/8:1391: Generic server <0.3690.0> terminating. Reason: {ssl_negotiation_failed,{tls_alert,{handshake_failure,"TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure\n {bad_cert,hostname_check_failed}"}}}. Last message: {command,epgsql_cmd_connect,#{port => 5432,ssl => true,host => "hostname-removed.postgres.database.azure.com",password => #Fun<epgsql_cmd_connect.0.87005817>,database => "removed",username => "psql",ssl_opts => [{cacertfile,"/etc/ssl/certs/ca-certificates.crt"}]}}. State: {state,undefined,undefined,<<>>,undefined,on_message,undefined,{[],[]},undefined,undefined,undefined,undefined,[],information_redacted,[],undefined,undefined,undefined,undefined,undefined}. Client <0.620.0> stacktrace: [{gen,do_call,4,[{file,"gen.erl"},{line,240}]},{gen_server,call,3,[{file,"gen_server.erl"},{line,415}]},{epgsql,call_connect,2,[{file,"/opt/vernemq/_build/default/lib/epgsql/src/epgsql.erl"},{line,207}]},{vmq_diversity_worker_wrapper,handle_info,2,[{file,"/opt/vernemq/apps/vmq_diversity/src/vmq_diversity_worker_wrapper.erl"},{line,176}]}].
2024-05-02T14:49:19.714946319Z 2024-05-02T14:49:19.714359+00:00 [error] <0.3690.0> proc_lib:crash_report/4:584: crasher: initial call: epgsql_sock:init/1, pid: <0.3690.0>, registered_name: [], exit: {{ssl_negotiation_failed,{tls_alert,{handshake_failure,"TLS client: In state wait_cert at ssl_handshake.erl:2127 generated CLIENT ALERT: Fatal - Handshake Failure\n {bad_cert,hostname_check_failed}"}}},[{gen_server,handle_common_reply,8,[{file,"gen_server.erl"},{line,1226}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [<0.620.0>,<0.615.0>,auth_postgres,vmq_diversity_sup,<0.595.0>], message_queue_len: 0, messages: [], links: [<0.620.0>], dictionary: [], trap_exit: false, status: running, heap_size: 10958, stack_size: 28, reductions: 33544; neighbours:

ioolkos commented 4 months ago

Argh, now it's a verification error (the client tries to verify the peer), on the level of Erlang SSL. Need to research this but cannot do it immediately. Maybe also some sort of wildcard server name is the issue.


ioolkos commented 4 months ago

I'm now suspecting this is the same as https://github.com/vernemq/vernemq/issues/1485 that we had to fix in the MQTT bridge. Are those wildcard certs?


ioolkos commented 4 months ago

@cambrosch are you still looking into this? Is the public cert of the Postgres server a wildcard cert? https://en.wikipedia.org/wiki/Wildcard_certificate


cambrosch commented 4 months ago

The certificate is using:

Common Name: removedhash.database.azure.com
Subject Alternative Names: removedhash.database.azure.com, dev-removed-psql.postgres.database.azure.com
Organization: Microsoft Corporation

I don't see any wildcard, but the common name is also not the domain name we use; that one is only listed in the alternative names.
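
To make the wildcard question concrete, here is a simplified Python sketch of how a TLS client typically compares the requested hostname with a certificate's Subject Alternative Names. It is an illustration under assumed rules (case-insensitive comparison, a wildcard accepted only as the entire left-most label), not the matching code used by Erlang/OTP; the function name and the example hostnames are hypothetical.

```python
from typing import Iterable


def hostname_matches(hostname: str, san_entries: Iterable[str]) -> bool:
    """Return True if hostname matches any DNS name in san_entries."""
    host_labels = hostname.lower().rstrip(".").split(".")
    for entry in san_entries:
        san_labels = entry.lower().rstrip(".").split(".")
        # A wildcard stands for exactly one label, so label counts must agree.
        if len(san_labels) != len(host_labels):
            continue
        # Everything right of the first label must match literally.
        if san_labels[1:] != host_labels[1:]:
            continue
        if san_labels[0] == "*" or san_labels[0] == host_labels[0]:
            return True
    return False


if __name__ == "__main__":
    sans = ["hash.db.example.com", "dev-psql.postgres.db.example.com"]
    print(hostname_matches("dev-psql.postgres.db.example.com", sans))
```

Under rules like these, a hostname that appears only among the alternative names, and not as the common name, still counts as a match.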

cambrosch commented 4 months ago

Also:

depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = DigiCert Inc, CN = DigiCert SHA2 Secure Server CA
verify return:1
depth=0 C = US, ST = Washington, L = Redmond, O = Microsoft Corporation, CN = removedhash.database.azure.com
verify return:1
---
Certificate chain
 0 s:C = US, ST = Washington, L = Redmond, O = Microsoft Corporation, CN = removedhash.database.azure.com
   i:C = US, O = DigiCert Inc, CN = DigiCert SHA2 Secure Server CA
 1 s:C = US, O = DigiCert Inc, CN = DigiCert SHA2 Secure Server CA
   i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
 2 s:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
   i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA

Requested Signature Algorithms: ECDSA+SHA256:ECDSA+SHA384:ECDSA+SHA512:Ed25519:Ed448:RSA-PSS+SHA256:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA-PSS+SHA256:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA224:ECDSA+SHA1:RSA+SHA224:RSA+SHA1
Shared Requested Signature Algorithms: ECDSA+SHA256:ECDSA+SHA384:ECDSA+SHA512:Ed25519:Ed448:RSA-PSS+SHA256:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA-PSS+SHA256:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 8913 bytes and written 839 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_256_GCM_SHA384
    Session-ID: 441A89869FA67AE2B6E730907FB563C4103DA580AA1CD249445439FD6652CF19
    Session-ID-ctx:
    Resumption PSK: EC984194F66930E86B393A88C7E5C7EA7BC32C0D8D12743AF40E8E67285EE6F0845B1799FFCDB24AB3096D42AAF9AE5F
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 7200 (seconds)
    TLS session ticket:
    0000 - 33 e5 9b d1 be 3d ee 94-79 33 c0 fd 7d 7f 63 34   3....=..y3..}.c4
    0010 - 62 ca 74 ab a6 bb 76 52-52 2a 6f 63 79 36 95 e1   b.t...vRR*ocy6..

    Start Time: 1715759464
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
    Max Early Data: 0

ioolkos commented 4 months ago

We'll need to bite the bullet and implement more options for all plugins that need outgoing SSL.

Those are:

The reason is that OTP 26 defaults to verify_peer for clients. Surprisingly, there's no way to configure this via application environment. Another option would be to fall back to OTP 25.

@cambrosch one thing I wonder, though: what happens when you set the Postgres host to an IP address instead of a name (if that's possible in your Azure env)?

EDIT: just to be clear: it's of course not a bad thing to harden requirements with verify_peer. It will require the client to have access to a CA file so that it can verify the server. But I think the hostname_check (SNI) is then also triggered by that.
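
As a cross-language analogy only (Python's standard ssl module, not Erlang's ssl application), the sketch below shows the same coupling: a default client context requires certificate verification and checks the hostname along with it, and verification has to be switched off explicitly. The function names are hypothetical.

```python
import ssl


def verifying_context() -> ssl.SSLContext:
    """Default client context: verifies the peer chain and its hostname."""
    return ssl.create_default_context()


def non_verifying_context() -> ssl.SSLContext:
    """Client context with verification explicitly disabled (insecure)."""
    ctx = ssl.create_default_context()
    # The hostname check must be turned off before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


if __name__ == "__main__":
    strict = verifying_context()
    print(strict.verify_mode == ssl.CERT_REQUIRED, strict.check_hostname)
```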


mths1 commented 4 months ago

@ioolkos: I can reproduce this. An Azure DB with the default Microsoft certificates fails as described. Using an IP didn't make any difference.

ioolkos commented 4 months ago

@mths1 Thanks for testing! Something like https://github.com/vernemq/vernemq/pull/2288 (untested) is then needed for any outgoing SSL, to be fully OTP 26 compliant.


ioolkos commented 3 months ago

@cambrosch just FYI, this should be addressed by https://github.com/vernemq/vernemq/pull/2284

