nerves-project / nerves_system_rpi4

Base Nerves system configuration for the Raspberry Pi 4
Apache License 2.0
91 stars 49 forks source link

Nerves Hub connection fails after upgrading to 1.19.0 #160

Closed tonnenpinguin closed 2 years ago

tonnenpinguin commented 2 years ago

Environment

Elixir 1.13.4 (compiled with Erlang/OTP 25)

* Nerves environment: (`mix nerves.env --info`)

|nerves_bootstrap| Environment Package List

Pkg: nerves_toolchain_ctng Vsn: 1.8.5 Type: toolchain_platform BuildRunner: {nil, []}

Pkg: nerves_system_br Vsn: 1.19.0 Type: system_platform BuildRunner: {nil, []}

Pkg: nerves_system_rpi4 Vsn: 1.19.0 Type: system BuildRunner: {Nerves.Artifact.BuildRunners.Docker, [make_args: ["source", "all", "legal-info"]]}

Pkg: nerves_toolchain_aarch64_nerves_linux_gnu Vsn: 1.5.0 Type: toolchain BuildRunner: {Nerves.Artifact.BuildRunners.Local, []}

|nerves_bootstrap| Loadpaths Start

Nerves environment MIX_TARGET: ft_access MIX_ENV: prod

|nerves_bootstrap| Environment Variable List target: ft_access toolchain: /Users/tonnenpinguin/.nerves/artifacts/nerves_toolchain_aarch64_nerves_linux_gnu-darwin_x86_64-1.5.0 system: /Users/tonnenpinguin/.nerves/artifacts/nerves_system_rpi4-portable-1.19.0 app: .

|nerves_bootstrap| Loadpaths End

* Additional information about your host, target hardware or environment that
  may help

### Current behavior

After upgrading from `1.18.4` devices fail to connect to our self-hosted nerves hub instance (running on OTP 24) with the following error

12:27:21.835 [error] mfa=:gen_server.error_info/8 line=1399 GenServer #PID<0.2705.0> terminating ** (CaseClauseError) no case clause matching: {:error, %Mint.HTTP1{buffer: "", host: "device.nerves-hub.[...]", mode: :active, port: 443, private: %{extensions: [], scheme: :wss, sec_websocket_key: "SEC_WEBSOCKET_KEY"}, proxy_headers: [], request: %{body: nil, connection: [], content_length: nil, data_buffer: [], headers_buffer: [], method: "GET", ref: #Reference<0.578604679.268435457.124554>, state: :status, status: nil, transfer_encoding: [], version: nil}, requests: {[], []}, scheme_as_string: "https", socket: {:sslsocket, {:gen_tcp, #Port<0.69>, :tls_connection, :undefined}, [#PID<0.2708.0>, #PID<0.2707.0>]}, state: :closed, streaming_request: nil, transport: Mint.Core.Transport.SSL}, %Mint.TransportError{reason: {:tls_alert, {:certificate_required, 'TLS client: In state connection received SERVER ALERT: Fatal - Certificate required\n'}}}, []} (slipstream 1.0.0) lib/slipstream/connection/pipeline.ex:140: Slipstream.Connection.Pipeline.decode_message/1 (slipstream 1.0.0) lib/slipstream/connection/pipeline.ex:34: anonymous fn/1 in Slipstream.Connection.Pipeline.handle/2 (slipstream 1.0.0) lib/slipstream/connection/telemetry.ex:29: anonymous fn/2 in Slipstream.Connection.Telemetry.span/2 (telemetry 1.1.0) /Users/tonnenpinguin/src/cardreader/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3 (slipstream 1.0.0) lib/slipstream/connection/telemetry.ex:25: Slipstream.Connection.Telemetry.span/2 (stdlib 4.0) gen_server.erl:1120: :gen_server.try_dispatch/4 (stdlib 4.0) gen_server.erl:1197: :gen_server.handle_msg/6 (stdlib 4.0) proc_lib.erl:240: :proc_lib.init_p_do_apply/3 Last message: {:ssl_error, {:sslsocket, {:gen_tcp, #Port<0.69>, :tls_connection, :undefined}, [#PID<0.2708.0>, #PID<0.2707.0>]}, {:tls_alert, {:certificate_required, 'TLS client: In state connection received SERVER ALERT: Fatal - Certificate required\n'}}}



No changes have been made on the server side, and after reverting everything works as expected again.

On [Slack](https://elixir-lang.slack.com/archives/CANV3GCAV/p1655728265549179) @lostkobrakai suggested this might have something to do with the changes made to ssl in OTP 25.
After some further digging we found out that the underlying problem isn't triggered when the server is running OTP23.

This is very problematic, since the way slipstream works in all but the latest version (1.0.1, released yesterday) we actually do not recover from a failed connection attempt, meaning devices might not be able to connect to NH until they are rebooted (and the server is set to run on OTP 23).

To be clear the problem doesn't seem to be specific to nerves_system_rpi4, but rather a general interop issue between OTP 25 clients and OTP 24/OTP 25 server versions, but I wanted to raise a flag anyway  

### Expected behavior
For devices to connect to nerves hub :)
fhunleth commented 2 years ago

@tonnenpinguin We've been using OTP 25 quite a while and haven't seen this issue. Since the error is "SERVER ALERT: Fatal - Certificate required", I'd verify that the device's certificate is being passed. However, I don't know why it would change between OTP versions. I think that I'd check if any dependencies need to be updated. Maybe one's out of date and it's important and the slipstream or mint or nerves_hub_link isn't tight enough on its dependency settings to force it?

@jjcarstens Any ideas?

fhunleth commented 2 years ago

I spoke offline with @jjcarstens. I understand that there's been quite a bit of debug and there's some SSL/TLS incompatibility between OTP 23 and OTP 24/25 between devices and servers that is unknown.

tonnenpinguin commented 2 years ago

Yeah, everything works great as long as the server is running OTP >=23, which I think is the case with your installation.

I created a bare minimum project that triggers the issue. All you need to do is follow the setup as documented in poser and copy over the ssl & nerves-hub directories.

Seems like it has nothing to do with nerves at all and the bug is either in mint or in otp/ssl.

FWIW Here's two traces with different client/server version pairs

This works otp25client_otp23server.txt This fails otp25client_otp24server.txt

fhunleth commented 2 years ago

Based on the trace, it looks like the server is passing the certificate_authorities extension with OTP 24 AND the client pays attention to it. The server shouldn't do this. The Erlang SSL docs make it sound like it defaults to off. However, this code really looks like it's always sent with TLS1.3, but I didn't trace to see if the certificate_request/4 function is the right one to look at.

jjcarstens commented 2 years ago

I have a recon trace as well as some Wireshark captures (yes, the trace has lots of cert/key data but its only used locally so #yolo) poser.log wireshark_captures.zip

tonnenpinguin commented 2 years ago

@fhunleth yeah I think that's pretty much it. To double check I ran NHW with OTP 24.1 (<24.3 when they introduced the certificate_authorities field) and everything works as expected.

From the commit message that introduced the code you posted:

In TLS-1.3 consider certificate_authorities extension if present in client hello or certificate request. Add the extension in certificate requests, and possibility to have it in client hello.

Looking through the changes it seems like that the option is indeed respected on the initial ClientHello sent by the client, but the server always responds with the list as you pointed out.

tonnenpinguin commented 2 years ago

I think the problem might be an overly strict interpretation of the RFC in this piece of code

In section 4.3.2 to the RFC states

The "certificate_authorities" extension is used to indicate the certificate authorities (CAs) which an endpoint supports and which SHOULD be used by the receiving endpoint to guide certificate selection.

The way things are implemented the certificate_authorities field not only guides the cert selection, but filters out all certs, even if we might have one available. Should we create a ticket in https://github.com/erlang/otp for this?

I could do it but to be honest I'm not fully clear on why the client certs aren't part of the certificate chain yet. Shouldn't the Device signer cert be an intermediate CA for the Device Root CA? In my understanding that would mean that the ssl_certificate:certificate_chain check should succeed for the device cert, no?

fhunleth commented 2 years ago

The device certificate chain is only used to automatically register devices with NervesHub. In the multi-homed use case, the CA chain depends on whatever certs the user uploads to NH. NervesHub pins device certificates once it knows about the device and auths against that. The CA chain certs can expire, be deleted, or withdrawn. It doesn't matter since after first registration they're unused. In fact, if you upload device certs through the UI or API, NH doesn't need to check the chain since you've authenticated with NH and are telling NH to trust the device cert that you gave it.

The main issue is on the server side for including the certificates. I see that there's a "SHOULD" vs. "MUST" interpretation on how the certificate_authorities extension is processed on the device side. I don't feel like I know the ramifications elsewhere on ignoring the extension. It seems wrong to ignore since the server is trying to tell you what it accepts to save the client time in supplying certs that won't work. In our case the server is supplying the wrong certificates in the certificate_authorities extension and it can't supply acceptable CA certs since that list might be very long. I think that it's better to drop the extension.

What I'd do is comment out the line in OTP that includes the certificate_authorities extension in TLS 1.3 to confirm that that fixes the issue. Then if it does, create a ticket on OTP that the extension should be optionally included like is done for TLS 1.2. @jjcarstens was going to try commenting out the line, but I don't know if he has gotten to it yet.

jjcarstens commented 2 years ago

I've confirmed @fhunleth theory - Removing the certificate_authories from the extensions fixes the issue and allows the device to connect.

diff --git a/lib/ssl/src/tls_handshake_1_3.erl b/lib/ssl/src/tls_handshake_1_3.erl
index 87855f2989..8204604a88 100644
--- a/lib/ssl/src/tls_handshake_1_3.erl
+++ b/lib/ssl/src/tls_handshake_1_3.erl
@@ -213,10 +213,10 @@ certificate_request(SignAlgs0, SignAlgsCert0, CertDbHandle, CertDbRef) ->
     SignAlgsCert = filter_tls13_algs(SignAlgsCert0),
     Extensions0 = add_signature_algorithms(#{}, SignAlgs),
     Extensions = add_signature_algorithms_cert(Extensions0, SignAlgsCert),
-    Auths = ssl_handshake:certificate_authorities(CertDbHandle, CertDbRef),
+    _Auths = ssl_handshake:certificate_authorities(CertDbHandle, CertDbRef),
     #certificate_request_1_3{
       certificate_request_context = <<>>,
-      extensions = Extensions#{certificate_authorities => #certificate_authorities{authorities = Auths}}}.
+      extensions = Extensions#{}}.
For records sake, this is what that `Auths` value looks like in my request ``` AUTHS: [{rdnSequence, [[{'AttributeTypeAndValue', {2,5,4,10}, {utf8String,<<"NervesHub">>}}], [{'AttributeTypeAndValue', {2,5,4,3}, {utf8String,<<"NervesHub Device Root CA">>}}]]}, {rdnSequence, [[{'AttributeTypeAndValue', {2,5,4,10}, {utf8String,<<"NervesHub">>}}], [{'AttributeTypeAndValue', {2,5,4,3}, {utf8String,<<"NervesHub User Root CA">>}}]]}, {rdnSequence, [[{'AttributeTypeAndValue', {2,5,4,10}, {utf8String,<<"NervesHub">>}}], [{'AttributeTypeAndValue', {2,5,4,3}, {utf8String,<<"NervesHub Server Root CA">>}}]]}, {rdnSequence, [[{'AttributeTypeAndValue', {2,5,4,10}, {utf8String,<<"NervesHub">>}}], [{'AttributeTypeAndValue', {2,5,4,3}, {utf8String,<<"NervesHub Root CA">>}}]]}, {rdnSequence, [[{'AttributeTypeAndValue', {2,5,4,10}, {utf8String,<<"AlloyAccess">>}}], [{'AttributeTypeAndValue', {2,5,4,3}, {utf8String,<<"QA Signer">>}}]]}] ```
pojiro commented 2 years ago

@tonnenpinguin @fhunleth @jjcarstens

Currently I try to update self-hosted NervesHub for dev to follow upstream v1.0.2, and I also faced this issue too. This thread is very helpful me, thank you.

I confirmed that NervesHub v1.0.2 with below workaround setting could handshake with client, rpi3 1.17.3(OTP 24.1.4), 1.19.0(OTP 25.0) and 1.20.0(OTP 25.0.2).

# versions on Dockerfile header
ARG ELIXIR_VERSION=1.13.3
ARG ERLANG_VERSION=24.2.2  # this works fine.
ARG ALPINE_VERSION=3.15.0
ARG NODE_VERSION=16.14.2

I don't fully understand the issue details, but I hope this information will give you some ideas for workarounds.

jjcarstens commented 2 years ago

Until the issues in OTP are resolved, the current workaround is to use a lower supported OTP (< 24.3) or force TLS 1.2 on your device with versions: [:"tlsv1.2"] in your :ssl options

jjcarstens commented 2 years ago

Fixes are in place with https://github.com/erlang/otp/pull/6228 - Once that is merged into OTP ~25.1~ 25.2, we can alter the NervesHubWeb configuration to use certificate_authorities: false.

Closing this issue for now since it is not a rpi4 specific case and we'll adjust in NervesHub when available.

tonnenpinguin commented 2 years ago

Thanks Jon! One minor clarification for future readers: The flag is set to be added in OTP 25.2 :)

jjcarstens commented 2 years ago

🤦 derp - yes. I updated the comment