Windows: When loading an Identity with from_pkcs8(), running multiple servers generates handshake errors

jfaust commented 1 year ago

I initially posted this at https://github.com/steffengy/schannel-rs/issues/95, but I now believe this is a higher-level issue.

I have a project where I'm generating a self-signed certs with rcgen and then using that cert with rust-native-tls & hyper. I've run into some problems when testing that on Windows that I've isolated into a repro.

To reproduce (the order of operations here is important):

Clone https://github.com/jfaust/rust-self-signed-native-tls
In one shell, run cargo run --example server 12345
In another shell, run curl -v --insecure https://localhost:12345/ - it should succeed. This is just to confirm that everything seems to be working. Running that command over and over works just fine.
In another shell, run cargo run --example server 12346

Now run that same curl command again (curl -v --insecure https://localhost:12345/). I get this error:

curl -v --insecure https://localhost:12345/
*   Trying 127.0.0.1:12345...
* Connected to localhost (127.0.0.1) port 12345 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
* [CONN-0-0][CF-SSL] TLSv1.0 (OUT), TLS header, Certificate Status (22):
* [CONN-0-0][CF-SSL] TLSv1.3 (OUT), TLS handshake, Client hello (1):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):
* [CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Server hello (2):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS handshake, Certificate (11):
* [CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS header, Unknown (21):
* [CONN-0-0][CF-SSL] TLSv1.2 (OUT), TLS alert, decrypt error (563):
* OpenSSL/3.0.8: error:02000086:rsa routines::last octet invalid
* Closing connection 0
curl: (35) OpenSSL/3.0.8: error:02000086:rsa routines::last octet invalid

Sometimes it's last octet invalid, sometimes first octet invalid, sometimes data too large for modulus.

If you restart the first server, the curl command will work again. However, if you now run it against the second server (port 12346), that one no longer works (you can test it before step (5) to confirm it worked initially).

What's happening is that both processes are using the same container name for their certificate store. Since on Windows this name maps to a certificate store that is shared across processes, my hypothesis is that the certs being added end up conflicting.

That seems to be borne out - if I change schannel::gen_container_name() to use a UUID instead of a counter, everything works great. I don't have a deep enough understanding of Windows Crypto to know if that's a good answer though (I suspect not). Any ideas on what exactly is conflicting here? And if there's some way I can prevent the conflict when generating the certificate?

jfaust commented 12 months ago

After doing a little research, it feels like there are a couple options here:

Add some ability for the user to specify the base container name on windows - e.g. to replace "native-tls-" in gen_container_name().
Always use the new_keyset option and try to find a keyset name that doesn't exist yet (e.g. through repeated calls to gen_container_name(), and/or addition of a random factor in the container name).

A combination of both seems like the most flexible option.

Since I don't think there's any time when native-tls loads a cert store on Windows without immediately adding a key to it, deleting the keyset when the Identity is dropped would help prevent leaking these keysets as well. This isn't a full solution since a crash would still leak. I think this requires an addition to the schannel-rs API. It's unfortunate there's no CRYPT_TEMPORARYKEYSET option for CryptAcquireContext that doesn't store the keyset on disk.

jfaust commented 12 months ago

Looks like importing an identity via from_pkcs12 shouldn't have this problem? I'd guess it should also being using the no_persist_key key option in PfxImportOptions

sfackler commented 12 months ago

The Windows identity implementation has to go through some pretty horrible hoops - schannel really assumes you're using a key loaded at the system level.

Swapping the sequence counter out for a UUID makes sense to me as a solution, though I am a bit paranoid that this process leaks persistently into the global Windows environment.

I would expect that loading a PKCS#12 identity would work better than PKCS#8, yeah.

jfaust commented 12 months ago

Now that I know a bit more, I'd be hesitant to use a UUID there because it'll leak the keyset every run (leaving an extra file on disk).

Honestly for my use case I think it makes sense to switch to rustls to sidestep the whole thing.

sfackler commented 12 months ago

That's definitely the way to go if you have the option.

jfaust commented 12 months ago

Okay, success using rustls - I'll let you decide whether to close this. I can definitely see someone else running into this at some point, but I also don't know that there's a great solution.

sfackler commented 12 months ago

Yeah it's definitely a problem we'll want to fix.

sfackler / rust-native-tls

Windows: When loading an Identity with from_pkcs8(), running multiple servers generates handshake errors #275