sberyozkin opened 1 month ago
I've got a notification from Let's Encrypt that a staging account for my NGrok domain is going to be removed, but interestingly, this message was included:
We recommend renewing certificates automatically when they have a third of their total lifetime left. For Let's Encrypt's current 90-day certificates, that means renewing 30 days before expiration. See https://letsencrypt.org/docs/integration-guide/ for details.
I think this is relevant to this enhancement request. @cescoffier, @pragmasoft-ua, FYI.
First, the message you got is the regular recommendation. "Automatic" can mean a cron job (or a Kubernetes job).
My main issue with your proposal is the requirement to re-implement a complete Let's Encrypt client, as there is no way to use Elytron at runtime without paying a considerable price. It is doable, but it cannot be done in a few hours (and will require exhaustive testing).
..as there is no way to use Elytron at runtime without paying a considerable price. It is doable, but it cannot be done in a few hours (and will require exhaustive testing).
Can you please explain what exactly the problem with Elytron is, so that maybe we can suggest alternatives? Is Elytron not a runtime library? Can't we rely on acme4j instead? Don't we use Elytron at runtime anyway for certificate management? Is it non-modular? Does it have too many dependencies? Too large a memory footprint? Native dependencies? Or is the problem not with Elytron itself but with some time constraints you seem to have?
Also, it would be good to know what other constraints you have, because designing an architecture is really about finding a good compromise.
Let's first brainstorm the problem from a wide angle, then concentrate on the most promising options.
Broadly, I see the following options:
Elytron has too many dependencies (and some are quite sensitive), and the HTTP client used under the hood is different from our recommended HTTP client (so it would be unmanaged). Acme4J has the same drawback.
I would go with a new client implemented on top of the Vert.x HTTP client (not even the web client, to avoid an extra dependency), with a specific configuration to keep the connection to the ACME server open, avoid connection pooling, and so on. It must be completely reactive to avoid using the worker threads used by the application (especially during renewal).
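A minimal sketch of that direction, assuming Vert.x 4's plain `HttpClient` (the staging host and `/acme/new-nonce` path are the standard Let's Encrypt ACME endpoints; the options shown are illustrative, not the actual implementation):

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpClient;
import io.vertx.core.http.HttpClientOptions;
import io.vertx.core.http.HttpMethod;

public class AcmeClientSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        HttpClientOptions options = new HttpClientOptions()
                .setSsl(true)
                .setDefaultHost("acme-staging-v02.api.letsencrypt.org")
                .setDefaultPort(443)
                .setKeepAlive(true)  // keep the connection to the ACME server open
                .setMaxPoolSize(1);  // a single connection, no real pooling
        HttpClient client = vertx.createHttpClient(options);

        // Fully reactive: nothing here ever blocks an application worker thread.
        client.request(HttpMethod.HEAD, "/acme/new-nonce")
              .compose(req -> req.send())
              .onSuccess(resp -> System.out.println(
                      "Replay-Nonce: " + resp.getHeader("Replay-Nonce")))
              .onFailure(Throwable::printStackTrace);
    }
}
```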
There are also nonces to generate, which may block when entropy is lacking. We would need to use something like https://vertx.io/docs/apidocs/io/vertx/ext/auth/VertxContextPRNG.html (but obviously not the deprecated version).
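A sketch of what that could look like, assuming the newer io.vertx.ext.auth.prng.VertxContextPRNG from vertx-auth-common (rather than the deprecated class linked above):

```java
import io.vertx.core.Vertx;
import io.vertx.ext.auth.prng.VertxContextPRNG;

public class PrngSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        // VertxContextPRNG is seeded without blocking, so callers never park
        // a thread waiting for entropy.
        VertxContextPRNG prng = VertxContextPRNG.current(vertx);
        byte[] random = new byte[32];
        prng.nextBytes(random);
        System.out.println("Generated " + random.length + " random bytes");
    }
}
```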
What's wrong with the acme4j HTTP connector? It has an interface abstraction and a java.net.http-based default implementation, and thus no external HTTP client dependencies besides the Java runtime library.
Why are you even trying to optimize the HTTP client, given that it will make at most a handful of calls over several months?
Frankly, to me this looks like personal or political reasons, not technical ones.
There are multiple issues with acme4j. First, the default (the JDK HTTP client) is something we want to use sparingly at runtime (basically, we even recommend not using it). That client has many issues, especially regarding connection management (or rather the lack of it - if in doubt, ask @jponge, who fought with it for a few months). Second, the API exposed by the abstractions is imperative (and blocking), meaning the integration will be imperative and blocking. The main consequence is that the process needs to be executed on a worker thread. As you said, it should happen rarely and not be an issue... until it becomes one. We had to change multiple extensions in Quarkus because of this. When an application runs under load, using worker threads for management tasks should be avoided. For example, we had to change OIDC, health, and metrics (metrics use a scraping approach) for that exact reason.
Let's imagine we ignore this. From my observation, the complete ACME exchange can take 30-40s. So we would be monopolizing a worker thread for half a minute. That can be seen as nothing, or it can be terrible for an application under load (for those 30 seconds, that thread will not be able to process requests).
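To make that concrete, here is roughly what a blocking integration forces on Quarkus (a minimal sketch; renewCertificateBlocking() is a hypothetical stand-in for the acme4j flow):

```java
import io.vertx.core.Vertx;

public class BlockingAcmeSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        // The imperative flow has to be pushed to the worker pool via
        // executeBlocking; that worker thread is then held for the whole
        // 30-40s exchange.
        vertx.<String>executeBlocking(promise ->
                promise.complete(renewCertificateBlocking()))
             .onSuccess(cert -> System.out.println("Renewed: " + cert));
    }

    // Hypothetical stand-in for the blocking acme4j order/challenge flow.
    static String renewCertificateBlocking() {
        return "certificate";
    }
}
```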
It should not be a problem with virtual threads, as threads are cheap. I'm not going to dive into the details and debate whether a virtual thread is a thread; I've written several blog posts about that subject. While this would improve the situation, virtual threads can only be used in some places (Java 21+; in theory Java 19+, but I would not recommend using them on 19). Also, parts of the ACME process are CPU-intensive and will monopolize the virtual thread and, thus, the carrier thread. So, it would require a bit of tuning to avoid carrier thread starvation.
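As a sketch of that option (Java 21+; renewCertificateBlocking() is again a hypothetical stand-in):

```java
public class VirtualThreadAcmeSketch {
    public static void main(String[] args) throws InterruptedException {
        // Java 21+: park the blocking exchange on a virtual thread; blocking
        // there is cheap, but the CPU-heavy crypto steps still pin the
        // carrier thread while they execute.
        Thread t = Thread.ofVirtual().name("acme-renewal")
                .start(VirtualThreadAcmeSketch::renewCertificateBlocking);
        t.join();
    }

    // Hypothetical stand-in for the blocking ACME exchange.
    static void renewCertificateBlocking() {
        System.out.println("Renewing...");
    }
}
```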
But let's go back to acme4j. The HTTP transport is not the only issue. It uses Bouncy Castle (BC). BC is great (the current support uses it in the CLI). Still, we try to reduce its usage at runtime (the CLI being an external process), especially in the core part of Quarkus (the Let's Encrypt support we are discussing should be located in the Vert.x HTTP extension). We are still struggling to see whether BC can be used in a FIPS environment, and it's relatively big. Now, could it be done in a separate extension, which would avoid that issue? I think so, but it's a bit too early to say. The logging is not convenient either, but let's say that's not really a big problem.
So, basically, we have two issues with acme4j:
- it forces blocking I/O and thus uses worker threads
- it relies on BC, which would mean we need to move that process to a separate extension
Frankly, to me this looks like personal or political reasons, not technical ones.
Can you please avoid that kind of comment? That's the second time already. I've provided an explanation each time. Maybe not enough, and I'm sorry about that (happy to clarify). Now, if you disagree with my recommendation, that's fine and fair. As I said, it can be done in a completely separate extension, and there is no problem with having an extension doing that in the quarkiverse. In Quarkus Core, there are many concerns to take into account.
- it forces blocking I/O and thus uses worker threads
I don't agree this is the case. A blocking interface doesn't force blocking I/O (it's easy to adapt async I/O into a blocking call). Nor does it mandate blocking worker threads - you m.
- it relies on BC, which would mean we need to move that process to a separate extension
This makes sense anyway. In 90% of cases, HTTPS will be terminated by some external reverse proxy, load balancer, CDN, or something else. I see no reason to add core functionality that will not be used frequently. Also, do you know why BC is used by acme4j in the first place? Won't we need it anyway, even with our custom implementation?
Can you please avoid that kind of comment?
OK, but it makes sense to distinguish technical from political constraints early, and to admit that the latter may exist. While technical constraints can be discussed, the political ones we can only accept.
I don't agree this is the case. A blocking interface doesn't force blocking I/O (it's easy to adapt async I/O into a blocking call). Nor does it mandate blocking worker threads - you m.
Sure, but the process would still need to be called on a worker thread, because it expects the I/O to respond in an imperative fashion. I can use a non-blocking client, but it won't help at all, as I still need to block a worker thread to get the response.
So, even if I implement the transport SPI with a non-blocking client, the SPI expects the response in an imperative fashion (`res = callRemoteService()`). This means I need to block the caller thread until I get the response.
It's a technique we use (even abuse) with virtual threads, because blocking a virtual thread is cheap. But in the context of a regular worker thread (by regular, I mean a platform thread), it should be avoided (for the reasons I explained).
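Concretely, the adaptation under discussion boils down to a helper like this (a sketch; cheap when the caller is a virtual thread, problematic on a platform worker thread):

```java
import io.vertx.core.Future;

public class AwaitSketch {
    // Bridges a non-blocking Vert.x Future into the imperative style an SPI
    // like acme4j's expects: the caller thread parks until the result is in.
    static <T> T await(Future<T> future) throws Exception {
        return future.toCompletionStage().toCompletableFuture().get();
    }

    public static void main(String[] args) throws Exception {
        // On a platform worker thread, this get() is exactly the 30-40s
        // monopolization described above; on a virtual thread it is cheap.
        String response = await(Future.succeededFuture("res"));
        System.out.println(response);
    }
}
```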
Also, do you know why BC is used by acme4j in the first place? Won't we need it anyway, even with our custom implementation?
Honestly, I'm not totally sure why BC is required. I know there is random number generation involved (I mentioned this above), but there might be more.
Description
The initial Quarkus Let's Encrypt feature uses the TLS registry Let's Encrypt route to manage ACME challenges, but expects the Quarkus CLI to initiate the initial certificate acquisition and subsequent renewals. This allows full admin control over the way the Let's Encrypt process is managed and over where and how the account material is stored (right now it is a `.letsencrypt` folder, with more options to be added as needed). However, in some cases it would be nice to have the Quarkus application itself, with help from the TLS registry, do everything that needs to be done for Let's Encrypt certificate renewal to work: create an account, initiate the initial certificate request, and run a Vert.x timer-scheduled certificate renewal.
This enhancement request is about exploring what can realistically be done to support a completely automated Let's Encrypt flow, how to manage the permissions required to store the account and the key pairs, and deciding on the target deployments where such an option can be recommended.
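For illustration, a minimal sketch of the timer-driven renewal (the renewCertificate() method is hypothetical; the 90-day/30-day numbers follow the Let's Encrypt recommendation quoted above):

```java
import io.vertx.core.Vertx;
import java.time.Duration;

public class RenewalTimerSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        // Renew when a third of the 90-day lifetime is left, i.e. 30 days
        // before expiration, so the timer fires 60 days after issuance.
        long renewAfterMs = Duration.ofDays(90).minusDays(30).toMillis();
        vertx.setTimer(renewAfterMs, id -> renewCertificate());
    }

    // Hypothetical: would run the ACME order/challenge/finalize flow.
    static void renewCertificate() {
        System.out.println("Renewing certificate...");
    }
}
```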
Implementation ideas
Start with an `experimental` automated Let's Encrypt feature in dev mode only. Add a configuration group for configuring details such as the account email, staging or production Let's Encrypt server, target folder, renewal timer, etc.