omega8cc / boa

Barracuda Octopus Aegir 5.2.0
https://omega8.cc/compare

Let's Encrypt issues #1780

Closed serrato-dan closed 1 month ago

serrato-dan commented 1 month ago

I just want to see if anyone else is having issues getting certificates from Let's Encrypt. We have successfully received certs for several years in our BOA setups, but for some reason this last week, we are getting failures during the challenge validation. We are getting the message that it's likely a firewall issue, but we are pretty sure our network team hasn't changed anything.

Just wondering if it's something new that others are experiencing as well.

Thanks. Dan

omega8cc commented 1 month ago

Please check details in the /var/xdrago/log/daily/daily*.log files to determine the culprit.

Sent with GitHawk

serrato-dan commented 1 month ago

Thank you for pointing out the logs to check. I'm not sure those logs help in this case, because I'm requesting a Let's Encrypt certificate from the admin interface and validation is failing. For a site that has failed and never received a cert, the log shows nothing except that the site is listed. For a site that already has its certificate, the log shows the "Running LE cert check directly..." step, followed by the checks on domain names and the expiry date. For a site that never received a cert, the log only lists it as one of the sites being counted.

When I turn on encryption for a site and run the verify task, it fails to receive a cert. The main part of the failure is below (I've left out domains and other identifying details):

Challenge validation has failed :(
ERROR: Challenge is invalid! (returned: invalid) (result:
["type"] "http-01"
["status"] "invalid"
["error","type"] "urn:ietf:params:acme:error:connection"
["error","detail"] "During secondary validation: 206.XX.XXX.XXX: Fetching http://www.DOMAIN.org/.well-known/acme-challenge/LONG_KEY_THAT_I_REMOVED: Timeout during connect (likely firewall problem)"
["error","status"] 400
["error"] {"type":"urn:ietf:params:acme:error:connection","detail":"During secondary validation: 206.XX.XXX.XXX: Fetching http://www.DOMAIN.org/.well-known/acme-challenge/LONG_KEY_THAT_I_REMOVED: Timeout during connect (likely firewall problem)","status":400}
["url"]

I've been working with my network team to figure it out, and they see that traffic from acme-v02.api.letsencrypt.org is being allowed through.

I've been reading that the process may include secondary checks coming from other locations around the world, and I'm wondering whether the firewall is set to block traffic from other regions.
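The phrase "During secondary validation" in the error above is worth separating out, since it indicates the failing request came from one of Let's Encrypt's additional vantage points rather than from its primary validator. A minimal sketch of that triage in shell follows; the function name and the three labels are made up for illustration and are not part of BOA or any ACME client.

```shell
# Classify an ACME http-01 failure from its "detail" string.
# Hypothetical helper: prints secondary-timeout, primary-timeout, or other.
classify_acme_detail() {
  detail=$1
  case $detail in
    *"During secondary validation"*"Timeout during connect"*)
      # A remote vantage point could not open a connection to the server.
      echo "secondary-timeout" ;;
    *"Timeout during connect"*)
      # The timeout was not attributed to a secondary vantage point.
      echo "primary-timeout" ;;
    *)
      echo "other" ;;
  esac
}

# Usage:
#   classify_acme_detail "During secondary validation: ... Timeout during connect (likely firewall problem)"
```

A "secondary-timeout" result suggests looking for region-based or cloud-provider-based deny rules in the firewall, not only for rules matching acme-v02.api.letsencrypt.org.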

omega8cc commented 1 month ago

The system logs are easier to check than expanding lines in the task log. I couldn’t reproduce this, though, so the problem may be related to the firewall, provided you have checked that all site aliases have valid DNS resolving to the server's public IP. Have you run complete upgrades to the current release?

serrato-dan commented 1 month ago

I have upgraded one of my servers to 5.2.0 Lite and the other is still on 5.1.0 Head. It's happening on both servers.

We've been making sure the site aliases resolve to the server's IP address. Following the documentation, we understand that to mean the main domain has its @ record pointing to the server's IP address, plus an extra A record for www pointing to the same IP.
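One way to confirm the A records described above is to compare what a resolver returns with the server's public IP. The sketch below assumes `dig` is installed; `ip_in_answers` and `check_alias` are hypothetical helper names, and the host names and address in the usage comment are placeholders.

```shell
# Return 0 if the expected IP appears as a full line in the resolver output.
ip_in_answers() {
  expected=$1
  answers=$2
  printf '%s\n' "$answers" | grep -Fxq -- "$expected"
}

# check_alias HOST EXPECTED_IP: look up the A records with dig and report.
check_alias() {
  host=$1
  expected=$2
  answers=$(dig +short A "$host")
  if ip_in_answers "$expected" "$answers"; then
    echo "OK   $host -> $expected"
  else
    echo "FAIL $host -> ${answers:-no A record}"
  fi
}

# Usage (placeholder values):
#   check_alias example.org 203.0.113.10
#   check_alias www.example.org 203.0.113.10
```

Matching whole lines avoids a false positive when one address is a prefix of another, for example 203.0.113.1 against 203.0.113.10.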

omega8cc commented 1 month ago

Since we don’t see this on any server across all locations, it’s probably an issue with your network firewall.

serrato-dan commented 1 month ago

One thing I'm wondering about given that it's a 400 error and it's not getting what it's looking for...

In the error it says:

Fetching http://www.DOMAIN.org/.well-known/acme-challenge/LONG_KEY_THAT_I_REMOVED: Timeout during connect (likely firewall problem)" ["error","status"] 400 ["error"]

Is it actually trying to fetch from the .well-known folder shown in the message? I don't see that kind of folder in my actual sites. Or is that just part of the process, with the check really happening in the /tools/le/certs folder and others?

Just wondering if the system is not creating the .well-known folder in the site and therefore not getting the challenge key.

omega8cc commented 1 month ago

It’s an alias mapped to the certs directory, it doesn’t exist within the site directory.

If the problem affects only some sites and not others, then you could try to disable encryption, verify, and then enable it again. It's a kind of soft reset of the configuration, but I can’t see anything else to suggest without checking your system.

Also purge the local firewall with csf -df
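Because the challenge path is served through an alias, a quick reachability check is to request a token that does not exist and see whether the web server answers at all. The sketch below is only meaningful when run from a machine outside the network, ideally in another region; the token name `reachability-test` and both function names are made up for illustration.

```shell
# Interpret one probe of the challenge path.
# $1 = curl exit status, $2 = HTTP status code printed by curl.
interpret_probe() {
  curl_exit=$1
  http_code=$2
  if [ "$curl_exit" -eq 28 ] || [ "$curl_exit" -eq 7 ]; then
    # curl exit 28 = operation timed out, 7 = could not connect
    echo "unreachable: connection timed out or failed (check firewalls)"
  elif [ "$curl_exit" -ne 0 ]; then
    echo "inconclusive: curl exited with status $curl_exit"
  else
    # Any HTTP answer, even 404 or a redirect, proves port 80 is reachable.
    echo "reachable: the web server answered with HTTP $http_code"
  fi
}

# probe_challenge_path HOST: fetch a non-existent token with a 10 s limit.
probe_challenge_path() {
  host=$1
  url="http://$host/.well-known/acme-challenge/reachability-test"
  http_code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
  interpret_probe "$?" "$http_code"
}

# Usage (placeholder host):
#   probe_challenge_path www.example.org
```

A 404 here is a good sign: it means the request reached Nginx. A timeout from some locations but not others points at filtering between the client and the server.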

serrato-dan commented 1 month ago

We tried disabling, verifying, and re-enabling, with no success. But we are actively trying to figure it out.

hmmmm... yea, maybe local firewall. thanks, I'll check that as well.

serrato-dan commented 1 month ago

Dang... clearing the local firewall didn't work either. Would there be any benefit from adding the /tools/le/.ctrl/ssl-demo-mode.pid back in and then removing it? Would that maybe re-establish with Let's Encrypt?

omega8cc commented 1 month ago

You could try, but I doubt it will help if the errors are about access and not the LE account.

I would rather start with a cold Nginx restart to make sure it's not something related to Nginx memory/cache:

service nginx stop
killall -9 nginx

serrato-dan commented 1 month ago

going to try this.

Also, when I type 'nginx' I get:

nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] still could not bind()

omega8cc commented 1 month ago

You shouldn’t type nginx; stop and kill it with the commands we have listed.
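The bind() errors above mean something already holds ports 80 and 443 (errno 98 is "Address already in use"), most likely the Nginx instance that is still running, so typing nginx only tries to start a second copy. A small sketch for confirming which process owns those ports, assuming the `ss` tool from iproute2 is available; `web_port_listeners` is a hypothetical helper name.

```shell
# Filter `ss -ltnp` output (read from stdin) down to listeners on
# ports 80 and 443. Column 4 of that output is "Local Address:Port".
web_port_listeners() {
  awk '$4 ~ /:(80|443)$/'
}

# Usage (run as root so that process names and PIDs are shown):
#   ss -ltnp | web_port_listeners
```

If the listed process is nginx, the service is already up and the stop and kill commands given earlier are the way to restart it cold.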

serrato-dan commented 1 month ago

Thanks. Now I realize that using just 'nginx' wasn't helping.

I ran the commands and then did a restart. Still not working.

Thank you so much for all the feedback and the assistance. I just wish we could figure out what's going on. It's so frustrating. So, I really appreciate your attempts to assist. Thanks.

serrato-dan commented 1 month ago

Breakthrough!! For anyone else who may need this kind of info in the future: in the back of my mind, the firewall was always the suspect, but our network team saw that acme-v02.api.letsencrypt.org was making it through, which led us down a different path, checking lots of things: server, nginx, updates, DNS, aliases, etc.

Our network team kept at it, filtered the firewall logs down, and found some denies for out-of-country connections from AWS and Google data centers in Sweden and elsewhere.

Just want to thank you all for taking the time to listen to my desperate questions and give feedback. I appreciate it greatly.

And I appreciate this project so much. It helps our organization serve our community.