Closed markgoddard closed 6 months ago
many thanks @markgoddard :beer: I am currently testing this branch with this inventory:
# Example inventory for deployment to a single host (localhost).
# HAProxy load balancer.
# Should contain exactly one host.
[haproxy]
activeh
# Jaeger distributed tracing UI.
# Should contain at most one host.
[jaeger]
activeh
# Minio object storage service (for test & development only).
# Should contain at most one host.
[minio]
activeh
# Prometheus monitoring service.
# Should contain at most one host.
[prometheus]
activeh
# Reductionist servers.
# May contain multiple hosts.
[reductionist]
activeh
active2
active3
# Step Certificate Authority (CA).
# Should contain exactly one host.
[step-ca]
activeh
# Do not edit.
[step:children]
reductionist
# Do not edit.
[docker:children]
haproxy
jaeger
minio
prometheus
reductionist
step-ca
but it's currently hanging at "Gathering Facts":
[vpredoi@activeh ~]$ ansible-playbook -i reductionist-rs/deployment/inventory reductionist-rs/deployment/site.yml
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
PLAY [Install Docker] ******************************************************************************************************************************
TASK [Gathering Facts] *****************************************************************************************************************************
ok: [active2]
ok: [active3]
I can send you the debug log but there are no critical issues there, just a lot of fluff, and the ssh connections to active2 and active3 work fine. Any clues?
also the deps have not been updated, just for logging reasons:
[vpredoi@activeh ~]$ pip install -r reductionist-rs/deployment/requirements.txt
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: ansible-core<2.16 in ./.local/lib/python3.9/site-packages (from -r reductionist-rs/deployment/requirements.txt (line 1)) (2.15.10)
Requirement already satisfied: jinja2>=3.0.0 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (3.1.3)
Requirement already satisfied: PyYAML>=5.1 in /usr/lib64/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (5.4.1)
Requirement already satisfied: cryptography in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (42.0.5)
Requirement already satisfied: packaging in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (24.0)
Requirement already satisfied: resolvelib<1.1.0,>=0.5.3 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (1.0.1)
Requirement already satisfied: importlib-resources<5.1,>=5.0 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (5.0.7)
Requirement already satisfied: MarkupSafe>=2.0 in ./.local/lib/python3.9/site-packages (from jinja2>=3.0.0->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (2.1.5)
Requirement already satisfied: cffi>=1.12 in ./.local/lib/python3.9/site-packages (from cryptography->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (1.16.0)
Requirement already satisfied: pycparser in ./.local/lib/python3.9/site-packages (from cffi>=1.12->cryptography->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (2.22)
:beer:
@valeriupredoi I'd diff your new inventory against your old one. I expect you don't want to deploy minio.
Do you have activeh in /etc/hosts on activeh? Perhaps previously you were referring to it as localhost?
yessir! Here's my hosts file:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.171.169.xxx activeh
192.168.3.xxx active2
192.168.3.xxx active3
with actual numbers, not xxx - no, I was using exactly the same inventory as now, but with three machines listed under HAProxy; /etc/hosts has not changed either
let me try rerun with my previous configuration, see if that goes through (with the error at the end), so we can isolate the issue
right! So the thing now hangs with my old setup from yesterday as well :man_facepalming: Need to see what's happened in the meantime
@markgoddard apols for the tardiness: this is the bit that's hanging:
TASK [Gathering Facts] *****************************************************************************************************************************
task path: /home/vpredoi/reductionist-rs/deployment/site.yml:4
<activeh> ESTABLISH SSH CONNECTION FOR USER: None
<activeh> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o 'ControlPath="/home/vpredoi/.ansible/cp/33982bffcc"' activeh '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''
-> the funny thing is that if I run that command myself, all is fine...am very confused :confounded:
figured it out thanks to @RosalynHatcher whom I owe a massive :beer: - but am back to the Bootstrap CA issue now:
TASK [Bootstrap CA] ********************************************************************************************************************************
skipping: [activeh]
fatal: [active2]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.016529", "end": "2024-04-17 15:35:25.308223", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 15:35:25.291694", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
fatal: [active3]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014800", "end": "2024-04-17 15:35:25.340400", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 15:35:25.325600", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
:confounded:
note that the reductionist verification also fails, for activeh - which is kinda to be expected since the CA massage was skipped there:
TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}
:beer:
Your bootstrapping is failing with network errors:
dial tcp 192.168.3.212:9999: connect: no route to host
Perhaps the port is not open in a firewall/security group?
Also looks like you are not using the changes in this PR because the task to get the fingerprint is skipped on activeh
I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?
yes it is indeed skipped - the problem is with active2 and active3 for Bootstrap CA - but even activeh fails at the end with reductionist unreachable - very possible it's how @bnlawrence has configured active2 and 3?
this is the full section for that part of the run:
TASK [Check whether step has been bootstrapped] ****************************************************************************************************
ok: [active2]
ok: [active3]
ok: [activeh]
TASK [Get CA fingerprint] **************************************************************************************************************************
ok: [activeh]
TASK [Bootstrap CA] ********************************************************************************************************************************
skipping: [activeh]
fatal: [active2]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014868", "end": "2024-04-17 16:14:49.526672", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 16:14:49.511804", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
fatal: [active3]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014415", "end": "2024-04-17 16:14:49.538207", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 16:14:49.523792", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
TASK [Install root certificate to system] **********************************************************************************************************
skipping: [activeh]
...
I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?
Perhaps you still have the code changes you made previously? That task should not be skipped if using this branch.
ok, my mistake - it's the Get CA fingerprint task that should not be skipped, and it's not.
What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535
note that the reductionist verification also fails, for activeh - which is kinda to be expected since the CA massage was skipped there:
TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}
🍺
It's failing with certificate expired. Step CA uses short-lived certificates, so probably the renewal isn't working for some reason.
OK:
[vpredoi@active2 ~]$ sudo firewall-cmd --add-port=9999/tcp
Warning: ALREADY_ENABLED: '9999:tcp' already in 'public'
success
and same for active3, and regarding the current reductionist-rs:
[vpredoi@activeh ~]$ cd reductionist-rs/
[vpredoi@activeh reductionist-rs]$ git status
On branch deployment
Your branch is up to date with 'origin/deployment'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: deployment/inventory
no changes added to commit (use "git add" and/or "git commit -a")
where the inventory file is changed to have it for this deployment. At any rate, I just pulled the latest:
[vpredoi@activeh reductionist-rs]$ git pull origin deployment
From https://github.com/stackhpc/reductionist-rs
* branch deployment -> FETCH_HEAD
Already up to date.
Have you pushed any changes? :beer:
What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535
my bad, I thought the biggest one was 9090 :rofl:
Port 9999 needs to be accessible on activeh. Are you able to curl it (using HTTPS)?
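For anyone without curl to hand, a minimal Python sketch of the same reachability check follows. It only tests whether a TCP connection can be opened, not whether TLS or the service behind it is healthy. The function name `port_is_reachable` is made up for illustration; ports 9999, 8080 and 8081 are the ones mentioned in this thread, and the host in the usage comment is a placeholder.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", "connection refused" and timeouts.
        return False


# Example usage (placeholder host; run from active2 or active3):
#   for port in (9999, 8080, 8081):
#       print(port, port_is_reachable("activeh", port))
```

A `False` result here corresponds to errors such as `no route to host` or `connection refused`, which usually point at a firewall or security group rather than at step-ca itself.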
What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535
my bad, I thought the biggest one was 9090 🤣
That's what they want you to believe
you, sir, are a life-saver! Completely forgot to open port 9999 on activeh - now, massive progress, but it stumbled right at the end:
TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema"}
PLAY RECAP *****************************************************************************************************************************************
active2 : ok=20 changed=2 unreachable=0 failed=0 skipped=7 rescued=0 ignored=0
active3 : ok=20 changed=2 unreachable=0 failed=0 skipped=7 rescued=0 ignored=0
activeh : ok=44 changed=1 unreachable=0 failed=1 skipped=12 rescued=0 ignored=0
Should I regenerate the step root_ca.crt?
:beers:
actually, hang on, just opened 8080 too (was not open :man_facepalming: ) now it clearly says cert is expired:
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}
OK that certificate is not expired:
[vpredoi@activeh ~]$ sudo step-cli certificate inspect root_ca.crt
Certificate:
Data:
Version: 3 (0x2)
Serial Number: hidden
Signature Algorithm: ECDSA-SHA256
Issuer: O=Smallstep,CN=Smallstep Root CA
Validity
Not Before: Apr 16 13:04:45 2024 UTC
Not After : Apr 14 13:04:45 2034 UTC
Subject: O=Smallstep,CN=Smallstep Root CA
Subject Public Key Info:
Public Key Algorithm: ECDSA
Public-Key: (256 bit)
Mark, any clues why the pinger would think it expired?
sorry bud, last message for today I promise (just about to go home): the deployed Reductionist on activeh is using the activeh's backend IP "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema"
- that'll never work AFAIK since the public facing IP is needed (as I found out with the last deployment from a month ago)
OK that certificate is not expired:
[vpredoi@activeh ~]$ sudo step-cli certificate inspect root_ca.crt
Certificate:
Data:
Version: 3 (0x2)
Serial Number: hidden
Signature Algorithm: ECDSA-SHA256
Issuer: O=Smallstep,CN=Smallstep Root CA
Validity
Not Before: Apr 16 13:04:45 2024 UTC
Not After : Apr 14 13:04:45 2034 UTC
Subject: O=Smallstep,CN=Smallstep Root CA
Subject Public Key Info:
Public Key Algorithm: ECDSA
Public-Key: (256 bit)
Mark, any clues why the pinger would think it expired?
That is the root CA certificate, not the server certificate(s). Those are generated by the step CLI in ~/.config/reductionist/certs/ and have a much shorter life.
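To make the distinction concrete, here is a small Python sketch that parses validity dates in the format printed by `step-cli certificate inspect` and checks them against a given time. The helper names are hypothetical, and it only handles the `Mon DD HH:MM:SS YYYY UTC` layout seen in this thread. The run time is taken from the Ansible log timestamps and treated as UTC, which is an assumption.

```python
from datetime import datetime, timezone

# Layout of the "Not Before" / "Not After" values in the inspect output.
STEP_TIME_FORMAT = "%b %d %H:%M:%S %Y UTC"


def parse_step_time(text: str) -> datetime:
    """Parse a validity timestamp such as 'Apr 14 13:04:45 2034 UTC'."""
    parsed = datetime.strptime(text.strip(), STEP_TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(not_after: str, now: datetime) -> bool:
    """Return True if `now` is at or past the certificate's Not After time."""
    return now >= parse_step_time(not_after)


# Root CA dates copied from the inspect output in this thread.
root_not_before = parse_step_time("Apr 16 13:04:45 2024 UTC")
root_not_after = parse_step_time("Apr 14 13:04:45 2034 UTC")
root_lifetime_days = (root_not_after - root_not_before).days

# Approximate time of the failed run (assumed to be UTC).
failed_run = datetime(2024, 4, 17, 16, 14, 49, tzinfo=timezone.utc)
root_expired = is_expired("Apr 14 13:04:45 2034 UTC", failed_run)
```

With these inputs the root CA is not expired at the time of the failed run, so the same check would need to be applied to the server certificate's own `Not After` value to explain the `certificate has expired` error.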
sorry bud, last message for today I promise (just about to go home): the deployed Reductionist on activeh is using the activeh's backend IP
"url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema"
- that'll never work AFAIK since the public facing IP is needed (as I found out with the last deployment from a month ago)
Perhaps we can talk about your exact network setup tomorrow, but there is a variable called reductionist_host in deployment/group_vars/reductionist that you can set for the frontend host on which HAProxy will expose the reductionist API.
both those two things - great clues, many thanks for taking the time with me, Mark, and my apologies for bombarding you with questions - I realize I am annoying, but, as you can see, am a total n00b at ansibles and its networking, and I want to get this done. I owe you a couple pints for sure! I reckon by tomorrow, given what you pointed me to, we'll get it to successfully deploy (and work) :beers:
I've pushed some changes that should help. There is a fix for the wait task that would be necessary if you modify reductionist_host. There are also some docs changes that provide more info about how your hosts need to be set up, with required ports etc.
Mark, you're a bloody wizard! I took the latest changes you made here, opened port 8081 on active2 and active3, and the thing ran with absolutely no hitch:
PLAY RECAP *****************************************************************************************************************************************
active2 : ok=20 changed=1 unreachable=0 failed=0 skipped=8 rescued=0 ignored=0
active3 : ok=20 changed=1 unreachable=0 failed=0 skipped=8 rescued=0 ignored=0
activeh : ok=46 changed=1 unreachable=0 failed=0 skipped=12 rescued=0 ignored=0
I'm actually gonna poke it about see what's what, but boy am I happy to see no fails and comms and certs not barking at me :grin: :beers:
we got "Hello world!" from a remote client (me laptop) to activeh
(base) valeriu@valeriu-PORTEGE-Z30-C:~$ curl -k https://192.171.169.xxx:8080/.well-known/reductionist-schema
Hello, world!
where 192... is activeh's public-facing IP, and beautiful Hello Worlds from activeh to active2 and active3 via backend IPs and port 8081:
[vpredoi@activeh ~]$ curl -k https://192.168.3.xxx:8081/.well-known/reductionist-schema
Hello, world![vpredoi@activeh ~]$ curl -k https://192.168.3.xxx:8081/.well-known/reductionist-schema
Hello, world!
I am over the moon :smile:
just ran an actual PyActiveStorage test and it runs very well (let's not concern ourselves with the times just yet) - a reductionist process is running on each of the three computers: activeh, active2, and active3 :partying_face:
absolute legend @markgoddard 🍺 Very many thanks for this and your CH (continuous help) over the past couple days. One itty bitty mention I'd put in here for others not to struggle like me is to have the ssh connection from the main node (activeh in my case) not be init-ed with an eval of the ssh-agent, i.e.
eval $(ssh-agent -s)
because that will result in ansibles needing a password to be entered, but it's not asking for it explicitly, and instead it hangs - this is what my lovely colleague @RosalynHatcher sorted me out with, me barely speaking any ssh. Apart from that, I owe you a couple pints, mate 🍺
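A hedged sketch of how one might spot this before running the playbook: probe each host with `ssh -o BatchMode=yes`, which makes ssh fail instead of prompting when it would need a password or passphrase. The helper names and timeout values are assumptions for illustration, not part of the deployment code.

```python
import subprocess
from typing import List


def build_probe_command(host: str, connect_timeout: int = 10) -> List[str]:
    """Build an ssh command that never prompts and just runs `true` remotely."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # fail rather than ask for a password/passphrase
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        "true",
    ]


def ssh_works_non_interactively(host: str, connect_timeout: int = 10) -> bool:
    """Return True if ssh to `host` succeeds without any interactive input."""
    try:
        result = subprocess.run(
            build_probe_command(host, connect_timeout),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=connect_timeout + 5,  # hard stop in case ssh itself stalls
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


# Example usage with the hosts from this thread:
#   for host in ("activeh", "active2", "active3"):
#       print(host, ssh_works_non_interactively(host))
```

If a host reports `False` here but an interactive `ssh host` works, the playbook is likely to stall at "Gathering Facts" in the way described above.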
I've added a note about the SSH agent issue. Will merge once CI goes green again. Thanks for trying out my changes!
:beers: