stackhpc / reductionist-rs

S3 Active Storage server
Apache License 2.0

Fixes and improvements to deployment #90

Closed markgoddard closed 6 months ago

markgoddard commented 6 months ago
valeriupredoi commented 6 months ago

many thanks @markgoddard :beer: I am currently testing this branch with this inventory:

# Example inventory for deployment to a single host (localhost).

# HAProxy load balancer.
# Should contain exactly one host.
[haproxy]
activeh

# Jaeger distributed tracing UI.
# Should contain at most one host.
[jaeger]
activeh

# Minio object storage service (for test & development only).
# Should contain at most one host.
[minio]
activeh

# Prometheus monitoring service.
# Should contain at most one host.
[prometheus]
activeh

# Reductionist servers.
# May contain multiple hosts.
[reductionist]
activeh
active2
active3

# Step Certificate Authority (CA).
# Should contain exactly one host.
[step-ca]
activeh

# Do not edit.
[step:children]
reductionist

# Do not edit.
[docker:children]
haproxy
jaeger
minio
prometheus
reductionist
step-ca

but it's currently hanging at "Gathering Facts":

[vpredoi@activeh ~]$ ansible-playbook -i reductionist-rs/deployment/inventory reductionist-rs/deployment/site.yml
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [Install Docker] ******************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************
ok: [active2]
ok: [active3]

I can send you the debug log but there are no critical issues there, just a lot of fluff, and the ssh connections to active2 and active3 work fine. Any clues?

valeriupredoi commented 6 months ago

also the deps have not been updated, just for logging reasons:

[vpredoi@activeh ~]$ pip install -r reductionist-rs/deployment/requirements.txt
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: ansible-core<2.16 in ./.local/lib/python3.9/site-packages (from -r reductionist-rs/deployment/requirements.txt (line 1)) (2.15.10)
Requirement already satisfied: jinja2>=3.0.0 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (3.1.3)
Requirement already satisfied: PyYAML>=5.1 in /usr/lib64/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (5.4.1)
Requirement already satisfied: cryptography in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (42.0.5)
Requirement already satisfied: packaging in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (24.0)
Requirement already satisfied: resolvelib<1.1.0,>=0.5.3 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (1.0.1)
Requirement already satisfied: importlib-resources<5.1,>=5.0 in ./.local/lib/python3.9/site-packages (from ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (5.0.7)
Requirement already satisfied: MarkupSafe>=2.0 in ./.local/lib/python3.9/site-packages (from jinja2>=3.0.0->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (2.1.5)
Requirement already satisfied: cffi>=1.12 in ./.local/lib/python3.9/site-packages (from cryptography->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (1.16.0)
Requirement already satisfied: pycparser in ./.local/lib/python3.9/site-packages (from cffi>=1.12->cryptography->ansible-core<2.16->-r reductionist-rs/deployment/requirements.txt (line 1)) (2.22)

:beer:

markgoddard commented 6 months ago

@valeriupredoi I'd diff your new inventory against your old one. I expect you don't want to deploy minio.

markgoddard commented 6 months ago

Do you have activeh in /etc/hosts on activeh? Perhaps previously you were referring to it as localhost?

valeriupredoi commented 6 months ago

yessir! Here's my hosts file:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.171.169.xxx activeh
192.168.3.xxx active2
192.168.3.xxx active3

with actual numbers not xxx - no, I was using exactly the same inventory as now, but with three machines listed under HAProxy; /etc/hosts has not changed either

valeriupredoi commented 6 months ago

let me try rerun with my previous configuration, see if that goes through (with the error at the end), so we can isolate the issue

valeriupredoi commented 6 months ago

right! So the thing now hangs with my old setup from yesterday as well :man_facepalming: Need to see what's happened in the meantime

valeriupredoi commented 6 months ago

@markgoddard apols for the tardiness: this is the bit that's hanging:

TASK [Gathering Facts] *****************************************************************************************************************************
task path: /home/vpredoi/reductionist-rs/deployment/site.yml:4
<activeh> ESTABLISH SSH CONNECTION FOR USER: None
<activeh> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o 'ControlPath="/home/vpredoi/.ansible/cp/33982bffcc"' activeh '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''

-> the funny thing is that if I run that command myself, all is fine...am very confused :confounded:

valeriupredoi commented 6 months ago

figured it out thanks to @RosalynHatcher whom I owe a massive :beer: - but am back to the Bootstrap CA issue now:

TASK [Bootstrap CA] ********************************************************************************************************************************
skipping: [activeh]
fatal: [active2]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.016529", "end": "2024-04-17 15:35:25.308223", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 15:35:25.291694", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
fatal: [active3]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014800", "end": "2024-04-17 15:35:25.340400", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 15:35:25.325600", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}

:confounded:

valeriupredoi commented 6 months ago

note that the reductionist verification also fails, for activeh - which is kinda to be expected since the CA massage was skipped there:

TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}

:beer:

markgoddard commented 6 months ago

Your bootstrapping is failing with network errors:

dial tcp 192.168.3.212:9999: connect: no route to host

Perhaps the port is not open in a firewall/security group?
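One way to test this from active2 or active3 is a plain TCP connect to the CA port. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command; port_open is a hypothetical helper name, and the address is the one from the error above. On hosts running firewalld, a blocked port often shows up as "no route to host" rather than as a timeout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to HOST:PORT can be opened.
port_open() {
    local host=$1 port=$2
    # bash opens a TCP socket for /dev/tcp/HOST/PORT redirections;
    # timeout gives up after 3 seconds if nothing answers.
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example (run from active2 or active3):
#   port_open 192.168.3.212 9999 && echo "CA port reachable" || echo "CA port blocked"
```

If the check fails from the backends but succeeds on the CA host itself, the firewall on the CA host is the first place to look.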

markgoddard commented 6 months ago

Also looks like you are not using the changes in this PR because the task to get the fingerprint is skipped on activeh

valeriupredoi commented 6 months ago

I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?

valeriupredoi commented 6 months ago

yes it is indeed skipped - the problem is with active2 and active3 for Bootstrap CA - but even activeh fails at the end with reductionist unreachable - very possible it's how @bnlawrence has configured active2 and 3?

valeriupredoi commented 6 months ago

this is the full relevant section:

TASK [Check whether step has been bootstrapped] ****************************************************************************************************
ok: [active2]
ok: [active3]
ok: [activeh]

TASK [Get CA fingerprint] **************************************************************************************************************************
ok: [activeh]

TASK [Bootstrap CA] ********************************************************************************************************************************
skipping: [activeh]
fatal: [active2]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014868", "end": "2024-04-17 16:14:49.526672", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 16:14:49.511804", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}
fatal: [active3]: FAILED! => {"changed": true, "cmd": ["step", "ca", "bootstrap", "--ca-url", "https://192.168.3.212:9999", "--fingerprint", "01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c", "--install"], "delta": "0:00:00.014415", "end": "2024-04-17 16:14:49.538207", "msg": "non-zero return code", "rc": 1, "start": "2024-04-17 16:14:49.523792", "stderr": "error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host", "stderr_lines": ["error downloading root certificate: client GET https://192.168.3.212:9999/root/01e997c7513a7b35e51f58a653607b092b501d84d2ef998c42ea2e30bbca121c failed: dial tcp 192.168.3.212:9999: connect: no route to host"], "stdout": "", "stdout_lines": []}

TASK [Install root certificate to system] **********************************************************************************************************
skipping: [activeh]

markgoddard commented 6 months ago

I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?

Perhaps you still have the code changes you made previously? That task should not be skipped if using this branch.

markgoddard commented 6 months ago

I am - I am using this branch, bud. Port 9999 doesn't actually exist afaik, does it?

Perhaps you still have the code changes you made previously? That task should not be skipped if using this branch.

ok, my mistake - it's the Get CA fingerprint task that should not be skipped, and it's not.

markgoddard commented 6 months ago

What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535

markgoddard commented 6 months ago

note that the reductionist verification also fails, for activeh - which is kinda to be expected since the CA massage was skipped there:

TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}

🍺

It's failing with certificate expired. Step CA uses short-lived certificates, so probably the renewal isn't working for some reason.

valeriupredoi commented 6 months ago

OK:

[vpredoi@active2 ~]$ sudo firewall-cmd --add-port=9999/tcp
Warning: ALREADY_ENABLED: '9999:tcp' already in 'public'
success

and same for active3, and regarding the current reductionist-rs:

[vpredoi@activeh ~]$ cd reductionist-rs/
[vpredoi@activeh reductionist-rs]$ git status
On branch deployment
Your branch is up to date with 'origin/deployment'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   deployment/inventory

no changes added to commit (use "git add" and/or "git commit -a")

where the inventory file is changed to have it for this deployment. At any rate, I just pulled the latest:

[vpredoi@activeh reductionist-rs]$ git pull origin deployment
From https://github.com/stackhpc/reductionist-rs
 * branch            deployment -> FETCH_HEAD
Already up to date.

Have you pushed any changes? :beer:

valeriupredoi commented 6 months ago

What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535

my bad, I thought the biggest one was 9090 :rofl:

markgoddard commented 6 months ago

Port 9999 needs to be accessible on activeh. Are you able to curl it (using HTTPS)?

markgoddard commented 6 months ago

What do you mean port 9999 doesn't exist? TCP/UDP ports go up to 65,535

my bad, I thought the biggest one was 9090 🤣

That's what they want you to believe

valeriupredoi commented 6 months ago

you, sir, are a life-saver! Completely forgot to open port 9999 on activeh - now, massive progress, but it stumbled right at the end:

TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [activeh]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema"}

PLAY RECAP *****************************************************************************************************************************************
active2                    : ok=20   changed=2    unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
active3                    : ok=20   changed=2    unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
activeh                    : ok=44   changed=1    unreachable=0    failed=1    skipped=12   rescued=0    ignored=0

Should I regenerate the step root_ca.crt? :beers:

valeriupredoi commented 6 months ago

actually, hang on, just opened 8080 too (was not open :man_facepalming: ) now it clearly says cert is expired:

fatal: [activeh]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.212:8080/.well-known/reductionist-schema"}

valeriupredoi commented 6 months ago

OK that certificate is not expired:

[vpredoi@activeh ~]$ sudo step-cli certificate inspect root_ca.crt 
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: hidden 
    Signature Algorithm: ECDSA-SHA256
        Issuer: O=Smallstep,CN=Smallstep Root CA
        Validity
            Not Before: Apr 16 13:04:45 2024 UTC
            Not After : Apr 14 13:04:45 2034 UTC
        Subject: O=Smallstep,CN=Smallstep Root CA
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)

Mark, any clues why the pinger would think it expired?

valeriupredoi commented 6 months ago

sorry bud, last message for today I promise (just about to go home): the deployed Reductionist on activeh is using the activeh's backend IP "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema" - that'll never work AFAIK since the public facing IP is needed (as I found out with the last deployment from a month ago)

markgoddard commented 6 months ago

OK that certificate is not expired:

[vpredoi@activeh ~]$ sudo step-cli certificate inspect root_ca.crt 
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: hidden 
    Signature Algorithm: ECDSA-SHA256
        Issuer: O=Smallstep,CN=Smallstep Root CA
        Validity
            Not Before: Apr 16 13:04:45 2024 UTC
            Not After : Apr 14 13:04:45 2034 UTC
        Subject: O=Smallstep,CN=Smallstep Root CA
        Subject Public Key Info:
            Public Key Algorithm: ECDSA
                Public-Key: (256 bit)

Mark, any clues why the pinger would think it expired?

That is the root CA certificate, not the server certificate(s). Those are generated by the step CLI in ~/.config/reductionist/certs/ and have a much shorter life.
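To check a server certificate rather than the root, its validity window can be read with openssl. This is a minimal sketch: cert_valid_for and cert_expiry are hypothetical helper names, and the file name server.crt in the example is an assumption, since only the directory is named above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the PEM certificate in FILE is still valid
# SECONDS from now (default 0, i.e. "has not expired yet").
cert_valid_for() {
    local file=$1 seconds=${2:-0}
    openssl x509 -in "$file" -noout -checkend "$seconds" >/dev/null
}

# Hypothetical helper: print the notAfter date of the certificate in FILE.
cert_expiry() {
    openssl x509 -in "$1" -noout -enddate
}

# Example, with an assumed file name inside the directory mentioned above:
#   cert_expiry ~/.config/reductionist/certs/server.crt
#   cert_valid_for ~/.config/reductionist/certs/server.crt || echo "expired"
```

Running the same check against root_ca.crt should succeed, which would separate a server-certificate renewal problem from a problem with the CA itself.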

markgoddard commented 6 months ago

sorry bud, last message for today I promise (just about to go home): the deployed Reductionist on activeh is using the activeh's backend IP "url": "https://192.168.3.xxx:8080/.well-known/reductionist-schema" - that'll never work AFAIK since the public facing IP is needed (as I found out with the last deployment from a month ago)

Perhaps we can talk about your exact network setup tomorrow, but there is a variable called reductionist_host in deployment/group_vars/reductionist that you can set for the frontend host on which HAProxy will expose the reductionist API.

valeriupredoi commented 6 months ago

both those two things - great clues, many thanks for taking the time with me, Mark, and my apologies for bombarding you with questions - I realize I am annoying, but, as you can see, am a total n00b at ansibles and its networking, and I want to get this done. I owe you a couple pints for sure! I reckon by tomorrow, given what you pointed me to, we'll get it to successfully deploy (and work) :beers:

markgoddard commented 6 months ago

both those two things - great clues, many thanks for taking the time with me, Mark, and my apologies for bombarding you with questions - I realize I am annoying, but, as you can see, am a total n00b at ansibles and its networking, and I want to get this done. I owe you a couple pints for sure! I reckon by tomorrow, given what you pointed me to, we'll get it to successfully deploy (and work) 🍻

I've pushed some changes that should help. There is a fix for the wait task that would be necessary if you modify reductionist_host. There are also some docs changes that provide more info about how your hosts need to be setup, with required ports etc.

valeriupredoi commented 6 months ago

Mark, you're a bloody wizard! I took the latest changes you made here, opened port 8081 on active2 and active3, and the thing ran with absolutely no hitch:

PLAY RECAP *****************************************************************************************************************************************
active2                    : ok=20   changed=1    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
active3                    : ok=20   changed=1    unreachable=0    failed=0    skipped=8    rescued=0    ignored=0   
activeh                    : ok=46   changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0

I'm actually gonna poke it about see what's what, but boy am I happy to see no fails and comms and certs not barking at me :grin: :beers:
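For anyone repeating this, the ports opened along the way were 9999 and 8080 on activeh and 8081 on active2 and active3. A firewall-cmd --add-port call without --permanent only changes the running configuration, so a sketch of making the rules persistent might look like the following. firewall_open_cmds is a hypothetical helper that only prints the commands, assuming firewalld is the firewall in use, as on these hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the firewalld commands that would open the given
# TCP ports persistently. Nothing is executed here; pipe the output to
# "sudo sh" once it looks right.
firewall_open_cmds() {
    local port
    for port in "$@"; do
        # --permanent writes the rule to the saved configuration.
        echo "firewall-cmd --permanent --add-port=${port}/tcp"
    done
    # --reload applies the saved configuration to the running firewall.
    echo "firewall-cmd --reload"
}

# On activeh (step-ca on 9999, HAProxy on 8080):
#   firewall_open_cmds 9999 8080 | sudo sh
# On active2 and active3 (Reductionist backends on 8081):
#   firewall_open_cmds 8081 | sudo sh
```

Printing first and executing second keeps the firewall change reviewable before it is applied.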

valeriupredoi commented 6 months ago

we got "Hello world!" from a remote client (me laptop) to activeh

(base) valeriu@valeriu-PORTEGE-Z30-C:~$ curl -k https://192.171.169.xxx:8080/.well-known/reductionist-schema
Hello, world!

where 192... is activeh public facing IP, and biutiful Hello Worlds from activeh to the two active2 and active3 via backend IPs and ports 8081:

[vpredoi@activeh ~]$ curl -k https://192.168.3.xxx:8081/.well-known/reductionist-schema
Hello, world!
[vpredoi@activeh ~]$ curl -k https://192.168.3.xxx:8081/.well-known/reductionist-schema
Hello, world!

I am over the moon :smile:
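The manual curl checks above can be wrapped in a small loop so that all endpoints are tested in one go. This is a sketch only: check_reductionist is a hypothetical helper name, and the addresses in the example are the masked ones from this thread and need replacing with real ones.

```shell
#!/usr/bin/env bash
# Hypothetical helper: request the schema endpoint on each HOST:PORT given and
# report whether it answered. -k skips certificate verification (as in the
# commands above), -s silences progress output, and -f turns HTTP error
# responses into a non-zero exit status.
check_reductionist() {
    local target rc=0
    for target in "$@"; do
        if curl -ksf --max-time 10 \
            "https://${target}/.well-known/reductionist-schema" >/dev/null; then
            echo "ok: ${target}"
        else
            echo "FAILED: ${target}"
            rc=1
        fi
    done
    return "$rc"
}

# Example (substitute the real addresses for the masked ones):
#   check_reductionist 192.171.169.xxx:8080 192.168.3.xxx:8081
```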

valeriupredoi commented 6 months ago

just ran an actual PyActiveStorage test and it runs very well (let's not concern ourselves with the timings just yet) - reductionist process running on each of the three computers: activeh, active2, and active3 :partying_face:

markgoddard commented 6 months ago

absolute legend @markgoddard 🍺 Very many thanks for this and your CH (continuous help) over the past couple days. One itty bitty mention I'd put in here for others not to struggle like me is to have the ssh connection from the main node (activeh in my case) not be init-ed with an eval of the ssh-agent, i.e. eval $(ssh-agent -s), because that will result in ansibles needing a password to be entered, but it's not asking for it explicitly, and instead it hangs - this is what my lovely colleague @RosalynHatcher sorted me out with, me barely speaking any ssh. Apart from that, I owe you a couple pints, mate 🍺

I've added a note about the SSH agent issue. Will merge once CI goes green again. Thanks for trying out my changes!
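A pre-flight check along these lines can surface the problem before running the playbook. It is a sketch, not part of the deployment code: check_ssh is a hypothetical helper name, and it assumes the OpenSSH client, whose BatchMode option disables passphrase and password prompts so that a login needing input fails quickly instead of hanging.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: try a non-interactive SSH login to each host.
# -n keeps ssh from reading stdin; BatchMode=yes disables interactive prompts.
check_ssh() {
    local host rc=0
    for host in "$@"; do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            rc=1
        fi
    done
    return "$rc"
}

# Example, using the inventory hosts from this thread:
#   check_ssh activeh active2 active3
```

Any host reported as FAILED here would most likely also stall at "Gathering Facts".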

valeriupredoi commented 6 months ago

:beers: