status-im / infra-shards

Infrastructure for Status fleets
https://github.com/status-im/nim-waku

add boot hosts #1

Closed yakimant closed 9 months ago

yakimant commented 10 months ago

Ansible issue

Not logged into Bitwarden: please run 'bw login', or 'bw unlock' and set the BW_SESSION environment variable first

Solved by:

bw login
bw unlock
export BW_SESSION=SMTH
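
A handier variant (a sketch, assuming a recent Bitwarden CLI): capture the session key directly instead of pasting it by hand.

# --raw makes `bw unlock` print only the session token
export BW_SESSION="$(bw unlock --raw)"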
yakimant commented 10 months ago

Current issues:

yakimant commented 10 months ago

SSH keys setup:

# module.boot.module.do-eu-amsterdam3[0].digitalocean_droplet.host["boot-01.do-ams3.boot.test"] will be created
...
      + ssh_keys             = [
          + "20671731",
        ]

# module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-01.ac-cn-hongkong-c.shards.test"] will be created
...
      + key_name                           = "jakubgs"
yakimant commented 10 months ago

For DO we can do it: https://github.com/status-im/infra-tf-digital-ocean/pull/1

For AC, it looks like only one key is allowed: https://registry.terraform.io/providers/aliyun/alicloud/latest/docs/resources/instance

As an alternative:

The proper solution was to change the Ansible role locally.

yakimant commented 10 months ago

sshguard4 should be configured by sshguard automatically, I guess.

It's failing with the following in the logs:

sshguard: '/usr/lib/x86_64-linux-gnu/sshg-fw-ipset' is not executable

Need to investigate the logic: https://github.com/status-im/infra-role-bootstrap-linux/blob/827e55412990026ad43756bd11f2cb698bdea622/templates/sshguard/sshguard.conf.j2#L3-L7
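
A quick diagnostic (a sketch, assuming the role templates /etc/sshguard/sshguard.conf and the Debian/Ubuntu backend path from the error above):

# which backend the config points at, and whether that binary is actually executable
grep '^BACKEND' /etc/sshguard/sshguard.conf
ls -l /usr/lib/x86_64-linux-gnu/sshg-fw-ipset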

yakimant commented 9 months ago

New issues:

yakimant commented 9 months ago

whiptail is for dialogs; it's probably waiting for some input.

yakimant commented 9 months ago

Looks like needrestart should be set up for non-interactive Ansible:
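
A minimal sketch of one way to do that (assuming a conf.d drop-in is acceptable; the file name 90-ansible.conf is just an example):

# tell needrestart to restart services automatically instead of prompting via whiptail
echo "\$nrconf{restart} = 'a';" | sudo tee /etc/needrestart/conf.d/90-ansible.conf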

yakimant commented 9 months ago

Other issues

yakimant commented 9 months ago

netdata.service is not installed:

# /opt/netdata.gz.run --accept --target /opt/netdata -- --dont-wait --dont-start-it --disable-https --disable-cloud --disable-telemetry
...
 --- Install netdata at system init ---
ERROR: Failed to detect what type of service manager is in use.
/opt/netdata/usr/libexec/netdata/install-service.sh: 640: install_detect_service: not found
 --- Install (but not enable) netdata updater tool ---
cat: /system/netdata-updater.timer: No such file or directory
cat: /system/netdata-updater.service: No such file or directory
Update script is located at /opt/netdata/usr/libexec/netdata/netdata-updater.sh
...

Unfortunately it doesn't fail the installation.

yakimant commented 9 months ago

This code fails to detect systemd: https://github.com/netdata/netdata/blob/92515e41a52344fb1d346df5b54b953cb9de5055/system/install-service.sh.in#L182-L215

One of the issues:

# readlink /proc/1/exe
/usr/lib/systemd/systemd (deleted)

Note the (deleted) at the end. A restart would probably help.
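
A minimal illustration of why that breaks the detection (assuming the basename comparison from the _check_systemd snippet quoted in a later comment):

# after a systemd upgrade the old binary is unlinked, so readlink appends " (deleted)",
# and the basename no longer equals "systemd"
link="/usr/lib/systemd/systemd (deleted)"
[ "$(basename "$link")" = "systemd" ] && echo YES || echo NO   # -> NO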

The second issue is probably in the installer code itself: safe_pidof is not available.

yakimant commented 9 months ago

I don't know why it even looks at this file; here is wakuv2.shards for example:

❯ ansible all -i ansible/inventory/shards -a 'grep jammy-backports /etc/apt/sources.list'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse

❯ ansible all -i ansible/inventory/shards -a 'ls /var/lib/apt/lists/*_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.digitalocean.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/us-central1.gce.archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.cloud.aliyuncs.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64

Looks like they reference the cloud-specific repo mirrors.

jakubgs commented 9 months ago

I don't get your issue with Netdata, the _check_systemd function works fine:

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % head -n20 test.sh
#!/usr/bin/env sh
. ./functions.sh

_check_systemd() {
  pids=''
  p=''
  myns=''
  ns=''

  # if the directory /lib/systemd/system OR /usr/lib/systemd/system (SLES 12.x) does not exist, it is not systemd
  if [ ! -d /lib/systemd/system ] && [ ! -d /usr/lib/systemd/system ]; then
    echo "NO" && return 0
  fi

  # if there is no systemctl command, it is not systemd
  [ -z "$(command -v systemctl 2>/dev/null || true)" ] && echo "NO" && return 0

  # if pid 1 is systemd, it is systemd
  [ "$(basename "$(readlink /proc/1/exe)" 2> /dev/null)" = "systemd" ] && echo "YES" && return 0

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % ./test.sh 
YES

Seems like something else is at play. Maybe just an upgrade will help, not sure tho.

jakubgs commented 9 months ago

Also, it seems like now Netdata has its own ubuntu repository we could use:

So maybe the best thing would be to ditch the shitty installer and just use their repo.

Although one disadvantage of that is that pinning a version is harder. But it does appear they provide multiple versions.
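
For reference, pinning from an apt repo is still possible, just clunkier (a sketch; the version string is purely illustrative):

sudo apt-get install netdata=1.42.2-1
sudo apt-mark hold netdata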

yakimant commented 9 months ago

I don't get your issue with Netdata, the _check_systemd function works fine:

I think they don't import functions.sh and pids=$(safe_pidof systemd 2> /dev/null) silently fails.
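
A small illustration of the silent failure (assuming functions.sh was indeed never sourced):

# the "not found" error goes to stderr, which 2>/dev/null swallows, so pids stays empty
pids=$(safe_pidof systemd 2> /dev/null)
echo "pids='${pids}'"   # prints pids=''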

yakimant commented 9 months ago

fuser /var/cache/fwupd/noneexist
Specified filename /var/cache/fwupd/noneexist does not exist.
echo $?
1

fuser /var/cache/apt/archives/lock
echo $?
1

I added debugging and I can see messages like:
Specified filename /var/lib/apt/lists/lock* does not exist.

So this code will probably not work as intended in cases where the lock file doesn't exist (yet?):
https://github.com/status-im/infra-role-bootstrap-linux/blob/109e157f8d66a61981c23cd6e006d950fe75efe2/raw/tasks/main.yml#L9
It will exit the loop as if no locks are held.
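
A lock-wait sketch that tolerates lock files which don't exist yet (assuming these are the paths the bootstrap role cares about):

for lock in /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend \
            /var/lib/apt/lists/lock /var/cache/apt/archives/lock; do
  # fuser exits 0 only when some process actually holds the file
  while [ -e "$lock" ] && sudo fuser "$lock" >/dev/null 2>&1; do
    echo "waiting for $lock ..."
    sleep 5
  done
done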
yakimant commented 9 months ago

Command to reproduce:

ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Fix: Install on the controller node:

pip install jmespath

Follow-up: add it to the setup documentation, or add a requirements.txt / Poetry project to each fleet repo.

yakimant commented 9 months ago

Alibaba Cloud images:

shards.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.shards:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230208.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_20_04_x64_20G_alibase_20200914.vhd"

Looks OK, although old hosts need to be upgraded to 22.04 at some point.

yakimant commented 9 months ago

More on the netdata installation:

They even have a community-supported playbook: https://learn.netdata.cloud/docs/installing/install-with-a-cicd-provisioning-system/ansible, which runs the kickstart.sh script and will likely install the deb from a repo.

The most popular role from Galaxy: https://github.com/mrlesmithjr/ansible-netdata/ runs netdata-installer.sh

Why are they so obsessed with installer scripts?

yakimant commented 9 months ago

Reproduced:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Workaround: Rerun Ansible without recreating an instance.

yakimant commented 9 months ago

Which is weird, because Ansible runs ssh with -o StrictHostKeyChecking=no, which should not check the fingerprint.
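
Two hedged options for getting past the changed-key warning (shown in the next comment) on recreated instances; the IP is the one from the logs, and overriding ANSIBLE_SSH_ARGS replaces Ansible's default ssh options, so treat this as a sketch:

# drop the stale known_hosts entry for the recreated host
ssh-keygen -R 8.218.174.108
# or skip known_hosts checks entirely for these runs
export ANSIBLE_SSH_ARGS='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'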

yakimant commented 9 months ago

Sometimes I see this issue, which does not fail Ansible:

 TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
 changed: [8.218.174.108] => {
     "changed": true,
     "rc": 0
 }

 STDERR:

 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
 It is also possible that a host key has just been changed.
 The fingerprint for the ED25519 key sent by the remote host is
 SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
 Please contact your system administrator.
 Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
 Offending ED25519 key in /Users/status/.ssh/known_hosts:169
 Agent forwarding is disabled to avoid man-in-the-middle attacks.
 UpdateHostkeys is disabled because the host key is not trusted.
 Shared connection to 8.218.174.108 closed.
yakimant commented 9 months ago

Didn't reproduce the netdata issue with:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-01.ac-cn-hongkong-c.shards.test"]'

Need to double-check by recreating the instance.

# /opt/netdata/usr/libexec/netdata/install-service.sh --show-type
Detected platform: Linux
Detected service managers:
  - systemd: YES
  - openrc: NO
  - lsb: NO
  - initd: NO
  - runit: NO
Would use systemd service management.
# readlink /proc/1/exe
/usr/lib/systemd/systemd

No (deleted) this time, so it is parsed properly.

yakimant commented 9 months ago

Caught the /var/lib/dpkg/lock-frontend issue:

TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       6815 F.... unattended-upgr
│ /var/lib/dpkg/lock-frontend:
│                      root       6815 F.... unattended-upgr
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
│ unattende 6815 root    8uW  REG  252,3        0 1786 /var/lib/dpkg/lock-frontend
│ unattende 6815 root  114uW  REG  252,3        0 1665 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Install SSHGuard package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 6815 (unattended-upgr)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

So it's the /usr/bin/unattended-upgrades process.

yakimant commented 9 months ago

Caught again:

│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       7191 F.... apt-get
│ /var/lib/dpkg/lock-frontend:
│                      root       7191 F.... apt-get
│ /var/cache/apt/archives/lock:
│                      root       7191 F.... apt-get
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
│ apt-get 7191 root    4uW  REG  252,1        0 71761 /var/lib/dpkg/lock-frontend
│ apt-get 7191 root    5uW  REG  252,1        0 71762 /var/lib/dpkg/lock
│ apt-get 7191 root    6uW  REG  252,1        0 69132 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Docker | Install package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 7191 (apt-get)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

It's apt-get this time.

yakimant commented 9 months ago

Reproduce:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.gc-us-central1-a[0].google_compute_instance.host["boot-01.gc-us-central1-a.shards.test"]' -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Maybe we need to wait a bit for the instance to be fully available via SSH.
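
A sketch of that idea (host name taken from the command above, admin user as on the other hosts): poll SSH instead of relying on a fixed sleep.

until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no admin@boot-01.gc-us-central1-a.shards.test true; do
  sleep 5
done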

Workaround: rerunning Ansible helps.

jakubgs commented 9 months ago

I think you are really overthinking this. The sleep in the first task in bootstrap is there for a reason.

I think you should stop trying to fix alibaba nonsense locking for now. It's probably just because their bootstrap doesn't finish, because the instance you're using is too slow.

jakubgs commented 9 months ago

Also, I would recommend keeping research like this in the issue, and not in the PR.

yakimant commented 9 months ago

Yeah, I stopped investigating the non-blocking issues as we agreed yesterday. I just post whatever issues I encounter and rerun Ansible, which has helped so far.

yakimant commented 9 months ago

Reproduced on the 2nd run after the instance was created:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Rerun didn't help.

Need to add keys to the admin user: https://github.com/status-im/infra-role-bootstrap-linux/pull/34

yakimant commented 9 months ago

I will create proper Issues afterwards as a follow-up.

yakimant commented 9 months ago

Will revert to 75fa7e483cacccb482c99afddc7de3c25fb8a1fc in requirements for now

yakimant commented 9 months ago

Probably because no other nodes are started; will set up the others now.

yakimant commented 9 months ago

[DEBUG] RPC Call URL: http://localhost:8545
[DEBUG] RPC Call Payload: {'method': 'post_waku_v2_admin_v1_peers', 'params': [['/dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31']], 'jsonrpc': '2.0', 'id': 0}
Traceback (most recent call last):
  File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
    main()
  File "/usr/local/bin/connect_waku_peers.py", line 142, in main
    raise Exception('RPC Error: %s' % rval['error'])
Exception: RPC Error: {'code': -32000, 'message': 'post_waku_v2_admin_v1_peers raised an exception', 'data': 'Failed to connect to peer at index: 0 /dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31'}



Disabled the role, but it will probably fire up later.
yakimant commented 9 months ago

The error appears to be in '/Users/status/.ansible/roles/open-ports/tasks/main.yml': line 20, column 5, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

loop_control:
  label:
  ^ here



https://github.com/status-im/infra-role-open-ports/blob/24dc30dbdf85e6758cb6924074b2f7a0f4541524/tasks/main.yml#L19-L23

Removed loop_control as a workaround
yakimant commented 9 months ago

TASK [nim-waku : Save generate node key to file] *****
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "not key_file.stat.exists",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Load existing node key from file] ***
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "key_generation.skipped is not defined and nim_waku_node_key is not defined\n",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Extract the node key from file] *****
fatal: [boot-01.do-ams3.shards.test]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'content'. 'dict object' has no attribute 'content'

The error appears to be in '/Users/status/.ansible/roles/nim-waku/tasks/nodekey.yml': line 39, column 3, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

The load is wrongly skipped because generation was skipped. Maybe it should be the opposite: if generation is skipped, load from the file.

https://github.com/status-im/infra-role-nim-waku/blob/75fa7e483cacccb482c99afddc7de3c25fb8a1fc/tasks/nodekey.yml#L31-L37

yakimant commented 9 months ago

to debug / catch the lock issues, I was adding:

- name: check locks (fuser)
  raw: |
    sudo fuser --verbose /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

- name: check locks (lsof)
  raw: |
    sudo lsof /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

before bootstrap/raw and apt commands.

Also, I think apt has the ability to wait for locks, but not the package. Will check in the related issue.
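
For reference, a hedged example of apt's own lock waiting (apt >= 1.9.11; 120 seconds and sshguard are just illustrative, matching the failing task above):

sudo apt-get -o DPkg::Lock::Timeout=120 install -y sshguard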

yakimant commented 9 months ago

Potential temporary workaround for netdata: copy /opt/netdata/system/netdata.service to /lib/systemd/system/netdata.service
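
As commands (a sketch of the workaround above; assumes systemd and that enabling the unit right away is desired):

sudo cp /opt/netdata/system/netdata.service /lib/systemd/system/netdata.service
sudo systemctl daemon-reload
sudo systemctl enable --now netdata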

yakimant commented 9 months ago

This PR is closed in favour of these 3, as requested by @jakubgs:

The following issues were discovered during the work on this PR: