status-im / infra-shards

Infrastructure for Status fleets

add boot hosts #1

Closed yakimant closed 9 months ago

yakimant commented 10 months ago

Ansible issue

Not logged into Bitwarden: please run 'bw login', or 'bw unlock' and set the BW_SESSION environment variable first

Solved by:

bw login
bw unlock

yakimant commented 10 months ago

Current issues:

yakimant commented 10 months ago

SSH keys setup:

#[0][""] will be created
      + ssh_keys             = [
          + "20671731",

#[0][""] will be created
      + key_name                           = "jakubgs"

yakimant commented 10 months ago

For DO we can do it:

For AC, it looks like only one key is allowed:

As an alternative:

The proper solution was to change the Ansible role locally.

yakimant commented 10 months ago

sshguard4 should be configured by sshguard automatically, I guess.

It's failing with the following in the logs:

sshguard: '/usr/lib/x86_64-linux-gnu/sshg-fw-ipset' is not executable

Need to investigate the logic:

yakimant commented 9 months ago

New issues:

yakimant commented 9 months ago

whiptail is for dialogs, so it's probably waiting for some input.

yakimant commented 9 months ago

Looks like needrestart should be set up for non-interactive Ansible runs:
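
One possible way to do that, sketched under the assumption that the hosts use the stock Debian/Ubuntu needrestart package: set its restart mode to automatic with a drop-in under /etc/needrestart/conf.d. The helper name and the file name 50-ansible.conf are made up for illustration and are not taken from the role.

```shell
# Write a needrestart drop-in that restarts services automatically
# instead of opening a whiptail dialog.
# $1: config directory (normally /etc/needrestart/conf.d)
write_needrestart_dropin() {
  conf_dir=$1
  mkdir -p "$conf_dir" || return 1
  # 'a' = restart automatically, 'i' = ask interactively, 'l' = only list.
  printf '%s\n' "\$nrconf{restart} = 'a';" > "$conf_dir/50-ansible.conf"
}

# Example (as root): write_needrestart_dropin /etc/needrestart/conf.d
```

For a one-off run, exporting NEEDRESTART_MODE=a together with DEBIAN_FRONTEND=noninteractive for the apt command should have a similar effect without touching the config.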

yakimant commented 9 months ago

Other issues

yakimant commented 9 months ago

netdata.service is not installed:

# /opt/ --accept --target /opt/netdata -- --dont-wait --dont-start-it --disable-https --disable-cloud --disable-telemetry
 --- Install netdata at system init ---
ERROR: Failed to detect what type of service manager is in use.
/opt/netdata/usr/libexec/netdata/ 640: install_detect_service: not found
 --- Install (but not enable) netdata updater tool ---
cat: /system/netdata-updater.timer: No such file or directory
cat: /system/netdata-updater.service: No such file or directory
Update script is located at /opt/netdata/usr/libexec/netdata/

Unfortunately, it doesn't fail the installation.

yakimant commented 9 months ago

This code fails to detect systemd:

One of the issues:

# readlink /proc/1/exe
/usr/lib/systemd/systemd (deleted)

Note the (deleted) at the end. A restart should probably help.

The second is probably in the installer code itself: safe_pidof is not available.
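
A small sketch of why the (deleted) suffix matters: basename keeps the suffix, so a string comparison against "systemd" fails. The helper exe_basename below is hypothetical (it is not part of the Netdata installer) and only shows one way to tolerate the suffix.

```shell
# Return the basename of a /proc/<pid>/exe link target, ignoring the
# " (deleted)" suffix that shows up when the binary was replaced on disk.
exe_basename() {
  target=$1
  target=${target%" (deleted)"}
  basename "$target"
}

# The naive form keeps the suffix; the tolerant form strips it first.
naive=$(basename "/usr/lib/systemd/systemd (deleted)")
fixed=$(exe_basename "/usr/lib/systemd/systemd (deleted)")
echo "naive: $naive"
echo "fixed: $fixed"
```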

yakimant commented 9 months ago

I don't know why it even looks at this file. Here is wakuv2.shards, for example:

❯ ansible all -i ansible/inventory/shards -a 'grep jammy-backports /etc/apt/sources.list'
| CHANGED | rc=0 >>
deb jammy-backports main restricted universe multiverse
# deb-src jammy-backports main restricted universe multiverse
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
deb jammy-backports main restricted universe multiverse
# deb-src jammy-backports main restricted universe multiverse
| CHANGED | rc=0 >>
deb jammy-backports main restricted universe multiverse
deb-src jammy-backports main restricted universe multiverse

❯ ansible all -i ansible/inventory/shards -a 'ls /var/lib/apt/lists/*_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64'
| CHANGED | rc=0 >>
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/us-central1.gce.archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
| CHANGED | rc=0 >>

Looks like they reference the cloud-specific repo mirrors.

jakubgs commented 9 months ago

I don't get your issue with Netdata, the _check_systemd function works fine:

% head -n20
#!/usr/bin/env sh
. ./

_check_systemd() {

  # if the directory /lib/systemd/system OR /usr/lib/systemd/system (SLES 12.x) does not exit, it is not systemd
  if [ ! -d /lib/systemd/system ] && [ ! -d /usr/lib/systemd/system ]; then
    echo "NO" && return 0
  fi

  # if there is no systemctl command, it is not systemd
  [ -z "$(command -v systemctl 2>/dev/null || true)" ] && echo "NO" && return 0

  # if pid 1 is systemd, it is systemd
  [ "$(basename "$(readlink /proc/1/exe)" 2> /dev/null)" = "systemd" ] && echo "YES" && return 0

% ./

Seems like something else is at play. Maybe just an upgrade will help, not sure tho.

jakubgs commented 9 months ago

Also, it seems like Netdata now has its own Ubuntu repository we could use:

So maybe the best thing would be to ditch the shitty installer and just use their repo.

Although one disadvantage of that is that pinning a version is harder. But it does appear they provide multiple versions.
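
If the repo route is taken, version pinning could be done with an APT preferences file. This is only a sketch: the helper name is made up, and the version pattern in the example is a placeholder, not a version known to be published in their repo.

```shell
# Print an APT preferences stanza that pins a package to a version pattern.
# $1: package name, $2: version glob (placeholder value in the example below)
apt_pin_stanza() {
  # A priority above 1000 makes apt keep this version even if it means a downgrade.
  printf 'Package: %s\nPin: version %s\nPin-Priority: 1001\n' "$1" "$2"
}

# Example (as root):
#   apt_pin_stanza netdata '1.40.*' > /etc/apt/preferences.d/netdata
```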

yakimant commented 9 months ago

I don't get your issue with Netdata, the _check_systemd function works fine:

I think they don't import and pids=$(safe_pidof systemd 2> /dev/null) silently fails.
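
A minimal reproduction of that failure mode, assuming the helper is simply undefined at call time. The name safe_pidof_missing is deliberately fake and stands in for a function whose defining file was never sourced.

```shell
# safe_pidof_missing is intentionally not defined anywhere.
# With stderr discarded, the only trace of the failure is the exit status.
status=0
pids=$(safe_pidof_missing systemd 2> /dev/null) || status=$?
echo "pids='${pids}' status=${status}"
```

The variable ends up empty and the "command not found" message never reaches the log, so any check that only looks at the value of pids cannot tell "no systemd process" from "helper missing".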

yakimant commented 9 months ago

fuser /var/cache/fwupd/noneexist

Specified filename /var/cache/fwupd/noneexist does not exist.

echo $?


fuser /var/cache/apt/archives/lock

echo $?


I added debug and I can see some messages like:
Specified filename /var/lib/apt/lists/lock* does not exist.

So this code will probably not work as intended in some cases, when a lock file doesn't exist (yet?):
It will exit the loop as if no locks are held.
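
One way the loop could tell "no lock file yet" apart from "lock file exists but nobody holds it" is to filter out missing paths before calling fuser. This is a sketch with made-up function names, not the code from the role; the return code 2 and the 5-second poll are arbitrary choices.

```shell
# Print only the candidate lock paths that exist, one per line.
existing_locks() {
  for f in "$@"; do
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Wait until no process holds any of the existing lock files.
# Returns 2 when none of the lock files exist yet, so the caller can decide.
wait_for_locks() {
  while :; do
    locks=$(existing_locks "$@")
    [ -z "$locks" ] && { echo "no lock files present yet" >&2; return 2; }
    # fuser exits 0 when at least one of the files is in use.
    # $locks is split on whitespace on purpose (lock paths contain no spaces).
    fuser $locks > /dev/null 2>&1 || return 0
    sleep 5
  done
}

# Example: wait_for_locks /var/lib/dpkg/lock-frontend /var/cache/apt/archives/lock
```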

yakimant commented 9 months ago

Command to reproduce:

ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='[0][""]' -target='[0][""]'

Fix: Install on the controller node:

pip install jmespath

Follow-up: Add it to the setup documentation, or to requirements.txt / the Poetry project in each fleet repo.
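
Until that is documented, a pre-flight check on the controller could catch the missing module early. A rough sketch with a made-up helper name; it assumes the python3 on PATH is the interpreter Ansible runs with, which may not hold in every setup.

```shell
# Succeed only if the given Python module can be imported by python3.
have_py_module() {
  python3 -c 'import importlib, sys; importlib.import_module(sys.argv[1])' "$1" 2> /dev/null
}

if ! have_py_module jmespath; then
  echo "jmespath is missing on the controller; try: pip install jmespath" >&2
fi
```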

yakimant commented 9 months ago

Alibaba Cloud images:


❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"


❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"


❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230208.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_20_04_x64_20G_alibase_20200914.vhd"

Looks OK, although old hosts need to be upgraded to 22.04 at some point.

yakimant commented 9 months ago

More on the netdata installation:

They even have a community-supported playbook: which runs the script, which will likely install a deb from a repo.

The most popular role from Galaxy: runs

Why are they so obsessed with installer scripts?

yakimant commented 9 months ago


ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='[0][""]' -target='[0][""]'

Workaround: Rerun Ansible without recreating the instance.

yakimant commented 9 months ago

Which is weird, because Ansible runs ssh with -o StrictHostKeyChecking=no, which should not check the fingerprint.

yakimant commented 9 months ago

Sometimes I see this issue, which doesn't fail the Ansible run:

TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
changed: [] => {
    "changed": true,
    "rc": 0

Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
Please contact your system administrator.
Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /Users/status/.ssh/known_hosts:169
Agent forwarding is disabled to avoid man-in-the-middle attacks.
UpdateHostkeys is disabled because the host key is not trusted.
Shared connection to closed.
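
Since a recreated instance reuses the hostname with a new host key, the stale known_hosts entry has to go. `ssh-keygen -R <host>` is the usual tool for that; the sketch below is a dependency-free alternative that drops the line number reported in the warning. The function name is made up for illustration.

```shell
# Delete line N from a known_hosts-style file, keeping a .bak copy.
# $1: file path, $2: line number from the "Offending ... key in <file>:<N>" message
drop_known_hosts_line() {
  file=$1 line=$2
  cp "$file" "$file.bak" &&
    awk -v n="$line" 'NR != n' "$file.bak" > "$file"
}

# Example: drop_known_hosts_line "$HOME/.ssh/known_hosts" 169
```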

yakimant commented 9 months ago

Didn't reproduce the netdata issue with:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='[0][""]'

Need to double-check by recreating the instance.

# /opt/netdata/usr/libexec/netdata/ --show-type
Detected platform: Linux
Detected service managers:
  - systemd: YES
  - openrc: NO
  - lsb: NO
  - initd: NO
  - runit: NO
Would use systemd service management.
# readlink /proc/1/exe

No (deleted), so parsed properly.

yakimant commented 9 months ago

Caught the /var/lib/dpkg/lock-frontend issue:

│ TASK [infra-role-bootstrap-linux : check locks] ********************************
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       6815 F.... unattended-upgr
│ /var/lib/dpkg/lock-frontend:
│                      root       6815 F.... unattended-upgr
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
│ unattende 6815 root    8uW  REG  252,3        0 1786 /var/lib/dpkg/lock-frontend
│ unattende 6815 root  114uW  REG  252,3        0 1665 /var/cache/apt/archives/lock
│ TASK [infra-role-bootstrap-linux : Install SSHGuard package] *******************
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 6815 (unattended-upgr)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

So it's the /usr/bin/unattended-upgrades process.

yakimant commented 9 months ago

Caught again:

│ TASK [infra-role-bootstrap-linux : check locks] ********************************
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       7191 F.... apt-get
│ /var/lib/dpkg/lock-frontend:
│                      root       7191 F.... apt-get
│ /var/cache/apt/archives/lock:
│                      root       7191 F.... apt-get
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
│ apt-get 7191 root    4uW  REG  252,1        0 71761 /var/lib/dpkg/lock-frontend
│ apt-get 7191 root    5uW  REG  252,1        0 71762 /var/lib/dpkg/lock
│ apt-get 7191 root    6uW  REG  252,1        0 69132 /var/cache/apt/archives/lock
│ TASK [infra-role-bootstrap-linux : Docker | Install package] *******************
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 7191 (apt-get)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

It's apt-get this time.

yakimant commented 9 months ago


ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.gc-us-central1-a[0]["boot-01.gc-us-central1-a.shards.test"]' -target='module.boot.module.gc-us-central1-a[0]["boot-01.gc-us-central1-a.shards.test"]'

Maybe we need to wait a bit for the instance to become fully available via SSH.

Workaround: rerunning Ansible helps.
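
If waiting turns out to be the fix, a generic retry wrapper is one way to probe SSH before Ansible starts. A sketch only: the function name, attempt count and delay are arbitrary, and the host in the usage line is a placeholder.

```shell
# Run a command until it succeeds, up to $1 attempts, sleeping $2 seconds
# between attempts. Returns the last exit status on failure.
retry() {
  attempts=$1 delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    rc=$?
    [ "$n" -ge "$attempts" ] && return "$rc"
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: retry 30 5 ssh -o ConnectTimeout=5 admin@HOST true
```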

jakubgs commented 9 months ago

I think you are really overthinking this. The sleep in the first task in bootstrap is there for a reason.

I think you should stop trying to fix the Alibaba locking nonsense for now. It's probably just because their bootstrap doesn't finish, since the instance you're using is too slow.

jakubgs commented 9 months ago

Also, I would recommend keeping research like this in the issue, and not in the PR.

yakimant commented 9 months ago

Yeah, I stopped investigating the non-blocking issues, as we agreed yesterday. I just post whatever issues I encounter and rerun Ansible, which has helped so far.

yakimant commented 9 months ago

Reproduced on the 2nd run after the instance was created:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.gc-us-central1-a[0]["boot-01.gc-us-central1-a.shards.test"]'

Rerun didn't help.

Need to add keys to the admin user:

yakimant commented 9 months ago

I will create proper Issues afterwards as a follow-up.

yakimant commented 9 months ago

Will revert to 75fa7e483cacccb482c99afddc7de3c25fb8a1fc in requirements for now

yakimant commented 9 months ago

Probably because no other nodes are started; will set up the others now.

yakimant commented 9 months ago

[DEBUG] RPC Call URL: http://localhost:8545
[DEBUG] RPC Call Payload: {'method': 'post_waku_v2_admin_v1_peers', 'params': [['/dns4/', '/dns4/', '/dns4/', '/dns4/', '/dns4/']], 'jsonrpc': '2.0', 'id': 0}
Traceback (most recent call last):
  File "/usr/local/bin/", line 154, in <module>
    main()
  File "/usr/local/bin/", line 142, in main
    raise Exception('RPC Error: %s' % rval['error'])
Exception: RPC Error: {'code': -32000, 'message': 'post_waku_v2_admin_v1_peers raised an exception', 'data': 'Failed to connect to peer at index: 0 /dns4/'}

Disabled the role, but it will probably fire up later.

yakimant commented 9 months ago

The error appears to be in '/Users/status/.ansible/roles/open-ports/tasks/main.yml': line 20, column 5, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  loop_control:
    label:
    ^ here

Removed loop_control as a workaround.

yakimant commented 9 months ago

TASK [nim-waku : Save generate node key to file] *****
skipping: [] => {
    "changed": false,
    "false_condition": "not key_file.stat.exists",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Load existing node key from file] ***
skipping: [] => {
    "changed": false,
    "false_condition": "key_generation.skipped is not defined and nim_waku_node_key is not defined\n",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Extract the node key from file] *****
fatal: []: FAILED! => {}


The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'content'. 'dict object' has no attribute 'content'

The error appears to be in '/Users/status/.ansible/roles/nim-waku/tasks/nodekey.yml': line 39, column 3, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

Load is wrongly skipped because generation is skipped. Maybe it should be the opposite: if generation is skipped, load from file.

yakimant commented 9 months ago

To debug and catch the lock issues, I was adding:

- name: check locks (fuser)
  raw: |
    sudo fuser --verbose /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

- name: check locks (lsof)
  raw: |
    sudo lsof /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

before bootstrap/raw and apt commands.

Also, I think apt has the ability to wait for locks, but not the package. Will check in the related issue.
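
On the apt side, recent apt versions accept a DPkg::Lock::Timeout option that makes apt-get wait for the dpkg frontend lock instead of failing right away. A sketch of a wrapper; the function name and the APT_GET override variable are made up (the override only exists so the command line can be inspected without running apt).

```shell
# Run apt-get so that it waits up to $1 seconds for the dpkg lock.
# APT_GET is a hypothetical override, defaulting to the real apt-get.
apt_get_wait() {
  timeout=$1
  shift
  "${APT_GET:-apt-get}" -o "DPkg::Lock::Timeout=${timeout}" "$@"
}

# Example (as root): apt_get_wait 120 install -y sshguard
```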

yakimant commented 9 months ago

Potential temporary workaround for netdata: copy /opt/netdata/system/netdata.service to /lib/systemd/system/netdata.service
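
That workaround could look roughly like the sketch below. The helper name is made up, and it has only been reasoned about, not run against a real fleet host; systemd still needs a daemon-reload afterwards to pick up the new unit.

```shell
# Copy a unit file into a systemd unit directory with standard permissions.
# $1: source unit file, $2: target unit directory
install_unit() {
  src=$1 unit_dir=$2
  [ -f "$src" ] || { echo "missing unit file: $src" >&2; return 1; }
  install -m 0644 "$src" "$unit_dir/"
}

# Example (as root):
#   install_unit /opt/netdata/system/netdata.service /lib/systemd/system &&
#     systemctl daemon-reload
```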

yakimant commented 9 months ago

This PR is closed in favour of these 3, as requested by @jakubgs:

The following issues were discovered during the work on this PR: