yakimant commented 1 year ago

Ansible issue

Not logged into Bitwarden: please run 'bw login', or 'bw unlock' and set the BW_SESSION environment variable first

Solved by:

bw login
bw unlock
export BW_SESSION=SMTH
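
A minimal sketch of the same steps as a reusable helper, assuming the Bitwarden CLI is installed and `bw login` has already been run once. The function name `ensure_bw_session` is hypothetical; `bw unlock --raw` prints only the session key, so it can be captured directly.

```shell
# Hypothetical helper: make sure BW_SESSION is exported before running Ansible.
ensure_bw_session() {
  # Reuse a session key that is already set in this shell.
  if [ -n "${BW_SESSION:-}" ]; then
    export BW_SESSION
    return 0
  fi
  # `bw unlock --raw` prompts for the master password and prints the key.
  BW_SESSION="$(bw unlock --raw)" || return 1
  export BW_SESSION
}
```

Call `ensure_bw_session` in the shell that later runs `terraform apply` or `ansible-playbook`, so the bitwarden lookup plugin inherits the variable.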

yakimant commented 1 year ago

Current issues:

[x] SSH access issues for DO and AC, but not GC
[x] iptables issue: Set sshguard4 doesn't exist
- not an issue after ansible/requirements.yml update

yakimant commented 1 year ago

SSH keys setup:

# module.boot.module.do-eu-amsterdam3[0].digitalocean_droplet.host["boot-01.do-ams3.boot.test"] will be created
...
      + ssh_keys             = [
          + "20671731",
        ]

# module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-01.ac-cn-hongkong-c.shards.test"] will be created
...
      + key_name                           = "jakubgs"

yakimant commented 1 year ago

For DO we can do it: https://github.com/status-im/infra-tf-digital-ocean/pull/1

For AC, it looks like only one key is allowed: https://registry.terraform.io/providers/aliyun/alicloud/latest/docs/resources/instance

As an alternative:

- ~~we could configure it as a variable in infra-tf-multi-provider, so each devops overrides it locally~~ this will not work, the key pair is set once and forever
- have a shared SSH key (security risks)
- run Ansible only on CI with a shared key

Proper solution was to change ansible role locally.

yakimant commented 1 year ago

sshguard4 should be configured by sshguard automatically, I guess.

It's failing with the following in the logs:

sshguard: '/usr/lib/x86_64-linux-gnu/sshg-fw-ipset' is not executable

Need to investigate the logic: https://github.com/status-im/infra-role-bootstrap-linux/blob/827e55412990026ad43756bd11f2cb698bdea622/templates/sshguard/sshguard.conf.j2#L3-L7

yakimant commented 1 year ago

https://github.com/status-im/infra-tf-digital-ocean/pull/1 is merged
For AC, the SSH keys issue is fixed by editing .terraform/modules/boot.ac-cn-hongkong-c/variables.tf

yakimant commented 1 year ago

New issues:

[x] AC: infra-role-bootstrap-linux : Make sure essential pip packages are installed TAGS: [role::bootstrap:packages] fails with AttributeError: cython_sources
- https://github.com/yaml/pyyaml/issues/601
- Option 1: Downgrade PyYAML to 5.3.1 (from 5.4.1)
- Option 2: Official workaround: pip install "Cython<3.0" && pip install "PyYAML==5.4.1" --no-build-isolation: https://github.com/status-im/infra-role-bootstrap-linux/pull/33
- Option 3: Investigate why it doesn't reproduce on GC and DO

[x] DO: infra-role-bootstrap-linux/raw : Install mandatory packages runs endlessly

Stuck on "Scanning processes" phase of apt install:

apt -y install python3-minimal acl
`-apt -y install python3-minimal acl
`-sh -c test -x /usr/lib/needrestart/apt-pinvoke && /usr/lib/needrestart/apt-pinvoke || true
    `-frontend -w /usr/share/debconf/frontend /usr/sbin/needrestart
        |-needrestart /usr/sbin/needrestart
        `-whiptail --backtitle Package configuration --title Daemons using outdated libraries --output-fd 11 --separate-output --checklist \012\012Which services should be restarted? 12 47 2 -- packagekit.service  on unattended-upgrades.service

https://github.com/status-im/infra-role-bootstrap-linux/pull/32

yakimant commented 1 year ago

whiptail is for dialogs, probably it's waiting for some input

yakimant commented 1 year ago

Looks like needrestart should be set up for non-interactive Ansible:
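
One way to keep the whiptail dialog from appearing, sketched under assumptions: needrestart reads the `NEEDRESTART_MODE` environment variable (`a` restarts services automatically, `l` only lists them), and debconf honours `DEBIAN_FRONTEND=noninteractive`. The wrapper name `run_noninteractive` is hypothetical, and this is not necessarily what the linked PR does.

```shell
# Hypothetical wrapper: run a command with package prompts disabled.
run_noninteractive() {
  # DEBIAN_FRONTEND=noninteractive stops debconf from opening dialogs;
  # NEEDRESTART_MODE=a tells needrestart to restart services without asking.
  env DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a "$@"
}
```

For example, `run_noninteractive apt-get -y install python3-minimal acl` would run the same install as above without waiting for input.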

yakimant commented 1 year ago

Other issues

[ ] infra-role-bootstrap-linux : Docker | Install package failing with:
- ```
'/usr/bin/apt-get -y -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"       install 'docker-ce=5:24.0.6-1~ubuntu.22.04~jammy' 'docker-compose=1.29.2-1'' failed: E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 13119 (apt-get)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
```
- rerun helps
- also infra-role-wireguard : Install WireGuard packages on AC
- Links:
  - https://github.com/ansible/ansible/issues/25414
  - https://joelvasallo.com/packer-ansible-unable-to-acquire-dpkg-lock-c7eb5863127d
  - https://github.com/ansible/ansible/issues/51663
- Possible solutions:
  - lock_timeout apt option (since 2.12)
  - checking lock with lsof or fuser
  - pgrep for unattended-upgrades or other processes (bad)
  - Official workaround before lock_timeout:
    ```
    register: apt_action
    retries: 100
    until: apt_action is success or ('Failed to lock apt for exclusive operation' not in apt_action.msg and '/var/lib/dpkg/lock' not in apt_action.msg)
    ```
  - wait for system updates to finish: systemd-run --property="After=apt-daily.service apt-daily-upgrade.service" --wait /bin/true
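
For commands run outside the Ansible apt module (for example in a raw task), the retry idea from the list above can be sketched as a small shell helper. The function name `retry` and its argument order are assumptions made for illustration; it retries on any failure, not only on lock errors.

```shell
# retry MAX DELAY CMD...: rerun CMD until it succeeds or MAX attempts are used.
retry() {
  max="$1"
  delay="$2"
  shift 2
  attempt=1
  while ! "$@"; do
    # Give up once the attempt budget is exhausted.
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Usage could look like `retry 30 10 apt-get -y install acl`. Recent apt versions can also wait for the lock themselves via `-o DPkg::Lock::Timeout=120`, which may be simpler where it is available.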

[x] infra-role-bootstrap-linux : Consul | Create consul config directory fails with

AnsibleError: An unhandled exception occurred while templating '{{lookup("bitwarden", "consul/cluster", field="encryption-key")}}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'bitwarden'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0). Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0)

bw unlock and export helped

[ ] infra-role-bootstrap-linux/raw : Install mandatory packages returned on DO:

E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_multiverse_cnf_Commands-amd64 - open (2: No such file or directory)

│ E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64 - open (2: No such file or directory)
...
│ E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'
│ E: Sub-process returned an error code

~~looks like a corrupted cache, probably rm -rf /var/lib/apt/lists/* && apt update should do the trick~~
retry helps

[ ] infra-role-bootstrap-linux : Netdata | Restart service:
- ```
Could not find the requested service netdata: host
```

yakimant commented 1 year ago

netdata.service is not installed:

# /opt/netdata.gz.run --accept --target /opt/netdata -- --dont-wait --dont-start-it --disable-https --disable-cloud --disable-telemetry
...
 --- Install netdata at system init ---
ERROR: Failed to detect what type of service manager is in use.
/opt/netdata/usr/libexec/netdata/install-service.sh: 640: install_detect_service: not found
 --- Install (but not enable) netdata updater tool ---
cat: /system/netdata-updater.timer: No such file or directory
cat: /system/netdata-updater.service: No such file or directory
Update script is located at /opt/netdata/usr/libexec/netdata/netdata-updater.sh
...

Unfortunately it doesn't fail the installation.

yakimant commented 1 year ago

This code fails to detect systemd: https://github.com/netdata/netdata/blob/92515e41a52344fb1d346df5b54b953cb9de5055/system/install-service.sh.in#L182-L215

One of the issues:

# readlink /proc/1/exe
/usr/lib/systemd/systemd (deleted)

Note the (deleted) at the end. A restart should probably help.

The second is probably in the installer code itself: safe_pidof is not available.
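
To illustrate the first problem: `basename` of the readlink output above yields `systemd (deleted)`, which does not equal `systemd`. A minimal sketch of a more tolerant check follows; the helper name `exe_name` is hypothetical and this is not the netdata installer's code.

```shell
# Hypothetical helper: binary name from a /proc/<pid>/exe link target.
exe_name() {
  # The kernel appends " (deleted)" when the binary was replaced on disk,
  # for example by a package upgrade; strip it before taking the basename.
  target="${1% (deleted)}"
  basename "$target"
}
```

`exe_name "$(readlink /proc/1/exe)"` would then print `systemd` both before and after an upgrade that replaced the binary.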

yakimant commented 1 year ago

I don't know why it even looks at this file, here is wakuv2.shards for example:

❯ ansible all -i ansible/inventory/shards -a 'grep jammy-backports /etc/apt/sources.list'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse

❯ ansible all -i ansible/inventory/shards -a 'ls /var/lib/apt/lists/*_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.digitalocean.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/us-central1.gce.archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.cloud.aliyuncs.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64

Looks like they reference the cloud-specific repo mirrors.

jakubgs commented 1 year ago

I don't get your issue with Netdata, the _check_systemd function works fine:

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % head -n20 test.sh
#!/usr/bin/env sh
. ./functions.sh

_check_systemd() {
  pids=''
  p=''
  myns=''
  ns=''

  # if the directory /lib/systemd/system OR /usr/lib/systemd/system (SLES 12.x) does not exit, it is not systemd
  if [ ! -d /lib/systemd/system ] && [ ! -d /usr/lib/systemd/system ]; then
    echo "NO" && return 0
  fi

  # if there is no systemctl command, it is not systemd
  [ -z "$(command -v systemctl 2>/dev/null || true)" ] && echo "NO" && return 0

  # if pid 1 is systemd, it is systemd
  [ "$(basename "$(readlink /proc/1/exe)" 2> /dev/null)" = "systemd" ] && echo "YES" && return 0

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % ./test.sh 
YES

Seems like something else is at play. Maybe just an upgrade will help, not sure tho.

jakubgs commented 1 year ago

Also, it seems like now Netdata has its own ubuntu repository we could use:

So maybe the best thing would be to ditch the shitty installer and just use their repo.

Although one disadvantage of that is that pinning a version is harder. But it does appear they provide multiple versions.

yakimant commented 1 year ago

> I don't get your issue with Netdata, the _check_systemd function works fine:

I think they don't import functions.sh and pids=$(safe_pidof systemd 2> /dev/null) silently fails.

yakimant commented 1 year ago

[ ] Another minor issue: fuser exits with 1 if none of the files is open by any process, or if one of the files doesn't exist:
```
# fuser /var/cache/fwupd/metadata.xmlb
/var/cache/fwupd/metadata.xmlb:  6080m
# echo $?
0
# fuser /var/cache/fwupd/noneexist
Specified filename /var/cache/fwupd/noneexist does not exist.
# echo $?
1
# fuser /var/cache/apt/archives/lock
# echo $?
1
```



I added debug and I can see some messages like:
Specified filename /var/lib/apt/lists/lock* does not exist.

So probably this code will not work as intended in some cases, when lock file doesn't exist (yet?):
https://github.com/status-im/infra-role-bootstrap-linux/blob/109e157f8d66a61981c23cd6e006d950fe75efe2/raw/tasks/main.yml#L9
It will exit the loop as if no locks are open.
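
A possible way around this, sketched under assumptions: filter the lock paths down to files that exist before handing them to fuser, so a missing file cannot produce the error exit. The function names `existing_files` and `apt_locks_busy` are hypothetical, and this is not the code in the role.

```shell
# Print only those arguments that exist, one per line.
existing_files() {
  for f in "$@"; do
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Return 0 if any existing apt/dpkg lock file is held by a process.
apt_locks_busy() {
  # Globs that match nothing stay literal and are dropped by existing_files.
  set -- $(existing_files /var/lib/dpkg/lock* /var/lib/apt/lists/lock* /var/cache/apt/archives/lock*)
  # No lock files at all means nothing can be holding them.
  [ "$#" -gt 0 ] || return 1
  fuser "$@" >/dev/null 2>&1
}
```

A wait loop could then be written as `while apt_locks_busy; do sleep 5; done`, run as root so that fuser can see processes owned by other users.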

yakimant commented 1 year ago

[x] You need to install "jmespath" prior to running json_query filter

│ TASK [infra-role-bootstrap-linux : Volume | Identify device without partitions] ***
│ fatal: [8.218.174.108]: FAILED! => {}
│
│ MSG:
│
│ You need to install "jmespath" prior to running json_query filter

Command to reproduce:

ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Fix: Install on the controller node:

pip install jmespath

Follow-up: Add it to the setup documentation or requirements.txt / poetry project to each fleet repo.

yakimant commented 1 year ago

Alibaba Cloud images:

shards.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.shards:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230208.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_20_04_x64_20G_alibase_20200914.vhd"

Looks ok, although old hosts need to be upgraded to 22.04 at some point.

yakimant commented 1 year ago

More on the netdata installation:

They even have a community supported playbook: https://learn.netdata.cloud/docs/installing/install-with-a-cicd-provisioning-system/ansible which runs the kickstart.sh script which will likely install deb from a repo.

The most popular role from Galaxy: https://github.com/mrlesmithjr/ansible-netdata/ runs netdata-installer.sh

Why are they so obsessed with installer scripts?

yakimant commented 1 year ago

[ ] ssh fingerprint issue during the role::bootstrap:users tasks. Can happen on different steps, e.g.:

TASK [infra-role-bootstrap-linux : Create users groups] ************************
│ fatal: [8.218.174.108]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "8.218.174.108". Make sure this host can be reached over ssh: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
│ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
│ Someone could be eavesdropping on you right now (man-in-the-middle attack)!
│ It is also possible that a host key has just been changed.
│ The fingerprint for the ED25519 key sent by the remote host is
│ SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
│ Please contact your system administrator.
│ Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
│ Offending ED25519 key in /Users/status/.ssh/known_hosts:169
│ Agent forwarding is disabled to avoid man-in-the-middle attacks.
│ UpdateHostkeys is disabled because the host key is not trusted.
│ root@8.218.174.108: Permission denied (publickey).

Reproduced:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Workaround: Rerun Ansible without recreating an instance.

yakimant commented 1 year ago

Which is weird, because Ansible runs ssh with -o StrictHostKeyChecking=no, which should not check the fingerprint.

yakimant commented 1 year ago

Sometimes I see the same warning, but it doesn't fail Ansible:

 TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
 changed: [8.218.174.108] => {
     "changed": true,
     "rc": 0
 }

 STDERR:

 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
 It is also possible that a host key has just been changed.
 The fingerprint for the ED25519 key sent by the remote host is
 SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
 Please contact your system administrator.
 Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
 Offending ED25519 key in /Users/status/.ssh/known_hosts:169
 Agent forwarding is disabled to avoid man-in-the-middle attacks.
 UpdateHostkeys is disabled because the host key is not trusted.
 Shared connection to 8.218.174.108 closed.

yakimant commented 1 year ago

Didn't reproduce the netdata issue with:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-01.ac-cn-hongkong-c.shards.test"]'

Need to double check with recreation of instance.

# /opt/netdata/usr/libexec/netdata/install-service.sh --show-type
Detected platform: Linux
Detected service managers:
  - systemd: YES
  - openrc: NO
  - lsb: NO
  - initd: NO
  - runit: NO
Would use systemd service management.
# readlink /proc/1/exe
/usr/lib/systemd/systemd

No (deleted), so parsed properly.

yakimant commented 1 year ago

Caught the /var/lib/dpkg/lock-frontend issue:

TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       6815 F.... unattended-upgr
│ /var/lib/dpkg/lock-frontend:
│                      root       6815 F.... unattended-upgr
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
│ unattende 6815 root    8uW  REG  252,3        0 1786 /var/lib/dpkg/lock-frontend
│ unattende 6815 root  114uW  REG  252,3        0 1665 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Install SSHGuard package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 6815 (unattended-upgr)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

So it's the /usr/bin/unattended-upgrades process.

yakimant commented 1 year ago

Caught again:

│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       7191 F.... apt-get
│ /var/lib/dpkg/lock-frontend:
│                      root       7191 F.... apt-get
│ /var/cache/apt/archives/lock:
│                      root       7191 F.... apt-get
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
│ apt-get 7191 root    4uW  REG  252,1        0 71761 /var/lib/dpkg/lock-frontend
│ apt-get 7191 root    5uW  REG  252,1        0 71762 /var/lib/dpkg/lock
│ apt-get 7191 root    6uW  REG  252,1        0 69132 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Docker | Install package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 7191 (apt-get)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

apt-get this time

yakimant commented 1 year ago

[ ] ssh connection refused on GC:

│ TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Failed to connect to the host via ssh: ssh: connect to host 34.135.13.87 port 22: Connection refused

Reproduce:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.gc-us-central1-a[0].google_compute_instance.host["boot-01.gc-us-central1-a.shards.test"]' -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Maybe we need to wait a bit for the instance to be fully available via SSH.

Workaround: Ansible rerun helps

jakubgs commented 1 year ago

I think you are really overthinking this. The sleep in the first task in bootstrap is there for a reason.

I think you should stop trying to fix alibaba nonsense locking for now. It's probably just because their bootstrap doesn't finish because the instance you're using is too slow.

jakubgs commented 1 year ago

Also, I would recommend keeping research like this in the issue, and not in the PR.

yakimant commented 1 year ago

Yeah, I stopped investigating the non-blocking issues as we agreed yesterday. I just post whatever issues I encounter and rerun Ansible, which helps so far.

yakimant commented 1 year ago

[ ] ssh Permission denied issue during the role::bootstrap:users tasks on GC

│ TASK [infra-role-bootstrap-linux : Kill ubuntu user processes] *****************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│     "changed": false,
│     "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "34.135.13.87". Make sure this host can be reached over ssh: admin@34.135.13.87: Permission denied (publickey).

Reproduced on the 2nd run after instance created:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Rerun didn't help.

Need to add keys to admin user: https://github.com/status-im/infra-role-bootstrap-linux/pull/34

yakimant commented 1 year ago

I will create proper Issues afterwards as a follow-up.

yakimant commented 1 year ago

[ ] Trying to find an image, which supports this change: https://github.com/status-im/infra-role-nim-waku/commit/0de15086cd763305419d637415d7a3b5200e0cb8
statusteam/nim-waku:deploy-wakuv2-shards

statusteam/nim-waku:deploy-wakuv2-test doesn't support:

Unrecognized option 'pubsub-topic'
Try wakunode2 --help for more information.

Will revert to 75fa7e483cacccb482c99afddc7de3c25fb8a1fc in requirements for now

yakimant commented 1 year ago

[x] waku-peers fails to start:

$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug
[DEBUG] Connecting to Consul: localhost:8500
[INFO] Found 5 data centers.
[DEBUG] Querying: nim-waku (dc=do-ams3, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Found: boot-01.do-ams3.shards.test (env:shards,stage:test,nim,waku,libp2p)
[DEBUG] Querying: nim-waku (dc=aws-eu-central-1a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=he-eu-hel1, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=gc-us-central1-a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=ac-cn-hongkong-c, node_meta={'env': 'shards', 'stage': 'test'})
[INFO] Found 0 services.
Traceback (most recent call last):
File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
main()
File "/usr/local/bin/connect_waku_peers.py", line 125, in main
raise Exception('No services found')
Exception: No services found

~~probably~~ because no other nodes are started, will setup others now

yakimant commented 1 year ago

[x] another issue with waku-peers:


$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug

[DEBUG] RPC Call URL: http://localhost:8545
[DEBUG] RPC Call Payload: {'method': 'post_waku_v2_admin_v1_peers', 'params': [['/dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31']], 'jsonrpc': '2.0', 'id': 0}
Traceback (most recent call last):
File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
main()
File "/usr/local/bin/connect_waku_peers.py", line 142, in main
raise Exception('RPC Error: %s' % rval['error'])
Exception: RPC Error: {'code': -32000, 'message': 'post_waku_v2_admin_v1_peers raised an exception', 'data': 'Failed to connect to peer at index: 0 /dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31'}



Disabled the role, but probably it will fire up later

yakimant commented 1 year ago

[x] loop_control issues, label should be a string:


❯ ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "open-ports" -i ansible/inventory/test -v
Using /Users/status/work/infra-shards/ansible.cfg as config file
ERROR! The field 'label' is supposed to be a string type, however the incoming data structure is a <class 'ansible.parsing.yaml.objects.AnsibleMapping'>

The error appears to be in '/Users/status/.ansible/roles/open-ports/tasks/main.yml': line 20, column 5, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  loop_control:
    label:
    ^ here



https://github.com/status-im/infra-role-open-ports/blob/24dc30dbdf85e6758cb6924074b2f7a0f4541524/tasks/main.yml#L19-L23

Removed loop_control as a workaround

yakimant commented 1 year ago

[ ] nim_waku_node_key extraction from file fails if the key file is already created and the key is not set by a variable


ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "nim-waku" -i ansible/inventory/test -v
...
TASK [nim-waku : Generate random node key] ***********************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
"changed": false,
"false_condition": "not key_file.stat.exists and nim_waku_node_key is not defined\n",
"skip_reason": "Conditional result was False"
}

TASK [nim-waku : Save generate node key to file] *****
skipping: [boot-01.do-ams3.shards.test] => {
"changed": false,
"false_condition": "not key_file.stat.exists",
"skip_reason": "Conditional result was False"
}

TASK [nim-waku : Load existing node key from file] ***
skipping: [boot-01.do-ams3.shards.test] => {
"changed": false,
"false_condition": "key_generation.skipped is not defined and nim_waku_node_key is not defined\n",
"skip_reason": "Conditional result was False"
}

TASK [nim-waku : Extract the node key from file] *****
fatal: [boot-01.do-ams3.shards.test]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'content'. 'dict object' has no attribute 'content'

The error appears to be in '/Users/status/.ansible/roles/nim-waku/tasks/nodekey.yml': line 39, column 3, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: Extract the node key from file
  ^ here

Load is skipped wrongly, because generation is skipped. Maybe it should be the opposite? If generation is skipped - load from file.

https://github.com/status-im/infra-role-nim-waku/blob/75fa7e483cacccb482c99afddc7de3c25fb8a1fc/tasks/nodekey.yml#L31-L37

yakimant commented 1 year ago

to debug / catch the lock issues, I was adding:

- name: check locks (fuser)
  raw: |
    sudo fuser --verbose /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

- name: check locks (lsof)
  raw: |
    sudo lsof /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

before bootstrap/raw and apt commands.

Also, I think apt has the ability to wait for locks, but not the package. Will check in the related issue.

yakimant commented 1 year ago

Potential temporary workaround for netdata: copy /opt/netdata/system/netdata.service to /lib/systemd/system/netdata.service

yakimant commented 1 year ago

This PR is closed in favour of these 3 as requested by @jakubgs:

The following issues were discovered during the work on this PR:

status-im / infra-status

add boot hosts #1
