status-im / infra-shards

Infrastructure for Status fleets
https://github.com/status-im/nim-waku

add boot hosts #1

Closed yakimant closed 9 months ago

yakimant commented 10 months ago

Ansible issue

Not logged into Bitwarden: please run 'bw login', or 'bw unlock' and set the BW_SESSION environment variable first

Solved by:

bw login
bw unlock
export BW_SESSION=SMTH
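
A handier variant (a sketch, assuming a recent Bitwarden CLI): capture the session key directly instead of pasting it by hand.

# --raw makes `bw unlock` print only the session token
export BW_SESSION="$(bw unlock --raw)"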
yakimant commented 10 months ago

Current issues:

yakimant commented 10 months ago

SSH keys setup:

# module.boot.module.do-eu-amsterdam3[0].digitalocean_droplet.host["boot-01.do-ams3.boot.test"] will be created
...
      + ssh_keys             = [
          + "20671731",
        ]

# module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-01.ac-cn-hongkong-c.shards.test"] will be created
...
      + key_name                           = "jakubgs"
yakimant commented 10 months ago

For DO we can do it: https://github.com/status-im/infra-tf-digital-ocean/pull/1

For AC, it looks like only one key is allowed: https://registry.terraform.io/providers/aliyun/alicloud/latest/docs/resources/instance

As an alternative:

The proper solution was to change the Ansible role locally.

yakimant commented 10 months ago

sshguard4 should be configured by sshguard automatically, I guess.

It's failing with the following in the logs:

sshguard: '/usr/lib/x86_64-linux-gnu/sshg-fw-ipset' is not executable

Need to investigate the logic: https://github.com/status-im/infra-role-bootstrap-linux/blob/827e55412990026ad43756bd11f2cb698bdea622/templates/sshguard/sshguard.conf.j2#L3-L7
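
A quick diagnostic (a sketch, assuming the role templates /etc/sshguard/sshguard.conf and the Debian/Ubuntu backend path from the error above):

# which backend the config points at, and whether that binary is actually executable
grep '^BACKEND' /etc/sshguard/sshguard.conf
ls -l /usr/lib/x86_64-linux-gnu/sshg-fw-ipset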

yakimant commented 9 months ago

New issues:

yakimant commented 9 months ago

whiptail is for dialogs; it's probably waiting for some input.

yakimant commented 9 months ago

Looks like needrestart should be set up for non-interactive Ansible:
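
A minimal sketch of one way to do that (assuming a conf.d drop-in is acceptable; the file name 90-ansible.conf is just an example):

# tell needrestart to restart services automatically instead of prompting via whiptail
echo "\$nrconf{restart} = 'a';" | sudo tee /etc/needrestart/conf.d/90-ansible.conf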

yakimant commented 9 months ago

Other issues

yakimant commented 9 months ago

netdata.service is not installed:

# /opt/netdata.gz.run --accept --target /opt/netdata -- --dont-wait --dont-start-it --disable-https --disable-cloud --disable-telemetry
...
 --- Install netdata at system init ---
ERROR: Failed to detect what type of service manager is in use.
/opt/netdata/usr/libexec/netdata/install-service.sh: 640: install_detect_service: not found
 --- Install (but not enable) netdata updater tool ---
cat: /system/netdata-updater.timer: No such file or directory
cat: /system/netdata-updater.service: No such file or directory
Update script is located at /opt/netdata/usr/libexec/netdata/netdata-updater.sh
...

Unfortunately it doesn't fail the installation.

yakimant commented 9 months ago

This code fails to detect systemd: https://github.com/netdata/netdata/blob/92515e41a52344fb1d346df5b54b953cb9de5055/system/install-service.sh.in#L182-L215

One of the issues:

# readlink /proc/1/exe
/usr/lib/systemd/systemd (deleted)

Note the (deleted) at the end. A restart would probably help.
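
A minimal illustration of why that breaks the detection (assuming the basename comparison from the _check_systemd snippet quoted in a later comment):

# after a systemd upgrade the old binary is unlinked, so readlink appends " (deleted)",
# and the basename no longer equals "systemd"
link="/usr/lib/systemd/systemd (deleted)"
[ "$(basename "$link")" = "systemd" ] && echo YES || echo NO   # -> NO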

The second issue is probably in the installer code itself: safe_pidof is not available.

yakimant commented 9 months ago

I don't know why it even looks at this file; here is wakuv2.shards for example:

❯ ansible all -i ansible/inventory/shards -a 'grep jammy-backports /etc/apt/sources.list'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse

❯ ansible all -i ansible/inventory/shards -a 'ls /var/lib/apt/lists/*_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.digitalocean.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/us-central1.gce.archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.cloud.aliyuncs.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64

Looks like they reference the cloud-specific repo mirrors.

jakubgs commented 9 months ago

I don't get your issue with Netdata, the _check_systemd function works fine:

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % head -n20 test.sh
#!/usr/bin/env sh
. ./functions.sh

_check_systemd() {
  pids=''
  p=''
  myns=''
  ns=''

  # if the directory /lib/systemd/system OR /usr/lib/systemd/system (SLES 12.x) does not exist, it is not systemd
  if [ ! -d /lib/systemd/system ] && [ ! -d /usr/lib/systemd/system ]; then
    echo "NO" && return 0
  fi

  # if there is no systemctl command, it is not systemd
  [ -z "$(command -v systemctl 2>/dev/null || true)" ] && echo "NO" && return 0

  # if pid 1 is systemd, it is systemd
  [ "$(basename "$(readlink /proc/1/exe)" 2> /dev/null)" = "systemd" ] && echo "YES" && return 0

admin@boot-02.ac-cn-hongkong-c.shards.test:~ % ./test.sh 
YES

Seems like something else is at play. Maybe just an upgrade will help, not sure tho.

jakubgs commented 9 months ago

Also, it seems like now Netdata has its own ubuntu repository we could use:

So maybe the best thing would be to ditch the shitty installer and just use their repo.

Although one disadvantage of that is that pinning a version is harder. But it does appear they provide multiple versions.
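
For reference, pinning from an apt repo is still possible, just clunkier (a sketch; the version string is purely illustrative):

sudo apt-get install netdata=1.42.2-1
sudo apt-mark hold netdata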

yakimant commented 9 months ago

I don't get your issue with Netdata, the _check_systemd function works fine:

I think they don't import functions.sh and pids=$(safe_pidof systemd 2> /dev/null) silently fails.
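
A small illustration of the silent failure (assuming functions.sh was indeed never sourced):

# the "not found" error goes to stderr, which 2>/dev/null swallows, so pids stays empty
pids=$(safe_pidof systemd 2> /dev/null)
echo "pids='${pids}'"   # prints pids=''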

yakimant commented 9 months ago

fuser /var/cache/fwupd/noneexist
Specified filename /var/cache/fwupd/noneexist does not exist.
echo $?
1

fuser /var/cache/apt/archives/lock
echo $?
1

I added debugging and I can see messages like:
Specified filename /var/lib/apt/lists/lock* does not exist.

So this code will probably not work as intended in cases where the lock file doesn't exist (yet?):
https://github.com/status-im/infra-role-bootstrap-linux/blob/109e157f8d66a61981c23cd6e006d950fe75efe2/raw/tasks/main.yml#L9
It will exit the loop as if no locks are held.
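
A lock-wait sketch that tolerates lock files which don't exist yet (assuming these are the paths the bootstrap role cares about):

for lock in /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend \
            /var/lib/apt/lists/lock /var/cache/apt/archives/lock; do
  # fuser exits 0 only when some process actually holds the file
  while [ -e "$lock" ] && sudo fuser "$lock" >/dev/null 2>&1; do
    echo "waiting for $lock ..."
    sleep 5
  done
done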
yakimant commented 9 months ago

Command to reproduce:

ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Fix: Install on the controller node:

pip install jmespath

Follow-up: add it to the setup documentation, or add a requirements.txt / Poetry project to each fleet repo.

yakimant commented 9 months ago

Alibaba Cloud images:

shards.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.shards:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"

wakuv2.test:

❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
            image_id                = "ubuntu_22_04_x64_20G_alibase_20230208.vhd"
resource "alicloud_instance" "host" {
    image_id                           = "ubuntu_20_04_x64_20G_alibase_20200914.vhd"

Looks OK, although old hosts need to be upgraded to 22.04 at some point.

yakimant commented 9 months ago

More on the netdata installation:

They even have a community-supported playbook: https://learn.netdata.cloud/docs/installing/install-with-a-cicd-provisioning-system/ansible, which runs the kickstart.sh script and will likely install the deb from a repo.

The most popular role from Galaxy: https://github.com/mrlesmithjr/ansible-netdata/ runs netdata-installer.sh

Why are they so obsessed with installer scripts?

yakimant commented 9 months ago

Reproduced:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'

Workaround: Rerun Ansible without recreating an instance.

yakimant commented 9 months ago

Which is weird, because Ansible runs ssh with -o StrictHostKeyChecking=no, which should not check the fingerprint.
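
Two hedged options for getting past the changed-key warning (shown in the next comment) on recreated instances; the IP is the one from the logs, and overriding ANSIBLE_SSH_ARGS replaces Ansible's default ssh options, so treat this as a sketch:

# drop the stale known_hosts entry for the recreated host
ssh-keygen -R 8.218.174.108
# or skip known_hosts checks entirely for these runs
export ANSIBLE_SSH_ARGS='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'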

yakimant commented 9 months ago

Sometimes I see this issue, which does not fail Ansible:

 TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
 changed: [8.218.174.108] => {
     "changed": true,
     "rc": 0
 }

 STDERR:

 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
 Someone could be eavesdropping on you right now (man-in-the-middle attack)!
 It is also possible that a host key has just been changed.
 The fingerprint for the ED25519 key sent by the remote host is
 SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
 Please contact your system administrator.
 Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
 Offending ED25519 key in /Users/status/.ssh/known_hosts:169
 Agent forwarding is disabled to avoid man-in-the-middle attacks.
 UpdateHostkeys is disabled because the host key is not trusted.
 Shared connection to 8.218.174.108 closed.
yakimant commented 9 months ago

Didn't reproduce the netdata issue with:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-01.ac-cn-hongkong-c.shards.test"]'

Need to double-check by recreating the instance.

# /opt/netdata/usr/libexec/netdata/install-service.sh --show-type
Detected platform: Linux
Detected service managers:
  - systemd: YES
  - openrc: NO
  - lsb: NO
  - initd: NO
  - runit: NO
Would use systemd service management.
# readlink /proc/1/exe
/usr/lib/systemd/systemd

No (deleted) this time, so it is parsed properly.

yakimant commented 9 months ago

Caught the /var/lib/dpkg/lock-frontend issue:

TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       6815 F.... unattended-upgr
│ /var/lib/dpkg/lock-frontend:
│                      root       6815 F.... unattended-upgr
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
│ unattende 6815 root    8uW  REG  252,3        0 1786 /var/lib/dpkg/lock-frontend
│ unattende 6815 root  114uW  REG  252,3        0 1665 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Install SSHGuard package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 6815 (unattended-upgr)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

So it's the /usr/bin/unattended-upgrades process.

yakimant commented 9 months ago

Caught again:

│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│                      USER        PID ACCESS COMMAND
│ /var/lib/dpkg/lock:  root       7191 F.... apt-get
│ /var/lib/dpkg/lock-frontend:
│                      root       7191 F.... apt-get
│ /var/cache/apt/archives/lock:
│                      root       7191 F.... apt-get
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
│ apt-get 7191 root    4uW  REG  252,1        0 71761 /var/lib/dpkg/lock-frontend
│ apt-get 7191 root    5uW  REG  252,1        0 71762 /var/lib/dpkg/lock
│ apt-get 7191 root    6uW  REG  252,1        0 69132 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Docker | Install package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 7191 (apt-get)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

It's apt-get this time.

yakimant commented 9 months ago

Reproduce:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -replace='module.boot.module.gc-us-central1-a[0].google_compute_instance.host["boot-01.gc-us-central1-a.shards.test"]' -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Maybe we need to wait a bit for the instance to be fully available via SSH.
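
A sketch of that idea (host name taken from the command above, admin user as on the other hosts): poll SSH instead of relying on a fixed sleep.

until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no admin@boot-01.gc-us-central1-a.shards.test true; do
  sleep 5
done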

Workaround: rerunning Ansible helps.

jakubgs commented 9 months ago

I think you are really overthinking this. The sleep in the first task in bootstrap is there for a reason.

I think you should stop trying to fix alibaba nonsense locking for now. It's probably just because their bootstrap doesn't finish, because the instance you're using is too slow.

jakubgs commented 9 months ago

Also, I would recommend keeping research like this in the issue, and not in the PR.

yakimant commented 9 months ago

Yeah, I stopped investigating the non-blocking issues as we agreed yesterday. I just post whatever issues I encounter and rerun Ansible, which has helped so far.

yakimant commented 9 months ago

Reproduced on the 2nd run after the instance was created:

ANSIBLE_VERBOSITY=1  terraform apply -auto-approve -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'

Rerun didn't help.

Need to add keys to the admin user: https://github.com/status-im/infra-role-bootstrap-linux/pull/34

yakimant commented 9 months ago

I will create proper Issues afterwards as a follow-up.

yakimant commented 9 months ago

Will revert to 75fa7e483cacccb482c99afddc7de3c25fb8a1fc in requirements for now

yakimant commented 9 months ago

Probably because no other nodes are started; will set up the others now.

yakimant commented 9 months ago

[DEBUG] RPC Call URL: http://localhost:8545
[DEBUG] RPC Call Payload: {'method': 'post_waku_v2_admin_v1_peers', 'params': [['/dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31']], 'jsonrpc': '2.0', 'id': 0}
Traceback (most recent call last):
  File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
    main()
  File "/usr/local/bin/connect_waku_peers.py", line 142, in main
    raise Exception('RPC Error: %s' % rval['error'])
Exception: RPC Error: {'code': -32000, 'message': 'post_waku_v2_admin_v1_peers raised an exception', 'data': 'Failed to connect to peer at index: 0 /dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31'}



Disabled the role, but it will probably fire up later.
yakimant commented 9 months ago

The error appears to be in '/Users/status/.ansible/roles/open-ports/tasks/main.yml': line 20, column 5, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

loop_control:
  label:
  ^ here



https://github.com/status-im/infra-role-open-ports/blob/24dc30dbdf85e6758cb6924074b2f7a0f4541524/tasks/main.yml#L19-L23

Removed loop_control as a workaround
yakimant commented 9 months ago

TASK [nim-waku : Save generate node key to file] *****
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "not key_file.stat.exists",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Load existing node key from file] ***
skipping: [boot-01.do-ams3.shards.test] => {
    "changed": false,
    "false_condition": "key_generation.skipped is not defined and nim_waku_node_key is not defined\n",
    "skip_reason": "Conditional result was False"
}

TASK [nim-waku : Extract the node key from file] *****
fatal: [boot-01.do-ams3.shards.test]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'content'. 'dict object' has no attribute 'content'

The error appears to be in '/Users/status/.ansible/roles/nim-waku/tasks/nodekey.yml': line 39, column 3, but may be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

The load is wrongly skipped because generation was skipped. Maybe it should be the opposite: if generation is skipped, load from the file.

https://github.com/status-im/infra-role-nim-waku/blob/75fa7e483cacccb482c99afddc7de3c25fb8a1fc/tasks/nodekey.yml#L31-L37

yakimant commented 9 months ago

to debug / catch the lock issues, I was adding:

- name: check locks (fuser)
  raw: |
    sudo fuser --verbose /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

- name: check locks (lsof)
  raw: |
    sudo lsof /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true

before bootstrap/raw and apt commands.

Also, I think apt has the ability to wait for locks, but not the package. Will check in the related issue.
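
For reference, a hedged example of apt's own lock waiting (apt >= 1.9.11; 120 seconds and sshguard are just illustrative, matching the failing task above):

sudo apt-get -o DPkg::Lock::Timeout=120 install -y sshguard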

yakimant commented 9 months ago

Potential temporary workaround for netdata: copy /opt/netdata/system/netdata.service to /lib/systemd/system/netdata.service
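
As commands (a sketch of the workaround above; assumes systemd and that enabling the unit right away is desired):

sudo cp /opt/netdata/system/netdata.service /lib/systemd/system/netdata.service
sudo systemctl daemon-reload
sudo systemctl enable --now netdata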

yakimant commented 9 months ago

This PR is closed in favour of these 3, as requested by @jakubgs:

The following issues were discovered during the work on this PR: