Closed yakimant closed 1 year ago
Current issues:
Set sshguard4 doesn't exist
ansible/requirements.yml update
SSH keys setup:
# module.boot.module.do-eu-amsterdam3[0].digitalocean_droplet.host["boot-01.do-ams3.boot.test"] will be created
...
+ ssh_keys = [
+ "20671731",
]
# module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-01.ac-cn-hongkong-c.shards.test"] will be created
...
+ key_name = "jakubgs"
For DO we can do it: https://github.com/status-im/infra-tf-digital-ocean/pull/1
For AC, it looks like only one key is allowed: https://registry.terraform.io/providers/aliyun/alicloud/latest/docs/resources/instance
As an alternative: infra-tf-multi-provider, so each devops overrides it locally.
The proper solution was to change the Ansible role locally.
sshguard4 should be configured by sshguard automatically, I guess.
It's failing with the following in the logs:
sshguard: '/usr/lib/x86_64-linux-gnu/sshg-fw-ipset' is not executable
Need to investigate the logic: https://github.com/status-im/infra-role-bootstrap-linux/blob/827e55412990026ad43756bd11f2cb698bdea622/templates/sshguard/sshguard.conf.j2#L3-L7
.terraform/modules/boot.ac-cn-hongkong-c/variables.tf
New issues:
infra-role-bootstrap-linux : Make sure essential pip packages are installed TAGS: [role::bootstrap:packages]
fails with AttributeError: cython_sources
pip install "Cython<3.0" && pip install "PyYAML==5.4.1" --no-build-isolation: https://github.com/status-im/infra-role-bootstrap-linux/pull/33
infra-role-bootstrap-linux/raw : Install mandatory packages
runs apt install endlessly:
apt -y install python3-minimal acl
`-apt -y install python3-minimal acl
`-sh -c test -x /usr/lib/needrestart/apt-pinvoke && /usr/lib/needrestart/apt-pinvoke || true
`-frontend -w /usr/share/debconf/frontend /usr/sbin/needrestart
|-needrestart /usr/sbin/needrestart
`-whiptail --backtitle Package configuration --title Daemons using outdated libraries --output-fd 11 --separate-output --checklist \012\012Which services should be restarted? 12 47 2 -- packagekit.service on unattended-upgrades.service
whiptail is for dialogs, so it's probably waiting for some input.
Looks like needrestart should be set up for non-interactive Ansible runs.
Other issues
infra-role-bootstrap-linux : Docker | Install package
failing with:
'/usr/bin/apt-get -y -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" install 'docker-ce=5:24.0.6-1~ubuntu.22.04~jammy' 'docker-compose=1.29.2-1'' failed: E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 13119 (apt-get)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
infra-role-wireguard : Install WireGuard packages on AC
lock_timeout apt option (since 2.12)
lsof or fuser
lock_timeout
register: apt_action
retries: 100
until: apt_action is success or ('Failed to lock apt for exclusive operation' not in apt_action.msg and '/var/lib/dpkg/lock' not in apt_action.msg)
systemd-run --property="After=apt-daily.service apt-daily-upgrade.service" --wait /bin/true
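The `until` expression above reads as: stop retrying once the task succeeds, or once it fails for a reason that has nothing to do with apt/dpkg locks. A minimal Python sketch of that predicate, assuming the registered result behaves like a dict with `failed` and `msg` keys (the function name `should_stop_retrying` is hypothetical, not part of the role):

```python
# Substrings that mark a failure as lock-related, taken from the `until` expression.
LOCK_MARKERS = (
    "Failed to lock apt for exclusive operation",
    "/var/lib/dpkg/lock",
)


def should_stop_retrying(apt_action: dict) -> bool:
    """Return True when retries should stop: success, or a non-lock failure."""
    succeeded = not apt_action.get("failed", False)
    msg = apt_action.get("msg", "")
    lock_related = any(marker in msg for marker in LOCK_MARKERS)
    return succeeded or not lock_related
```

Note that the `/var/lib/dpkg/lock` marker also matches `/var/lib/dpkg/lock-frontend`, so the lock-frontend error from the Docker task would keep the loop retrying.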
infra-role-bootstrap-linux : Consul | Create consul config directory
fails with
AnsibleError: An unhandled exception occurred while templating '{{lookup("bitwarden", "consul/cluster", field="encryption-key")}}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'bitwarden'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0). Error decoding Bitwarden status: Expecting value: line 1 column 1 (char 0)
bw unlock and export helped
infra-role-bootstrap-linux/raw : Install mandatory packages
returned on DO:
E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_multiverse_cnf_Commands-amd64 - open (2: No such file or directory)
│ E: Could not open file /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64 - open (2: No such file or directory)
...
│ E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'
│ E: Sub-process returned an error code
rm -rf /var/lib/apt/lists/* && apt update
should do the trick
infra-role-bootstrap-linux : Netdata | Restart service:
Could not find the requested service netdata: host
netdata.service is not installed:
# /opt/netdata.gz.run --accept --target /opt/netdata -- --dont-wait --dont-start-it --disable-https --disable-cloud --disable-telemetry
...
--- Install netdata at system init ---
ERROR: Failed to detect what type of service manager is in use.
/opt/netdata/usr/libexec/netdata/install-service.sh: 640: install_detect_service: not found
--- Install (but not enable) netdata updater tool ---
cat: /system/netdata-updater.timer: No such file or directory
cat: /system/netdata-updater.service: No such file or directory
Update script is located at /opt/netdata/usr/libexec/netdata/netdata-updater.sh
...
Unfortunately, it doesn't fail the installation.
This code fails to detect systemd:
https://github.com/netdata/netdata/blob/92515e41a52344fb1d346df5b54b953cb9de5055/system/install-service.sh.in#L182-L215
One of the issues:
# readlink /proc/1/exe
/usr/lib/systemd/systemd (deleted)
Note the (deleted) at the end. A restart should probably help.
The second is probably in the installer code itself: safe_pidof is not available.
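To illustrate the first point: the kernel appends ` (deleted)` to the `/proc/1/exe` link target when the running binary has been unlinked (for example, replaced by a package upgrade), so a plain basename comparison no longer equals `systemd`. A small Python sketch of the difference; both function names are hypothetical and this is not the Netdata installer's code:

```python
import os

DELETED_SUFFIX = " (deleted)"


def pid1_is_systemd_naive(exe_link: str) -> bool:
    """Same idea as: [ "$(basename "$(readlink /proc/1/exe)")" = "systemd" ]."""
    return os.path.basename(exe_link) == "systemd"


def pid1_is_systemd_tolerant(exe_link: str) -> bool:
    """Same check, but ignore the ' (deleted)' marker added for unlinked binaries."""
    if exe_link.endswith(DELETED_SUFFIX):
        exe_link = exe_link[: -len(DELETED_SUFFIX)]
    return os.path.basename(exe_link) == "systemd"
```

With the value seen on the host, the naive check compares `systemd (deleted)` against `systemd` and fails, while the tolerant variant still matches.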
I don't know why it even looks at this file; here is wakuv2.shards for example:
❯ ansible all -i ansible/inventory/shards -a 'grep jammy-backports /etc/apt/sources.list'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://mirrors.digitalocean.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy-backports main restricted universe multiverse
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
deb http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.cloud.aliyuncs.com/ubuntu/ jammy-backports main restricted universe multiverse
❯ ansible all -i ansible/inventory/shards -a 'ls /var/lib/apt/lists/*_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64'
node-01.do-ams3.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.digitalocean.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.gc-us-central1-a.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/us-central1.gce.archive.ubuntu.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
node-01.ac-cn-hongkong-c.wakuv2.shards | CHANGED | rc=0 >>
/var/lib/apt/lists/mirrors.cloud.aliyuncs.com_ubuntu_dists_jammy-backports_universe_cnf_Commands-amd64
Looks like they reference the cloud-specific repo mirrors.
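The file names in `/var/lib/apt/lists` follow from the mirror URL in `sources.list`: roughly, the scheme is dropped and every `/` becomes `_`. A simplified Python sketch of that mapping (the helper name `apt_list_filename` is hypothetical, and apt's real implementation also escapes some characters, which is ignored here):

```python
from urllib.parse import urlparse


def apt_list_filename(mirror_uri: str, suite: str, component: str, index: str) -> str:
    """Approximate the /var/lib/apt/lists file name for one index of a mirror."""
    parsed = urlparse(mirror_uri)
    # host + base path + dists/<suite>/<component>/<index>, then '/' -> '_'
    full = f"{parsed.netloc}{parsed.path.rstrip('/')}/dists/{suite}/{component}/{index}"
    return full.replace("/", "_")
```

So a host whose `sources.list` points at a cloud mirror has list files named after that mirror, while the error above mentions files named after `archive.ubuntu.com`.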
I don't get your issue with Netdata, the _check_systemd function works fine:
admin@boot-02.ac-cn-hongkong-c.shards.test:~ % head -n20 test.sh
#!/usr/bin/env sh
. ./functions.sh
_check_systemd() {
pids=''
p=''
myns=''
ns=''
# if the directory /lib/systemd/system OR /usr/lib/systemd/system (SLES 12.x) does not exit, it is not systemd
if [ ! -d /lib/systemd/system ] && [ ! -d /usr/lib/systemd/system ]; then
echo "NO" && return 0
fi
# if there is no systemctl command, it is not systemd
[ -z "$(command -v systemctl 2>/dev/null || true)" ] && echo "NO" && return 0
# if pid 1 is systemd, it is systemd
[ "$(basename "$(readlink /proc/1/exe)" 2> /dev/null)" = "systemd" ] && echo "YES" && return 0
admin@boot-02.ac-cn-hongkong-c.shards.test:~ % ./test.sh
YES
Seems like something else is at play. Maybe just an upgrade will help, not sure tho.
Also, it seems like now Netdata has its own ubuntu repository we could use:
So maybe the best thing would be to ditch the shitty installer and just use their repo.
Although one disadvantage of that is that pinning a version is harder. But it does appear they provide multiple versions.
I don't get your issue with Netdata, the _check_systemd function works fine:
I think they don't import functions.sh, and pids=$(safe_pidof systemd 2> /dev/null) silently fails.
fuser exits with 1 if the files are not open by any other process or if one of the files doesn't exist:
# fuser /var/cache/fwupd/metadata.xmlb
/var/cache/fwupd/metadata.xmlb: 6080m
# echo $?
0
Specified filename /var/cache/fwupd/noneexist does not exist.
1
1
I added debug and I can see some messages like:
Specified filename /var/lib/apt/lists/lock* does not exist.
So this code will probably not work as intended in some cases, when the lock file doesn't exist (yet?):
https://github.com/status-im/infra-role-bootstrap-linux/blob/109e157f8d66a61981c23cd6e006d950fe75efe2/raw/tasks/main.yml#L9
It will exit the loop as if no locks are open.
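One way to avoid treating "file does not exist" the same as "nobody holds the lock" is to check existence separately before asking who holds the file. A minimal Python sketch under that assumption; `classify_locks` and `fuser_reports_holder` are hypothetical helpers, not code from the role:

```python
import os
import subprocess
from typing import Callable, Dict, Iterable, List


def fuser_reports_holder(path: str) -> bool:
    """True if `fuser` finds at least one process accessing the file (exit code 0)."""
    return subprocess.run(["fuser", path], capture_output=True).returncode == 0


def classify_locks(paths: Iterable[str], is_held: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Split lock paths into held, free and missing instead of one exit code."""
    result: Dict[str, List[str]] = {"held": [], "free": [], "missing": []}
    for path in paths:
        if not os.path.exists(path):
            result["missing"].append(path)  # fuser would also exit 1 here
        elif is_held(path):
            result["held"].append(path)
        else:
            result["free"].append(path)
    return result
```

A wait loop could then keep polling while `held` is non-empty and decide explicitly what a non-empty `missing` list should mean.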
│ TASK [infra-role-bootstrap-linux : Volume | Identify device without partitions] ***
│ fatal: [8.218.174.108]: FAILED! => {}
│
│ MSG:
│
│ You need to install "jmespath" prior to running json_query filter
Command to reproduce:
ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'
Fix: Install on the controller node:
pip install jmespath
Follow-up:
Add it to the setup documentation or to a requirements.txt / poetry project in each fleet repo.
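Until that is documented, a quick pre-flight check on the controller can catch the missing module before Terraform starts Ansible. A small Python sketch; the helper name `missing_modules` is hypothetical:

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported on this machine."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["jmespath"])` returns an empty list once `pip install jmespath` has been run in the Python environment Ansible uses.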
Alibaba Cloud images:
shards.test:
❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
image_id = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
image_id = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
image_id = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
wakuv2.shards:
❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
image_id = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
resource "alicloud_instance" "host" {
image_id = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
wakuv2.test:
❯ terraform show | grep "image_id\|alicloud_images\|alicloud_instance" | grep -v module
data "alicloud_images" "host" {
image_id = "ubuntu_22_04_x64_20G_alibase_20230208.vhd"
resource "alicloud_instance" "host" {
image_id = "ubuntu_20_04_x64_20G_alibase_20200914.vhd"
Looks OK, although old hosts need to be upgraded to 22.04 at some point.
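The same comparison can be scripted instead of eyeballing the grep output. A minimal Python sketch that pulls Ubuntu versions out of `terraform show` text, assuming image IDs keep the `ubuntu_<major>_<minor>_...` naming seen above (both helper names are hypothetical):

```python
import re
from typing import List, Set, Tuple

# Matches lines like: image_id = "ubuntu_22_04_x64_20G_alibase_20230613.vhd"
IMAGE_RE = re.compile(r'image_id\s*=\s*"ubuntu_(\d+)_(\d+)_[^"]*"')


def ubuntu_versions(terraform_show_output: str) -> Set[Tuple[int, int]]:
    """Collect (major, minor) Ubuntu versions from image_id lines."""
    return {(int(major), int(minor)) for major, minor in IMAGE_RE.findall(terraform_show_output)}


def outdated(versions: Set[Tuple[int, int]], minimum: Tuple[int, int] = (22, 4)) -> List[Tuple[int, int]]:
    """Return the versions older than the minimum, sorted."""
    return sorted(v for v in versions if v < minimum)
```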
More on the netdata installation:
They even have a community supported playbook:
https://learn.netdata.cloud/docs/installing/install-with-a-cicd-provisioning-system/ansible
which runs the kickstart.sh script, which will likely install a deb from a repo.
The most popular role from Galaxy:
https://github.com/mrlesmithjr/ansible-netdata/
runs netdata-installer.sh
Why are they so obsessed with installer scripts?
role::bootstrap:users tasks.
Can happen on different steps, e.g.:
TASK [infra-role-bootstrap-linux : Create users groups] ************************
│ fatal: [8.218.174.108]: UNREACHABLE! => {
│ "changed": false,
│ "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "8.218.174.108". Make sure this host can be reached over ssh: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
│ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
│ Someone could be eavesdropping on you right now (man-in-the-middle attack)!
│ It is also possible that a host key has just been changed.
│ The fingerprint for the ED25519 key sent by the remote host is
│ SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
│ Please contact your system administrator.
│ Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
│ Offending ED25519 key in /Users/status/.ssh/known_hosts:169
│ Agent forwarding is disabled to avoid man-in-the-middle attacks.
│ UpdateHostkeys is disabled because the host key is not trusted.
│ root@8.218.174.108: Permission denied (publickey).
Reproduced:
ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.ac-cn-hongkong-c[0].alicloud_instance.host["boot-02.ac-cn-hongkong-c.shards.test"]' -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-02.ac-cn-hongkong-c.shards.test"]'
Workaround: Rerun Ansible without recreating an instance.
Which is weird, because Ansible runs ssh with -o StrictHostKeyChecking=no, which should not check the fingerprint.
Sometimes I see the same warning without it failing Ansible:
TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
changed: [8.218.174.108] => {
"changed": true,
"rc": 0
}
STDERR:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:aOSuugoc0NWC8EDVlrEujshzWdlh4TYD+SMAmUngXEo.
Please contact your system administrator.
Add correct host key in /Users/status/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /Users/status/.ssh/known_hosts:169
Agent forwarding is disabled to avoid man-in-the-middle attacks.
UpdateHostkeys is disabled because the host key is not trusted.
Shared connection to 8.218.174.108 closed.
Didn't reproduce the netdata issue with:
ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -target='module.boot.module.ac-cn-hongkong-c[0].null_resource.host["boot-01.ac-cn-hongkong-c.shards.test"]'
Need to double-check after recreating the instance.
# /opt/netdata/usr/libexec/netdata/install-service.sh --show-type
Detected platform: Linux
Detected service managers:
- systemd: YES
- openrc: NO
- lsb: NO
- initd: NO
- runit: NO
Would use systemd service management.
# readlink /proc/1/exe
/usr/lib/systemd/systemd
No (deleted), so it is parsed properly.
Caught the /var/lib/dpkg/lock-frontend issue:
TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ USER PID ACCESS COMMAND
│ /var/lib/dpkg/lock: root 6815 F.... unattended-upgr
│ /var/lib/dpkg/lock-frontend:
│ root 6815 F.... unattended-upgr
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
│ unattende 6815 root 8uW REG 252,3 0 1786 /var/lib/dpkg/lock-frontend
│ unattende 6815 root 114uW REG 252,3 0 1665 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Install SSHGuard package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 6815 (unattended-upgr)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
So it's the /usr/bin/unattended-upgrades process.
Caught again:
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ USER PID ACCESS COMMAND
│ /var/lib/dpkg/lock: root 7191 F.... apt-get
│ /var/lib/dpkg/lock-frontend:
│ root 7191 F.... apt-get
│ /var/cache/apt/archives/lock:
│ root 7191 F.... apt-get
...
│ TASK [infra-role-bootstrap-linux : check locks] ********************************
...
│ COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
│ apt-get 7191 root 4uW REG 252,1 0 71761 /var/lib/dpkg/lock-frontend
│ apt-get 7191 root 5uW REG 252,1 0 71762 /var/lib/dpkg/lock
│ apt-get 7191 root 6uW REG 252,1 0 69132 /var/cache/apt/archives/lock
...
│ TASK [infra-role-bootstrap-linux : Docker | Install package] *******************
...
│ E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 7191 (apt-get)
│ E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
apt-get this time.
│ TASK [infra-role-bootstrap-linux/raw : check locks] ****************************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│ "changed": false,
│ "unreachable": true
│ }
│
│ MSG:
│
│ Failed to connect to the host via ssh: ssh: connect to host 34.135.13.87 port 22: Connection refused
Reproduce:
ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -replace='module.boot.module.gc-us-central1-a[0].google_compute_instance.host["boot-01.gc-us-central1-a.shards.test"]' -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'
Maybe we need to wait a bit for the instance to be fully available via ssh.
Workaround: rerunning Ansible helps.
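Waiting for the instance could be as simple as polling the SSH port before the first task runs. A minimal Python sketch of that idea, assuming a plain TCP connect is a good enough readiness signal (the function name `wait_for_port` and its defaults are hypothetical):

```python
import socket
import time


def wait_for_port(host: str, port: int = 22, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll host:port until a TCP connection succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Connection refused / unreachable / timed out all raise OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port only shows that sshd is listening; it says nothing about whether the cloud provider's own bootstrap has finished or whether the keys are in place yet.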
I think you are really overthinking this. The sleep in the first task in bootstrap is there for a reason.
I think you should stop trying to fix alibaba nonsense locking for now. It's probably just because their bootstrap doesn't finish because the instance you're using is too slow.
Also, I would recommend keeping research like this in the issue, and not in the PR.
Yeah, I stopped investigating the non-blocking issues as we agreed yesterday. I just post whatever issues I encounter and rerun Ansible, which has helped so far.
Permission denied issue during the role::bootstrap:users tasks on GC
│ TASK [infra-role-bootstrap-linux : Kill ubuntu user processes] *****************
│ fatal: [34.135.13.87]: UNREACHABLE! => {
│ "changed": false,
│ "unreachable": true
│ }
│
│ MSG:
│
│ Data could not be sent to remote host "34.135.13.87". Make sure this host can be reached over ssh: admin@34.135.13.87: Permission denied (publickey).
Reproduced on the 2nd run after the instance was created:
ANSIBLE_VERBOSITY=1 terraform apply -auto-approve -target='module.boot.module.gc-us-central1-a[0].null_resource.host["boot-01.gc-us-central1-a.shards.test"]'
Rerun didn't help.
Need to add keys to admin user: https://github.com/status-im/infra-role-bootstrap-linux/pull/34
I will create proper Issues afterwards as a follow-up.
[ ] Trying to find an image that supports this change: https://github.com/status-im/infra-role-nim-waku/commit/0de15086cd763305419d637415d7a3b5200e0cb8
statusteam/nim-waku:deploy-wakuv2-shards
statusteam/nim-waku:deploy-wakuv2-test doesn't support:
Unrecognized option 'pubsub-topic'
Try wakunode2 --help for more information.
Will revert to 75fa7e483cacccb482c99afddc7de3c25fb8a1fc in requirements for now
waku-peers fails to start:
$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug
[DEBUG] Connecting to Consul: localhost:8500
[INFO] Found 5 data centers.
[DEBUG] Querying: nim-waku (dc=do-ams3, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Found: boot-01.do-ams3.shards.test (env:shards,stage:test,nim,waku,libp2p)
[DEBUG] Querying: nim-waku (dc=aws-eu-central-1a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=he-eu-hel1, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=gc-us-central1-a, node_meta={'env': 'shards', 'stage': 'test'})
[DEBUG] Querying: nim-waku (dc=ac-cn-hongkong-c, node_meta={'env': 'shards', 'stage': 'test'})
[INFO] Found 0 services.
Traceback (most recent call last):
File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
main()
File "/usr/local/bin/connect_waku_peers.py", line 125, in main
raise Exception('No services found')
Exception: No services found
Probably because no other nodes are started; will set up the others now.
waku-peers:
$ /usr/local/bin/connect_waku_peers.py --rpc-host=localhost --rpc-port=8545 --rpc-timeout=20 --rpc-retries=5 --service='{"name": "nim-waku", "env": "shards", "stage": "test"}' --log-level=debug
[DEBUG] RPC Call URL: http://localhost:8545
[DEBUG] RPC Call Payload: {'method': 'post_waku_v2_admin_v1_peers', 'params': [['/dns4/boot-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-01.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31', '/dns4/boot-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31']], 'jsonrpc': '2.0', 'id': 0}
Traceback (most recent call last):
File "/usr/local/bin/connect_waku_peers.py", line 154, in <module>
Disabled the role, but probably it will fire up later
❯ ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "open-ports" -i ansible/inventory/test -v
Using /Users/status/work/infra-shards/ansible.cfg as config file
ERROR! The field 'label' is supposed to be a string type, however the incoming data structure is a <class 'ansible.parsing.yaml.objects.AnsibleMapping'>
The error appears to be in '/Users/status/.ansible/roles/open-ports/tasks/main.yml': line 20, column 5, but may be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
  loop_control:
    label:
    ^ here
https://github.com/status-im/infra-role-open-ports/blob/24dc30dbdf85e6758cb6924074b2f7a0f4541524/tasks/main.yml#L19-L23
Removed loop_control as a workaround
nim_waku_node_key extraction from the file fails if the key file was already created and the key is not set by a variable
ansible-playbook ansible/main.yml --limit boot-01.do-ams3.shards.test --tags "nim-waku" -i ansible/inventory/test -v
...
TASK [nim-waku : Generate random node key] ***********************************************************************
skipping: [boot-01.do-ams3.shards.test] => {
"changed": false,
"false_condition": "not key_file.stat.exists and nim_waku_node_key is not defined\n",
"skip_reason": "Conditional result was False"
}
TASK [nim-waku : Save generate node key to file] *****
skipping: [boot-01.do-ams3.shards.test] => {
"changed": false,
"false_condition": "not key_file.stat.exists",
"skip_reason": "Conditional result was False"
}
TASK [nim-waku : Load existing node key from file] ***
skipping: [boot-01.do-ams3.shards.test] => {
"changed": false,
"false_condition": "key_generation.skipped is not defined and nim_waku_node_key is not defined\n",
"skip_reason": "Conditional result was False"
}
TASK [nim-waku : Extract the node key from file] *****
fatal: [boot-01.do-ams3.shards.test]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'content'. 'dict object' has no attribute 'content'
The error appears to be in '/Users/status/.ansible/roles/nim-waku/tasks/nodekey.yml': line 39, column 3, but may be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
Load is skipped wrongly, because generation is skipped. Maybe it should be the opposite? If generation is skipped - load from file.
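The skip logic from the log can be modelled as plain boolean functions. A skipped Ansible task still registers a result with `skipped` set, so `key_generation.skipped is not defined` is only true when generation actually ran. A small Python sketch under that assumption; `load_runs_proposed` is a hypothetical alternative condition, not what the role does:

```python
def load_runs_current(key_file_exists: bool, key_var_defined: bool) -> bool:
    """Model of the current `when` on 'Load existing node key from file'."""
    # Generate runs when: not key_file.stat.exists and nim_waku_node_key is not defined
    generate_runs = (not key_file_exists) and (not key_var_defined)
    # A skipped task registers `skipped: true`, so `.skipped` is defined when it did not run.
    skipped_is_defined = not generate_runs
    return (not skipped_is_defined) and (not key_var_defined)


def load_runs_proposed(key_file_exists: bool, key_var_defined: bool) -> bool:
    """Hypothetical alternative: load whenever the file exists and no key variable is set."""
    return key_file_exists and not key_var_defined
```

For the failing case (key file exists, `nim_waku_node_key` not set) the current model skips the load, which leaves the extract task without `content` to read.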
To debug / catch the lock issues, I was adding:
- name: check locks (fuser)
  raw: |
    sudo fuser --verbose /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true
- name: check locks (lsof)
  raw: |
    sudo lsof /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock* || true
before bootstrap/raw and apt commands.
Also, I think apt has the ability to wait for locks, but package does not.
Will check in the related issue.
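On the apt side, waiting for the dpkg lock is exposed through the `DPkg::Lock::Timeout` option (in seconds), which `apt-get` accepts via `-o`. A minimal Python sketch that builds such a command line; the helper name `apt_get_install_argv` and the 300-second default are assumptions for illustration:

```python
from typing import List, Sequence


def apt_get_install_argv(packages: Sequence[str], lock_timeout_s: int = 300) -> List[str]:
    """Build an apt-get install command that waits for the dpkg lock instead of failing."""
    return [
        "apt-get", "-y",
        "-o", f"DPkg::Lock::Timeout={lock_timeout_s}",
        "install", *packages,
    ]
```

For the raw bootstrap step this would give `apt-get -y -o DPkg::Lock::Timeout=300 install python3-minimal acl`. Whether that is enough when unattended-upgrades holds the lock for a long time would still need testing.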
Potential temporary workaround for netdata: copy /opt/netdata/system/netdata.service to /lib/systemd/system/netdata.service
This PR is closed in favour of these 3, as requested by @jakubgs:
The following issues were discovered during the work on this PR:
Ansible issue
Solved by: