Closed coesensbert closed 11 months ago
raised to major
Which environment?
Which network is this on? Can we have the contract ID and/or node ID?
All nets. I discovered the issue on devnet; this VM runs on mainnet node 3168 (10.13.0.122), image: https://hub.grid.tf/tf-official-vms/ubuntu-22.04-lts.flist
➜ testvm4 terraform show
# grid_deployment.d1:
resource "grid_deployment" "d1" {
    id = "52345"
    name = "vm"
    network_name = "testingetw4"
    node = 3168
    solution_provider = 0
    solution_type = "Virtual Machine"
    disks {
        name = "root"
        size = 25
    }
    vms {
        computedip = "185.69.167.210/24"
        computedip6 = "2a02:1802:5e:0:d02f:96ff:febc:27b2/64"
        console_url = "10.192.3.1:20002"
        corex = false
        cpu = 4
        description = "Threefold Ops tst vm4"
        env_vars = {
            "SSH_KEY" = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDYNeJXJV2FNEwuQz6e0jkKeqRbKWwftBKq+sjSTqa2x"
        }
        flist = "https://hub.grid.tf/tf-official-vms/ubuntu-22.04-lts.flist"
        ip = "10.192.3.2"
        memory = 4096
        name = "opstestvm4"
        planetary = true
        publicip = true
        publicip6 = true
        rootfs_size = 0
        ygg_ip = "300:8b9f:68a4:9450:2f99:7f95:9c7f:d3ce"
        mounts {
            disk_name = "root"
            mount_point = "/data"
        }
    }
}
# grid_network.net1:
resource "grid_network" "net1" {
    access_wg_config = <<-EOT
        [Interface]
        Address = 100.64.192.2
        PrivateKey = aFc6+kbRxMc1zqdNKw0IB/A05ev3kCNKBd3GHvcbg3c=
        [Peer]
        PublicKey = kxbXx1AwpztQTFfTUdPskDVOxpUA/qivRyi+8ouPPWo=
        AllowedIPs = 10.192.0.0/16, 100.64.0.0/16
        PersistentKeepalive = 25
        Endpoint = 185.69.166.140:2464
    EOT
    add_wg_access = true
    description = "node network 4"
    external_ip = "10.192.2.0/24"
    external_sk = "aFc6+kbRxMc1zqdNKw0IB/A05ev3kCNKBd3GHvcbg3c="
    id = "94384757-6b3a-45d8-9968-9b6a5643da0d"
    ip_range = "10.192.0.0/16"
    name = "testingetw4"
    node_deployment_id = {
        "1" = 52344
        "3168" = 52343
    }
    nodes = [
        3168,
    ]
    nodes_ip_range = {
        "1" = "10.192.4.0/24"
        "3168" = "10.192.3.0/24"
    }
    public_node_id = 1
    solution_type = "Network"
}
Outputs:
node1_vm1_ip = "10.192.3.2"
public_ip = "185.69.167.210/24"
public_ip6 = "2a02:1802:5e:0:d02f:96ff:febc:27b2/64"
wg_config = <<-EOT
    [Interface]
    Address = 100.64.192.2
    PrivateKey = aFc6+kbRxMc1zqdNKw0IB/A05ev3kCNKBd3GHvcbg3c=
    [Peer]
    PublicKey = kxbXx1AwpztQTFfTUdPskDVOxpUA/qivRyi+8ouPPWo=
    AllowedIPs = 10.192.0.0/16, 100.64.0.0/16
    PersistentKeepalive = 25
    Endpoint = 185.69.166.140:2464
EOT
ygg_ip = "300:8b9f:68a4:9450:2f99:7f95:9c7f:d3ce"
Thanks @muhamadazmy for assisting. The faulty kernel installed by an apt upgrade is linux-image-5.15.0-88-generic. Resolved with linux-image-5.15.0-90-generic:
add-apt-repository ppa:canonical-kernel-team/ppa
apt update
apt install linux-image-5.15.0-90-generic
Do NOT upgrade and then reboot existing deployments. If an upgrade is needed, install the kernel above before rebooting.
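As a guard before rebooting, here is a minimal sketch that flags the faulty kernel package if it is installed. The function names `is_faulty_kernel` and `faulty_installed` and the `FAULTY_KERNELS` list are assumptions made for illustration; only 5.15.0-88-generic is reported as faulty in this thread.

```shell
#!/bin/sh
# Kernel releases reported as faulty in this thread.
# Assumption: only this one; extend the space-separated list if others show up.
FAULTY_KERNELS="5.15.0-88-generic"

# is_faulty_kernel RELEASE
# Returns 0 if RELEASE (in the form printed by `uname -r`) is in FAULTY_KERNELS.
is_faulty_kernel() {
    for bad in $FAULTY_KERNELS; do
        [ "$1" = "$bad" ] && return 0
    done
    return 1
}

# faulty_installed
# Reads package names on stdin (one per line) and prints the
# linux-image-* packages whose release is in FAULTY_KERNELS.
faulty_installed() {
    while read -r pkg; do
        release=${pkg#linux-image-}
        if is_faulty_kernel "$release"; then
            echo "$pkg"
        fi
    done
}

# Possible use on the VM, listing only fully installed ("ii") packages:
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' 'linux-image-*' \
#       | awk '$1 == "ii" { print $2 }' | faulty_installed
```

Any output from the pipeline in the closing comment means a faulty kernel is installed, so the newer kernel should be installed before rebooting.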
When testing the latest cloud image (https://cloud-images.ubuntu.com/jammy/current/), an apt upgrade also installs the faulty linux-image-5.15.0-88-generic, with the same result after reboot.
@xmonader it's 100% an Ubuntu kernel fault. The changelog for the next build already lists a fix for virtio-net issues (the same module that causes the crash during boot).
Updating to the next kernel (which requires ppa:canonical-kernel-team/ppa, as @coesensbert commented) brings in the fix, and the VM then reboots fine.
@coesensbert suggests that we run a daily test (deploy, update, reboot, check) to make sure things like this don't go unnoticed.
I assume other cloud providers host their own package repos, so updates to their VMs are well tested before they are made available to users.
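The daily test suggested above (deploy, update, reboot, check) could be sketched roughly as follows. This is an illustration under assumptions, not an existing test: `run_steps`, the step functions, the `VM_HOST` placeholder, and the wait times are all hypothetical, and it assumes the Terraform config for the test VM is in the current directory.

```shell
#!/bin/sh
# run_steps STEP...
# Runs each named step (a shell function or command) in order and stops at
# the first failure, reporting which step failed. Returns 0 only if all pass.
run_steps() {
    for step in "$@"; do
        echo "step: $step"
        if ! "$step"; then
            echo "FAILED: $step" >&2
            return 1
        fi
    done
    echo "all steps passed"
}

# Hypothetical placeholder for the test VM's SSH address.
VM_HOST="${VM_HOST:-root@vm.example}"

deploy() { terraform apply -auto-approve; }
update() { ssh "$VM_HOST" 'apt update && apt upgrade -y'; }
# The SSH session usually drops while the VM goes down, so ignore its status.
reboot_vm() { ssh "$VM_HOST" reboot || true; }
check() {
    # Give the VM time to actually go down, then poll SSH for about 5 minutes.
    sleep 30
    i=0
    while [ "$i" -lt 30 ]; do
        ssh -o ConnectTimeout=5 "$VM_HOST" true && return 0
        i=$((i + 1))
        sleep 10
    done
    return 1
}

# From a daily cron job or CI schedule:
#   run_steps deploy update reboot_vm check
```

A VM that hits the faulty kernel never comes back after `reboot_vm`, so `check` times out and the run reports `FAILED: check`.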
Reproduce:
apt update && apt upgrade -y
reboot
Console output:
Another vm in this state: https://gist.github.com/coesensbert/f36c631cf96448517918ed67393da566