sandia-minimega / minimega


[minimega] Recover VMs still running after a minimega crash #1493

Closed activeshadow closed 1 year ago

activeshadow commented 1 year ago

This PR attempts to add support to minimega for recovering VMs after a crash. Currently, here's how it works:

Recovery currently runs only on the local node. This should be fine, since each node in a mesh would be responsible for its own recovery process.

activeshadow commented 1 year ago

@aherna this is still a work in progress, but it's finally coming along nicely. As of now, I can recover a VM that's still running, get everything to show up correctly in vm info table, connect to it over VNC, etc. There are still lots of edge cases that I haven't tested yet.

Just wanted to go ahead and get this in front of you.

aherna commented 1 year ago

@jacdavi is going to get this review started

activeshadow commented 1 year ago

Sounds good @aherna.

@jacdavi as an FYI, I still need to support recovering VMs that were in the BUILDING stage when minimega crashed or otherwise went away suddenly. I also have not yet looked into recovering containers.

activeshadow commented 1 year ago

Hey @jacdavi, have you had a chance to take a look at these changes and/or test anything yet?

jacdavi commented 1 year ago

Yeah sorry, looking at it now.

I started minimega, started a vm (configuring only a disk), then ran pkill -9 minimega -e.

Once restarting minimega with -recover I get an error FATAL main.go:259: Unable to read taps file for vm 0: open /tmp/minimega/0/taps: no such file or directory and it exits.

I'm guessing that's because I haven't made any taps, but it seems like if that's the case it shouldn't error out

activeshadow commented 1 year ago

Thanks! Easy fix. This is what I need... I had tested so many things while developing I couldn't remember what I hadn't tested.

I'll get this fixed tonight, along with anything else you find before then. 😂

jacdavi commented 1 year ago

Error above looks good now! I'll let you know if I figure out how to break anything else :)

jacdavi commented 1 year ago

I'm trying to recover our entire phenix experiment and get a seg fault. I'll try to debug some more, but you may know what the issue is quicker than me


./minimega/bin/minimega -context the-$HOSTNAME -filepath phenix/images -nostdin -recover -level debug

minimega, Copyright (2014) Sandia Corporation.
Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
the U.S. Government retains certain rights in this software.
2023/02/20 23:09:33 INFO external.go:241: kernel version: 4.15.0, minimum: 3.18
2023/02/20 23:09:33 INFO external.go:124: kvm found at: /usr/bin/kvm
2023/02/20 23:09:33 INFO external.go:124: tc found at: /sbin/tc
2023/02/20 23:09:33 INFO external.go:124: modprobe found at: /sbin/modprobe
2023/02/20 23:09:33 INFO external.go:124: dnsmasq found at: /usr/sbin/dnsmasq
2023/02/20 23:09:33 INFO external.go:124: dhclient found at: /sbin/dhclient
2023/02/20 23:09:33 INFO external.go:124: scp found at: /usr/bin/scp
2023/02/20 23:09:33 INFO external.go:124: ovs-vsctl found at: /usr/bin/ovs-vsctl
2023/02/20 23:09:33 INFO external.go:124: ovs-ofctl found at: /usr/bin/ovs-ofctl
2023/02/20 23:09:33 INFO external.go:124: ip found at: /sbin/ip
2023/02/20 23:09:33 INFO external.go:124: ssh found at: /usr/bin/ssh
2023/02/20 23:09:33 INFO external.go:124: qemu-img found at: /usr/bin/qemu-img
2023/02/20 23:09:33 INFO external.go:124: blockdev found at: /sbin/blockdev
2023/02/20 23:09:33 INFO external.go:124: tar found at: /bin/tar
2023/02/20 23:09:33 INFO external.go:124: mount found at: /bin/mount
2023/02/20 23:09:33 INFO external.go:124: cp found at: /bin/cp
2023/02/20 23:09:33 INFO external.go:124: ntfs-3g found at: /bin/ntfs-3g
2023/02/20 23:09:33 INFO external.go:124: taskset found at: /usr/bin/taskset
2023/02/20 23:09:33 INFO external.go:124: lsmod found at: /sbin/lsmod
2023/02/20 23:09:33 INFO external.go:124: qemu-nbd found at: /usr/bin/qemu-nbd
2023/02/20 23:09:33 DEBUG external.go:281: cmd dnsmasq completed in 5.048753ms
2023/02/20 23:09:33 INFO external.go:241: dnsmasq version: 2.79, minimum: 2.73
2023/02/20 23:09:33 DEBUG external.go:281: cmd ovs-vsctl completed in 2.970849ms
2023/02/20 23:09:33 INFO external.go:241: ovs-vsctl version: 2.9.0, minimum: 1.11
2023/02/20 23:09:33 DEBUG external.go:281: cmd kvm completed in 25.097428ms
2023/02/20 23:09:33 INFO external.go:241: qemu version: 2.11.1, minimum: 1.6
2023/02/20 23:09:33 DEBUG ovs.go:221: running ovs cmd: /usr/bin/ovs-vsctl set Open_vSwitch . other_config:vlan-limit=2
2023/02/20 23:09:33 INFO external.go:260: checking for kernel module: kvm
2023/02/20 23:09:33 DEBUG external.go:281: cmd lsmod completed in 2.333555ms
2023/02/20 23:09:33 INFO main.go:214: attempting to recover any vms from previous minimega instances
2023/02/20 23:09:33 INFO vlans.go:141: adding VLAN alias recovery//example => 101
...
2023/02/20 23:09:33 INFO main.go:230: recovering vms for namespace recovery
2023/02/20 23:09:33 INFO namespace.go:119: creating new namespace -- `recovery`
2023/02/20 23:09:33 INFO server.go:98: registered new ron server: phenix/images/recovery
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc0 pc=0x6068e7]

goroutine 1 [running]:
github.com/sandia-minimega/minimega/v2/internal/meshage.(*Node).checkUpdateNetwork(0x0)
        /scratch/minimega/internal/meshage/node.go:593 +0x27
github.com/sandia-minimega/minimega/v2/internal/meshage.(*Node).BroadcastRecipients(0x0)
        /scratch/minimega/internal/meshage/message.go:183 +0x5f
main.NewNamespace({0xc0004870f9, 0x8})
        /scratch/minimega/cmd/minimega/namespace.go:167 +0x65f
main.GetOrCreateNamespace({0xc0004870f9, 0x8})
        /scratch/minimega/cmd/minimega/namespace.go:901 +0xc8
main.main()
        /scratch/minimega/cmd/minimega/main.go:232 +0x1349

activeshadow commented 1 year ago

Thanks @jacdavi I'll look into it more tonight.

activeshadow commented 1 year ago

@jacdavi looks like you're hitting this because I've yet to test recovering VMs from a namespace other than the default minimega namespace. I will need to relocate the recovery code to be run after the mesh has been started. Shouldn't be a big deal. I'll get a commit pushed out tonight or first thing tomorrow for you to redo this test against. Thanks!
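
The trace above shows methods being called on a nil `*Node` receiver, which fits the explanation that recovery ran before the mesh existed. The sketch below illustrates that ordering problem in isolation. All names here (`node`, `mesh`, `startMesh`, `recoverNamespace`) are hypothetical stand-ins, not the meshage or minimega types, and the explicit nil check is just one possible guard alongside reordering startup.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a mesh node.
type node struct {
	peers []string
}

// recipients reads a field, so calling it on a nil *node panics with a
// nil pointer dereference.
func (n *node) recipients() []string {
	return n.peers
}

// mesh is package-level state that stays nil until startMesh runs.
var mesh *node

func startMesh() {
	mesh = &node{peers: []string{"node-a", "node-b"}}
}

// recoverNamespace needs the mesh; it returns an error instead of
// panicking when it is called too early.
func recoverNamespace(name string) (int, error) {
	if mesh == nil {
		return 0, fmt.Errorf("cannot recover namespace %q: mesh not started", name)
	}
	return len(mesh.recipients()), nil
}

func main() {
	_, err := recoverNamespace("recovery") // called before startMesh
	fmt.Println(err)

	startMesh()
	n, err := recoverNamespace("recovery") // same call after the mesh is up
	fmt.Println(n, err)
}
```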

jacdavi commented 1 year ago

Thanks for the quick updates! Ran the test again and I made it further. It looks like we get through recovering a number of VMs, but fail on internet_router (which is built from a phenix app).

Here's the error:

panic: assignment to entry in nil map

goroutine 1 [running]:
main.(*BaseConfig).ReadConfig(0xc00027d8b8, {0xa99020?, 0xc0003b7328?}, {0xc0002de479, 0x8})
        /scratch/minimega/cmd/minimega/vmconfiger_cli.go:1072 +0x1e5
main.(*VMConfig).ReadConfig(0xc00027d8b8, {0xa9ab88?, 0xc0003b7328}, {0xc0002de479, 0x8})
        /scratch/minimega/cmd/minimega/vmconfig.go:136 +0x65
main.recover()
        /scratch/minimega/cmd/minimega/recovery.go:56 +0x58b
main.main()
        /scratch/minimega/cmd/minimega/main.go:251 +0xf3b

I tested a quick fix by adding this to vmconfiger_cli#ReadConfig and it fixed the error. No crashes after that, though I didn't see things come back up. I'll look into why tomorrow.

// Initialize tags, if not already
if v.Tags == nil {
    v.Tags = map[string]string{}
}

activeshadow commented 1 year ago

@jacdavi the vmconfiger_cli.go file gets auto-generated by the vmconfiger executable using some templates and code reflection. I'll have to fix the template to make sure the Tags map gets initialized if needed, like you did in your quick fix. Let me know if you were able to test any more today.

jacdavi commented 1 year ago

Ok sounds good. You might be able to move the map initialization up the call stack, I just tried there because it was easiest.
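
As a rough illustration of the two places the initialization could live, the sketch below creates the map once in a constructor (further up the call stack) and also guards the write site. The `config` type, `newConfig`, and `setTag` are hypothetical and only loosely modeled on the Tags field discussed here; the generated minimega code may look quite different.

```go
package main

import "fmt"

// config is a hypothetical stand-in for a VM config with a Tags map.
type config struct {
	Tags map[string]string
}

// newConfig initializes Tags up front so every later write is safe.
func newConfig() *config {
	return &config{Tags: make(map[string]string)}
}

// setTag also guards against a zero-value config built without newConfig.
func (c *config) setTag(k, v string) {
	if c.Tags == nil {
		c.Tags = make(map[string]string)
	}
	c.Tags[k] = v
}

func main() {
	var zero config // Tags is nil; a direct zero.Tags["k"] = "v" would panic
	zero.setTag("name", "miniserver7")

	c := newConfig()
	c.Tags["device-type"] = "server" // safe: map was created in the constructor

	fmt.Println(zero.Tags["name"], c.Tags["device-type"]) // prints "miniserver7 server"
}
```

Initializing in the constructor keeps the check out of every writer, while the guard at the write site protects values that were created as plain zero-value structs.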

activeshadow commented 1 year ago

Yeah, that's an option I was going to look into as well. 👍

activeshadow commented 1 year ago

@jacdavi were you able to do any more testing to see if things come back up after recovering from a crash?

jacdavi commented 1 year ago

Yeah, so I'm not seeing any errors, but just not seeing anything recover either. I can send you the full logs

jacdavi commented 1 year ago

Ok, after teaching me about namespaces, I haven't seen any issues. @aherna after recovery miniccc works fine for send/exec/mnt

activeshadow commented 1 year ago

Sweet! @aherna and @jacdavi I'll get the commits for this PR squashed and ready for merging ASAP. 👍🏻

activeshadow commented 1 year ago

@aherna @jacdavi okay I think this is ready for final review and merge!

jacdavi commented 1 year ago

@activeshadow did you make any additional changes before squashing? I grabbed a new copy of the branch and am seeing an issue where things recover, but the VMs are mixed up (at least for VNC) in phenix/miniweb.

activeshadow commented 1 year ago

Nope no changes @jacdavi just squashed commits.

jacdavi commented 1 year ago

Ok, I think I found the issue. In recovery.go we create a new VM then afterwards reset the ID. But that doesn't appear to update the instancePath variable, so the VM's path may be wrong.

I was consistently having VNC connections mixed up, but have not seen the issue since adding this line at line 136 of recovery.go:

            kvm.instancePath = filepath.Join(*f_base, vm.VMID)

Also, it appears that tags aren't being recovered correctly. For example {"device-type":"server","name":"miniserver7"} turns into {"tags":"name"} after recovery. I haven't looked at a fix for this yet.

Sorry I didn't find these earlier!

activeshadow commented 1 year ago

@jacdavi okay, both issues should be fixed now.

activeshadow commented 1 year ago

@jacdavi @aherna I just now rebased this branch on top of the master branch.

Any updates on when you'll be able to test this again so we can put a bow on it?

jacdavi commented 1 year ago

Hey Bryan,

I’ve been out of town this week, but I’ll test asap on Monday. Sorry for the delay!



jacdavi commented 1 year ago

Ok, it looks like our AD server isn't getting an IP address despite it being statically defined in the phenix topology. It seems to work fine with the earlier version I have of the PR (just before the first squash). I switched back and forth a few times and found it to be consistent.

I briefly checked the tags and VNC paths and those seem fixed.

activeshadow commented 1 year ago

@jacdavi whoa... that's interesting. I'm assuming that's not the only Windows VM you're deploying? Are the other VMs getting their IPs or are they DHCP?

I don't recall changing the Windows startup script inject, so I'll have to go back and take a look.

More to come.

jacdavi commented 1 year ago

It appears to just be windows VMs getting an IP of 0.0.0.0. Our statically assigned linux VMs look fine. If you need to take a look at our environment, feel free

jacdavi commented 1 year ago

For reference, this issue seems to have been introduced with #1494, but was only noticed while testing this PR. Everything here works as expected.

activeshadow commented 1 year ago

Thanks for merging @jacdavi! I should have squashed my commits one more time before you did. My bad...

jacdavi commented 1 year ago

@activeshadow thank you for all of your work and help on this! I don't think squashing matters too much, I probably had an option to squash on merge, but I forgot to check ¯\_(ツ)_/¯