lima-vm / lima

Linux virtual machines, with a focus on running containers
https://lima-vm.io/
Apache License 2.0
15.03k stars, 589 forks

vz: lima managed vm hangs with high CPU usage intermittently. #1609

Open vsiravar opened 1 year ago

vsiravar commented 1 year ago

Problem

The Virtualization framework process intermittently starts consuming 100%-220% CPU (as reported by Activity Monitor) and becomes unresponsive. All limactl commands then hang or fail. This happens intermittently after the Lima VM has been started and left alone for a while.

Behaviour observed

Once the VM gets into this state, all limactl commands fail.

Workaround

The only workaround found so far is to recreate the VM.

Related issue

https://github.com/docker/for-mac/issues/6655

Expected behaviour

The VM should not hang when the computer wakes up from sleep.

Host info

macOS version: 13.4
cpu brand: Apple M1 Pro
lima version: 0.16.0

balajiv113 commented 1 year ago

@vsiravar Is this consistently reproducible?

Was any CPU-intensive task running in the VM before sleep?

vsiravar commented 1 year ago

> @vsiravar Is this consistently reproducible?

No, it's quite intermittent.

> Was any CPU-intensive task running in the VM before sleep?

Not really, I just have a hello-world container running in the VM. I have not experienced this behaviour with QEMU.

balajiv113 commented 1 year ago

@vsiravar Current master now supports a video display. If possible, could you enable the display and try to reproduce the issue?

When it hangs, you can check from the UI whether the VM is still accessible. That will give us an idea of whether the issue is with the network or with the VM itself.

vsiravar commented 1 year ago

> Current master now supports a video display. If possible, could you enable the display and try to reproduce the issue?

Sure, will try this out. Thanks!

ningziwen commented 1 year ago

I don't think this only happens when the computer wakes up from sleep...

I successfully initialized the VM and ran some commands normally. But after I replied to several messages in Slack and came back (around 10 minutes later), it started to hang and returned FATA[0928] exit status 255.

The VM service shows 300%+ CPU usage.

[Image: Screenshot 2023-06-08 at 3 33 14 PM]

balajiv113 commented 1 year ago

@ningziwen Could you also try enabling the display as mentioned above and see what happens?

Also, please share the template you used.

ningziwen commented 1 year ago

@balajiv113 Sorry, I didn't understand what that means. Would you like me to make a screen recording and upload the video? Or should I use some GUI? If it is a GUI, could you point me to the instructions?

balajiv113 commented 1 year ago

@ningziwen Steps to enable display

This will give us an idea of whether the issue is with the network or with the whole VM itself.

balajiv113 commented 1 year ago

I tried the above steps myself. I haven't seen high CPU usage, but the freeze does happen.

When I checked the GUI during the freeze, even that was unresponsive, so I think the freeze happens at the Virtualization.framework level, not in the network.

I have also raised a support ticket with Apple with the same information.

Note: This happens to me on M1 only. My Intel machine runs smoothly for weeks through sleep and wake cycles.

ningziwen commented 1 year ago

@balajiv113 Hey. Did you get any reply from Apple? Is the support ticket link sharable?

vsiravar commented 1 year ago

Updated ticket description and title based on new behaviour observed.

ryancurrah commented 1 year ago

Maybe once https://github.com/lima-vm/lima/issues/1659 is resolved, you can look at serial.log to see if there are any related log messages.

bsideup commented 1 year ago

Confirming that this is still happening (HEAD as of today, M1)

outcoldman commented 1 year ago

I am also experiencing the same issue. I just started using limactl instead of other VM providers. First I had to deal with the time shift, so I added the following:

timedatectl set-ntp no
apt update
apt install -y ntp

Now, every morning I find high CPU usage and cannot access my VMs.

kj-creater commented 11 months ago

I started a Lima virtual machine with the following commands and logged in to the VM as root through the video display:

limactl create --name=default template://docker \
--cpus=2 --memory=4 --vm-type=vz --mount-writable=true \
--disk=5 --network=lima:user-v2 --rosetta --video

limactl start

How can I confirm whether it is a problem with the virtual machine network or the m1 virtualization service?

I have encountered both of the following situations:

  1. lima date -R freezes in the terminal, but the video display shows that the virtual machine is still running, and CPU usage is not high;
  2. lima date -R freezes in the terminal, the video display shows that the virtual machine has stopped, and the Virtualization process takes up 200% CPU;

How can I help identify the problem in the above two situations?

lima version: 0.18.0
macOS version: 14.0 (23A344)
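One way to tell the two situations above apart without watching the terminal is to run the probe command with a deadline and combine the result with the CPU reading of the Virtualization process. The following is only a sketch, not something proposed in the thread: the helper names `probe` and `classify`, the 10-second deadline, and the 100% "busy" threshold are all assumptions for illustration, and the CPU value is expected to be read manually (for example from Activity Monitor).

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd with a deadline and return 'ok', 'failed', or 'timeout'.

    'timeout' means the command did not finish within timeout_s seconds,
    which is how a frozen `lima date -R` would show up.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        # The command is not installed or not on PATH.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


def classify(probe_result, vm_cpu_percent, busy_threshold=100.0):
    """Map a probe result plus a CPU reading onto the two scenarios.

    busy_threshold is an assumed cut-off, not a value from Lima or macOS.
    """
    if probe_result == "ok":
        return "responsive"
    if probe_result == "failed":
        return "command failed"
    if vm_cpu_percent >= busy_threshold:
        return "scenario 2: VM itself likely hung"
    return "scenario 1: VM likely alive, suspect network/SSH path"


# Example (assumes the `lima` wrapper is on PATH and 200.0 was read by hand):
#   print(classify(probe(["lima", "date", "-R"]), 200.0))
```

The classification only narrows down where to look next; it follows the reasoning earlier in the thread that a VM which is still responsive on the display points at the network path, while a frozen display points at the VM or the Virtualization framework.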

kj-creater commented 11 months ago

While I was writing the above, the second scenario happened:

  1. There is no error message in the ha.stderr.log file
  2. The Virtualization process CPU usage is 200%
  3. The video display shows a frozen image

terev commented 10 months ago

@balajiv113 Was able to catch the following in the network log when this occurs:

time="2023-11-12T19:22:25-05:00" level=info msg="new connection from  to "
2023/11/12 19:22:28 tcpproxy: for incoming conn 127.0.0.1:56720, error dialing "192.168.104.1:22": connect tcp 192.168.104.1:22: connection was refused
time="2023-11-12T19:22:44-05:00" level=error msg="r.CreateEndpoint() = connection was refused"

Unsure if this is relevant. The network process seems to remain alive.

terev commented 10 months ago

I tried disabling Rosetta, but that did not help. One interesting thing I noticed, though: with Rosetta disabled, when the VM hangs, CPU usage is pinned at half of what is allocated, that is, at 100% with 2 CPUs allocated. With Rosetta enabled it is usually pinned at 200%.

cdfmlr commented 10 months ago

After upgrading my M2 Mac mini to Sonoma, I've been encountering this issue frequently. Yesterday, I noticed that one of my Lima VMs and a UTM VM (both using Virtualization.framework) froze simultaneously.

The UTM VM works after killing and restarting it. However, the Lima VM fails to restart after a lima stop -f. When I use lima start, that VM encounters errors similar to issue #1915 (as far as I remember; the logs are lost). Recreating the VM solves the problem.

In addition, my Colima VM, also running on vz, has been hanging frequently as well. I can always resolve it by using the lima stop -f command and then restarting it.

terev commented 10 months ago

I'm able to reproduce this issue almost every time when starting a large docker compose project (which I'm unable to share unfortunately). Today I noticed something new. I opened the system log utility to view any logs related to virtualization during one of these events. Doing so I was able to get some logs that seem interesting:

default 16:56:39.933077-0500    symptomsd   Received CPU usage trigger: 
  com.apple.Virtualization.Virtual[72861] () used 90.01s of CPU over 177.06 seconds (averaging 50%), violating a CPU usage limit of 90.00s over 180 seconds.
default 16:56:40.028006-0500    symptomsd   RESOURCE_NOTIFY trigger for com.apple.Virtualization.Virtual [72861] (90009971208 nanoseconds of CPU usage over 177.00s seconds, violating limit of 90000000000 nanoseconds of CPU usage over 180.00s seconds)
default 17:18:27.814709-0500    runningboardd   Periodic Run States <RBProcessState| identity:xpcservice<com.apple.Virtualization.VirtualMachine([anon<limactl>(502):72856])(502)>:72861 role:UserInteractive gpuRole:None explicitJetsamBand:0 memoryLimit:Inactive(Default) flags:60 guaranteedRunning:NO legacyFinishTaskReason:0 inheritances:<RBMutableInheritanceCollection| inheritancesByEnvironment:{

    }> primitiveAssertions:[
    <RBSProcessAssertionInfo| type:2 reason:20246 name:"Domain" domain:"com.apple.launchservicesd:RoleUserInteractive" expl:"uielement:72861">
    ]>

These logs occur very close to when the VM begins to hang. From my naive perspective, it looks like the OS may be killing the virtualization process, or severely throttling it, for using too much CPU. Does that seem possible? I tried setting the VM's CPU count to the number of cores my machine has but am still able to reproduce this. Side note: strangely, I'm able to set the number of CPUs to a number larger than my machine has.

The final log entry occurs some time after the VM begins to hang.
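To make the numbers in the symptomsd message easier to compare, the sketch below extracts the CPU seconds and time windows from such a line and converts both the observed usage and the limit into average percentages. The function name `parse_cpu_trigger` and the regular expression are illustrative assumptions based only on the single log line quoted above; other macOS versions may word the message differently.

```python
import re

# Pattern derived from the one symptomsd line quoted in this thread.
_TRIGGER = re.compile(
    r"used (?P<used>[\d.]+)s of CPU over (?P<window>[\d.]+) seconds"
    r".*?limit of (?P<limit>[\d.]+)s over (?P<limit_window>[\d.]+) seconds"
)


def parse_cpu_trigger(line):
    """Return usage and limit as average percentages, or None if no match."""
    m = _TRIGGER.search(line)
    if m is None:
        return None
    used = float(m["used"])
    window = float(m["window"])
    limit = float(m["limit"])
    limit_window = float(m["limit_window"])
    return {
        "used_s": used,
        "window_s": window,
        # CPU seconds divided by wall-clock seconds, as a percentage.
        "average_percent": 100.0 * used / window,
        "limit_percent": 100.0 * limit / limit_window,
    }


line = (
    "com.apple.Virtualization.Virtual[72861] () used 90.01s of CPU over "
    "177.06 seconds (averaging 50%), violating a CPU usage limit of "
    "90.00s over 180 seconds."
)
print(parse_cpu_trigger(line))
```

For the quoted line, the limit of 90.00s over 180 seconds works out to an average of 50%, so a VM pinned at 100% or more for a few minutes would cross it. That only shows the trigger was reached; it does not by itself show that macOS throttled or killed the process.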

n-io commented 5 months ago

> I tried the above steps myself. I haven't seen high CPU usage, but the freeze does happen.
>
> When I checked the GUI during the freeze, even that was unresponsive, so I think the freeze happens at the Virtualization.framework level, not in the network.
>
> I have also raised a support ticket with Apple with the same information.
>
> Note: This happens to me on M1 only. My Intel machine runs smoothly for weeks through sleep and wake cycles.

I have the same issue with QEMU. Running the same command will sometimes work and sometimes freeze the VM, requiring a stop --force, with CPU usage somewhere around 400%. However, the 400% CPU usage occurs in the qemu-system-x86_64 task. I'm on an M2 Mac, with x86_64: "max" set under cpuType in my config, using QEMU v8.2.1.

You have already raised a ticket with Apple, but would it be possible to double-check and confirm if in your scenario the behaviour is reproducible using qemu instead of vz?