rivosinc / salus

RISC-V hypervisor for TEE development

Increase the default number of IMSIC files to 5 #307

Closed · atishp04 closed this 1 year ago

atishp04 commented 1 year ago

If the host scheduler tries to schedule more vCPUs on a pCPU than the pCPU has IMSIC files, it results in a TVM failure. This happens in a common test case: running two TVMs with 2 vCPUs each.

Increasing the number of IMSIC files only helps to run the bare-minimum setup in a multi-TVM SMP configuration.

In the long run, the host kernel can't do much about it except notify the VMM with the correct reason, i.e. insufficient IMSIC files. It's up to the VMM to decide whether it wants to show an appropriate error to the user or perform lazy pinning if possible.

abrestic-rivos commented 1 year ago

This change is fine and all, but:

If the host scheduler tries to schedule more vCPUs on a pCPU than the pCPU has IMSIC files, it results in a TVM failure. This happens in a common test case: running two TVMs with 2 vCPUs each.

What exactly is the failure that's happening here? KVM should "unbind" vCPUs (i.e. swap them to SW interrupt files) if the physical CPU is over-subscribed. When there are assigned devices it's a different story, but we're not there yet.

In the long run, the host kernel can't do much about it except notify the VMM with the correct reason, i.e. insufficient IMSIC files. It's up to the VMM to decide whether it wants to show an appropriate error to the user or perform lazy pinning if possible.

I feel like we're eventually going to need something better than "punt it to the VMM". Pinning of vCPUs has generally been an optional optimization, not a requirement for functional VMs.

rajnesh-kanwal commented 1 year ago

This change is fine and all, but:

If the host scheduler tries to schedule more vCPUs on a pCPU than the pCPU has IMSIC files, it results in a TVM failure. This happens in a common test case: running two TVMs with 2 vCPUs each.

What exactly is the failure that's happening here? KVM should "unbind" vCPUs (i.e. swap them to SW interrupt files) if the physical CPU is over-subscribed. When there are assigned devices it's a different story, but we're not there yet.

In the long run, the host kernel can't do much about it except notify the VMM with the correct reason, i.e. insufficient IMSIC files. It's up to the VMM to decide whether it wants to show an appropriate error to the user or perform lazy pinning if possible.

I feel like we're eventually going to need something better than "punt it to the VMM". Pinning of vCPUs has generally been an optional optimization, not a requirement for functional VMs.

The problem is not that we don't have enough files; rather, we cannot free an IMSIC file immediately so that another vCPU can take it. An IMSIC file cannot be freed before the rebind flow completes, so for some time a vCPU holds two IMSIC files at once. This window is quite long, so the issue occurs frequently and is easily reproducible.

Here is the flow for rebinding in Linux.

  1. Acquire a free file and mark it as taken in the bitmap.
  2. Complete the whole rebinding flow (Begin, Fence, Clone, End).
  3. Mark the previous IMSIC file as free in the bitmap.
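
To make the window concrete, here is a minimal sketch of that flow (all names are illustrative; this is not the actual Linux KVM AIA code). With 3 guest files per core, it plays out the scenario described next:

```c
#include <stdio.h>

#define NR_FILES 3             /* guest interrupt files per pCPU */

static unsigned int taken[2];  /* per-core bitmap of in-use files */

/* Step 1: acquire a free file on the destination core. */
static int acquire_file(int core)
{
    for (int i = 0; i < NR_FILES; i++) {
        if (!(taken[core] & (1u << i))) {
            taken[core] |= 1u << i;
            return i;
        }
    }
    return -1;                 /* out of IMSIC files */
}

/* Step 3: the old file is released only after the whole rebind
 * (Begin, Fence, Clone, End) has completed. */
static void release_file(int core, int file)
{
    taken[core] &= ~(1u << file);
}

int main(void)
{
    /* VM0C0 and VM1C0 each hold a Core0 file. */
    int vm0c0 = acquire_file(0);
    int vm1c0 = acquire_file(0);

    /* They begin rebinding to Core1 but have not yet released their
     * Core0 files, so when VM0C1 and VM1C1 arrive on Core0 there is
     * only one file left. */
    printf("VM0C1 gets file %d\n", acquire_file(0)); /* succeeds: file 2 */
    printf("VM1C1 gets file %d\n", acquire_file(0)); /* fails: -1 */

    /* Only now do the rebinds complete and the old files come back. */
    release_file(0, vm0c0);
    release_file(0, vm1c0);
    return 0;
}
```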

Now, for example, say we have two VMs with two vCPUs each on a 2-pCPU machine.

Core0: VM0C0, VM1C0
Core1: VM0C1, VM1C1

We hit the out-of-IMSIC-files issue when the scheduler tries to move both vCPUs of each VM around, as shown below.

Core0: VM0C1, VM1C1
Core1: VM0C0, VM1C0

Now, before VM0C0 and VM1C0 free their previous IMSIC files on Core0, VM0C1 and VM1C1 kick in and look for free IMSIC files there. One of them will get a file, as we have 3 files in total, but the other one will not.

We will also hit the same case for the config below. If the scheduler decides to move VM0C1 to Core0 and VM1C0 to Core1, we end up with the same problem.

Core0: VM0C0, VM1C1
Core1: VM0C1, VM1C0

As you suggested, KVM should move to a SW file instead, and I agree, but AFAIK support for runtime migration between HWACCEL and EMUL is missing in KVM. I think working around this for now is better in terms of getting the RFC out. We can then look into this issue later on.

Just a vague thought: we could also add another bitmap recording which files are still in the freeing process; a vCPU could then wait (with a timeout) for one of those rather than immediately going down the whole SW-file path.
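
A minimal sketch of that idea (the second bitmap and the bounded poll are hypothetical, not existing KVM code): if nothing is mid-free, the caller bails out to the SW-file path immediately; otherwise it polls briefly.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define NR_FILES 3

static unsigned int taken;       /* files currently in use */
static unsigned int being_freed; /* files whose rebind has not finished */

static int try_acquire(void)
{
    for (int i = 0; i < NR_FILES; i++) {
        if (!(taken & (1u << i))) {
            taken |= 1u << i;
            return i;
        }
    }
    return -1;
}

/* Wait (bounded) for an in-flight free to complete; on timeout the
 * caller falls back to the SW-file path as before. */
static int acquire_with_timeout(int max_polls)
{
    for (int n = 0; n < max_polls; n++) {
        int f = try_acquire();
        if (f >= 0)
            return f;
        if (!being_freed)
            break;               /* no free in flight; waiting cannot help */
        struct timespec ts = { 0, 100 * 1000 }; /* poll every 100us */
        nanosleep(&ts, NULL);
    }
    return -1;
}

int main(void)
{
    taken = 0x7;                 /* all three files held... */
    being_freed = 0x1;           /* ...one of them still mid-free */
    /* Single-threaded demo: the free never completes, so this
     * times out and returns -1. */
    printf("file = %d\n", acquire_with_timeout(10));
    return 0;
}
```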

abrestic-rivos commented 1 year ago

As you suggested, KVM should move to a SW file instead, and I agree, but AFAIK support for runtime migration between HWACCEL and EMUL is missing in KVM. I think working around this for now is better in terms of getting the RFC out. We can then look into this issue later on.

Yeah, I'm fine with pushing this in, but it highlights a pretty significant problem with the AIA support in KVM. IMO at least two things are necessary:

  • A mode that indicates the vCPU is allowed to be swapped out to a SW file, but must use a real IMSIC file when running.

  • Support for evicting idle vCPUs to SW files.

Anyway, I'll go ahead and merge this, but we should think more about how best to handle this on the KVM side.

atishp04 commented 1 year ago

This change is fine and all, but:

If the host scheduler tries to schedule more vCPUs on a pCPU than the pCPU has IMSIC files, it results in a TVM failure. This happens in a common test case: running two TVMs with 2 vCPUs each.

What exactly is the failure that's happening here? KVM should "unbind" vCPUs (i.e. swap them to SW interrupt files) if the physical CPU is over-subscribed. When there are assigned devices it's a different story, but we're not there yet.

In the long run, the host kernel can't do much about it except notify the VMM with the correct reason, i.e. insufficient IMSIC files. It's up to the VMM to decide whether it wants to show an appropriate error to the user or perform lazy pinning if possible.

I feel like we're eventually going to need something better than "punt it to the VMM". Pinning of vCPUs has generally been an optional optimization, not a requirement for functional VMs.

Currently, it just returns KVM_EXIT_FAIL_ENTRY, which is not indicative of what happened. We probably need a better error code at least.
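
For context, this is roughly what the VMM's run loop sees today. KVM_EXIT_FAIL_ENTRY and the fail_entry payload are part of the standard KVM API; the handling shown is just a sketch:

```c
#include <linux/kvm.h>
#include <stdio.h>

/* Called after ioctl(vcpu_fd, KVM_RUN, 0) returns. */
static void handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_FAIL_ENTRY:
        /* Nothing here distinguishes "out of IMSIC files" from any
         * other entry failure, hence the request for a better error
         * code. */
        fprintf(stderr, "vcpu entry failed, hw reason 0x%llx\n",
                (unsigned long long)
                run->fail_entry.hardware_entry_failure_reason);
        break;
    default:
        break;
    }
}
```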

The problem is not that we don't have enough files; rather, we cannot free an IMSIC file immediately so that another vCPU can take it. An IMSIC file cannot be freed before the rebind flow completes, so for some time a vCPU holds two IMSIC files at once. This window is quite long, so the issue occurs frequently and is easily reproducible.

Yeah. I suspect this (the rebind process time) is also the root cause of the TVM slowness we are observing. That's why we don't see it while pinning vCPUs.

Here is the flow for rebinding in Linux.

  1. Acquire a free file and mark it as taken in the bitmap.
  2. Complete the whole rebinding flow (Begin, Fence, Clone, End).
  3. Mark the previous IMSIC file as free in the bitmap.

Now, for example, say we have two VMs with two vCPUs each on a 2-pCPU machine.

Core0: VM0C0, VM1C0
Core1: VM0C1, VM1C1

We hit the out-of-IMSIC-files issue when the scheduler tries to move both vCPUs of each VM around, as shown below.

Core0: VM0C1, VM1C1
Core1: VM0C0, VM1C0

Now, before VM0C0 and VM1C0 free their previous IMSIC files on Core0, VM0C1 and VM1C1 kick in and look for free IMSIC files there. One of them will get a file, as we have 3 files in total, but the other one will not.

We will also hit the same case for the config below. If the scheduler decides to move VM0C1 to Core0 and VM1C0 to Core1, we end up with the same problem.

Core0: VM0C0, VM1C1
Core1: VM0C1, VM1C0

As you suggested, KVM should move to a SW file instead, and I agree, but AFAIK support for runtime migration between HWACCEL and EMUL is missing in KVM. I think working around this for now is better in terms of getting the RFC out. We can then look into this issue later on.

Just a vague thought: we could also add another bitmap recording which files are still in the freeing process; a vCPU could then wait (with a timeout) for one of those rather than immediately going down the whole SW-file path.

atishp04 commented 1 year ago

As you suggested, KVM should move to a SW file instead, and I agree, but AFAIK support for runtime migration between HWACCEL and EMUL is missing in KVM. I think working around this for now is better in terms of getting the RFC out. We can then look into this issue later on.

Yeah, I'm fine with pushing this in, but it highlights a pretty significant problem with the AIA support in KVM. IMO at least two things are necessary:

  • A mode that indicates the vCPU is allowed to be swapped out to a SW file, but must use a real IMSIC file when running.

Yeah. For normal virtualization, KVM has an AUTO mode which migrates to a SW file if VS-file allocation fails. But it doesn't mandate a HW IMSIC file when running and relies on trap-and-emulate. Currently, it tries to allocate a HW IMSIC file only when the vCPU is migrated to a new pCPU.
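
In other words, something like the sketch below (hypothetical helper and field names; this paraphrases the behavior described above rather than quoting the actual kvm/aia code):

```c
#include <stdbool.h>

struct vcpu_imsic {
    int  pcpu;        /* pCPU this vCPU last ran on */
    int  hw_file;     /* bound HW guest file, or -1 */
    bool use_swfile;  /* trap-and-emulate fallback active */
};

/* Stub standing in for the real per-pCPU file allocator. */
static int alloc_hw_file(int pcpu)
{
    (void)pcpu;
    return -1;        /* pretend the pCPU is out of files */
}

/* AUTO-mode behavior as described: a HW file is (re)requested only
 * when the vCPU lands on a different pCPU, and allocation failure
 * silently falls back to the SW file. */
static void vcpu_load_imsic(struct vcpu_imsic *v, int new_pcpu)
{
    if (v->pcpu == new_pcpu)
        return;                       /* no migration: keep current file */

    v->pcpu = new_pcpu;
    v->hw_file = alloc_hw_file(new_pcpu);
    v->use_swfile = (v->hw_file < 0); /* no re-attempt while running */
}
```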

  • Support for evicting idle vCPUs to SW files.

yeah.

Anyway, I'll go ahead and merge this, but we should think more about how best to handle this on the KVM side.