nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.

Run VMs in container without mounting device #427

Closed: leojonathanoh closed this issue 2 years ago

leojonathanoh commented 2 years ago

I have been thinking of using sysbox for all my gitlab-runner ci jobs for security reasons. In particular, my packer repos, which require qemu or virtualbox, will require docker run --device, which may be configured at the runner level https://github.com/nestybox/sysbox/issues/206#issuecomment-821091094 or at the job level https://github.com/nestybox/sysbox/issues/206#issuecomment-952242750. However, this requires setting up the gitlab-runner container with --device /dev/<hypervisor_device>.

Does sysbox have the ability to run a hypervisor in a container without needing to bind mount the host's hypervisor device i.e. /dev/vboxdrv for virtualbox and /dev/kvm for kvm?

ctalledo commented 2 years ago

Hi @leojonathanoh, thanks for considering using Sysbox, we hope you find it very useful to secure your CI jobs.

> Does sysbox have the ability to run a hypervisor in a container without needing to bind mount the host's hypervisor device i.e. /dev/vboxdrv for virtualbox and /dev/kvm for kvm?

Not yet, but it's something we are planning to definitely support.

The reason this does not work yet is that Sysbox secures containers with the Linux user-namespace but this causes devices that you pass from the host to show up inside the container with invalid permissions (e.g., nobody:nogroup).

Sysbox uses techniques (e.g., shiftfs) to work-around it but we've not yet applied or tested those with /dev/kvm for example. We would also need to ensure this does not break isolation of the container.
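As a rough illustration of the symptom described above, the following sketch checks whether a device node's owner is unmapped in the current user namespace. The helper names `owner_ids` and `is_unmapped` are hypothetical (they are not part of Sysbox or Docker), and the sketch assumes a `stat` that supports `-c`, such as the GNU coreutils or BusyBox one.

```shell
# Hypothetical helpers, not part of Sysbox or Docker.
# Assumes stat(1) supports -c (GNU coreutils, BusyBox).

# Print the numeric owner and group of a path as "uid:gid".
owner_ids() {
    stat -c '%u:%g' "$1"
}

# Succeed if the path's owner or group equals the kernel's overflow ID.
# IDs that have no mapping in the current user namespace are reported as
# the overflow ID (usually 65534, shown as nobody:nogroup).
# Caveat: a file genuinely owned by that ID matches too.
# Returns 2 if the path cannot be inspected.
is_unmapped() {
    overflow_uid=$(cat /proc/sys/kernel/overflowuid 2>/dev/null || echo 65534)
    overflow_gid=$(cat /proc/sys/kernel/overflowgid 2>/dev/null || echo 65534)
    ids=$(owner_ids "$1") || return 2
    [ "${ids%%:*}" = "$overflow_uid" ] || [ "${ids##*:}" = "$overflow_gid" ]
}

# Example (run inside the container, assuming /dev/kvm was passed in):
#   is_unmapped /dev/kvm && echo "/dev/kvm owner is not mapped in this namespace"
```

This only reports the ownership symptom; it says nothing about whether exposing the device is safe from an isolation standpoint.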

Curious to hear your use case for running a VM inside the container (since it's often the other way around :) ).

leojonathanoh commented 2 years ago

@ctalledo My general idea of sysbox is that it comes close to making containers replace VMs (or as much as possible). Based on that idea, every "machine" that is currently a VM should be treated as far as possible as a container.

> Not yet, but it's something we are planning to definitely support.
>
> The reason this does not work yet is that Sysbox secures containers with the Linux user-namespace but this causes devices that you pass from the host to show up inside the container with invalid permissions (e.g., nobody:nogroup).
>
> Sysbox uses techniques (e.g., shiftfs) to work-around it but we've not yet applied or tested those with /dev/kvm for example. We would also need to ensure this does not break isolation of the container.

This is true; VM-like isolation is the main feature of sysbox. I have been wondering whether it's possible for a sysbox host machine to make its active hypervisor available to its sysbox containers, using the same mechanism (or an equivalent) by which a non-sysbox host system makes its active hypervisor available to its VMs through nested virtualization. (I don't have an understanding of the exact mechanism of how nested virtualization is made available to a VM, so i don't really understand if there's any security implications with current implementation of using a bind mount of /dev/kvm, via docker run --volume or docker run --device, that make it less secure than traditional VM nested virtualization.)

> Curious to hear your use case for running a VM inside the container (since it's often the other way around :) ).

In my particular use case, i use packer to build VM images in a ci job. Since my ci jobs run in containers, and i am building kvm or virtualbox images, i need a VM in a container.

So if sysbox had a way to "natively" make hypervisors run in a container, running a "VM-like" container would simply be docker run --runtime sysbox-runc without --device (or --volume for /dev/kvm). I'm pretty sure that there are more secure ways of bind-mounting /dev/kvm (which I'm not aware of right now); if there aren't, I'd think the runtime (sysbox in this case) might be able to do the bind mount automatically. But if there are indeed more secure ways, then sysbox could use those mechanisms.

ctalledo commented 2 years ago

Hi @leojonathanoh, thank you very much for the thoughtful response.

> My general idea of sysbox is that it comes close to making containers replace VMs (or as much as possible). Based on that idea, every "machine" that is currently a VM should be treated as far as possible as a container.

Correct; ideally any workload that runs in a VM would run in a Sysbox container, securely. That includes running VMs. We are working towards this goal.

> so i don't really understand if there's any security implications with current implementation of using a bind mount of /dev/kvm

Yes that's exactly what we would need to look at. I am pretty sure /dev/kvm would need to be exposed inside the container, so we need to figure out what that means from a container isolation perspective, and whether there is anything extra Sysbox needs to do to ensure container isolation is not affected.

> In my particular use case, i use packer to build VM images in a ci job. Since my ci jobs run in containers, and i am building kvm or virtualbox images, i need a VM in a container.

Follow up question: how are you doing this right now (i.e., without Sysbox)? Are you using privileged containers, or regular containers with a mount of /dev/kvm?

leojonathanoh commented 2 years ago

> Correct; ideally any workload that runs in a VM would run in a Sysbox container, securely. That includes running VMs. We are working towards this goal.
>
> Yes that's exactly what we would need to look at. I am pretty sure /dev/kvm would need to be exposed inside the container, so we need to figure out what that means from a container isolation perspective, and whether there is anything extra Sysbox needs to do to ensure container isolation is not affected.

Glad to know the project's direction. I really look forward to replacing as many of my systems as possible with ephemeral containers, and it'll be nice if sysbox could do that.

> Follow up question: how are you doing this right now (i.e., without Sysbox)? Are you using privileged containers, or regular containers with a mount of /dev/kvm?

Right now, I haven't had time to try kvm yet (will be working on it over the next few weeks), but I've had success with virtualbox VM builds. It does not require --privileged, and works without --runtime sysbox-runc:

EDIT: fixed typo.

# Error
$ docker run --rm -it \
    --entrypoint "" \
    theohbrothers/docker-packer:20211103.0.0-1.7.7-virtualbox-ubuntu-20.04 bash
root@4aa4415357e1:/# vboxmanage | head -n5
WARNING: The character device /dev/vboxdrv does not exist.
         Please install the virtualbox-dkms package and the appropriate
         headers, most likely linux-headers-generic.

         You will not be able to start VMs until this problem is fixed.

# Success
$ docker run --rm -it \
    --device /dev/vboxdrv:/dev/vboxdrv \
    --entrypoint "" \
    theohbrothers/docker-packer:20211103.0.0-1.7.7-virtualbox-ubuntu-20.04 bash
root@4ed5a3279b5c:/# vboxmanage | head -n5 
root@4ed5a3279b5c:/# packer build ...

I only need --privileged when i need to use mount command to do some .iso or .vhd mounting in the container:

# Error
$ docker run --rm -it \
    --device /dev/vboxdrv:/dev/vboxdrv \
    --entrypoint "" \
    theohbrothers/docker-packer:20211103.0.0-1.7.7-virtualbox-ubuntu-20.04 bash
root@44cb27458165:/# mkdir -p a b
root@44cb27458165:/# mount -o bind a b
mount: /b: permission denied.

# Success
$ docker run --rm -it \
    --device /dev/vboxdrv:/dev/vboxdrv \
    --privileged \
    --entrypoint "" \
    theohbrothers/docker-packer:20211103.0.0-1.7.7-virtualbox-ubuntu-20.04 bash
root@HOST:/# mkdir -p a b
root@HOST:/# mount -o bind a b

Also, I think i was a little misleading in my response https://github.com/nestybox/sysbox/issues/427#issuecomment-960062805: --device should work for /dev/kvm, and is preferred to bind mounts, because i believe effectively --device is really just a bind mount with some device-specific config syntax options.
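To make the --device usage concrete, here is a minimal sketch that emits a --device flag only for hypervisor devices that actually exist on the host. The function name `device_flags` is hypothetical; `--device SRC:DST` itself is the standard docker run flag used in the commands above.

```shell
# Hypothetical helper: print one "--device SRC:DST" flag per argument that
# is an existing character device; silently skip everything else.
device_flags() {
    for dev in "$@"; do
        if [ -c "$dev" ]; then
            printf '%s\n' "--device $dev:$dev"
        fi
    done
}

# Example (image name elided; the substitution is left unquoted on purpose
# so that each flag and its value become separate arguments):
#   docker run --rm -it $(device_flags /dev/kvm /dev/vboxdrv) --entrypoint "" <image> bash
```

On a host that has only /dev/vboxdrv, this would produce the same --device flag as in the "Success" command above, and nothing for /dev/kvm.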

ctalledo commented 2 years ago

Thanks, I understand.

> --device should work for /dev/kvm, and is preferred to bind mounts, because i believe effectively --device is really just a bind mount with some device-specific config syntax options.

That's correct.

> I haven't had time to try kvm yet (will be working on it over the next few weeks),

Cool, let me know what you find with KVM.

I don't have a lot of cycles right now (unfortunately), but I'll take a look in the coming days to see if there is any low-hanging fruit that would allow Sysbox to show the device with the proper permissions & isolation inside the container.

leojonathanoh commented 2 years ago

I've begun some work on kvm, by attempting to mount a .vmdk inside the container using guestmount, which creates a kvm VM under the hood and uses fusermount. However, the fusermount fails even when using --cap-add SYS_ADMIN and --device /dev/fuse. The only way that worked was to use --privileged. This was without --runtime sysbox-runc:

docker run --rm -it \
    --device /dev/fuse \
    -v my.vmdk:/my.vmdk \
    --entrypoint "" \
    theohbrothers/docker-packer:master-93a19e2-1.7.7-virtualbox-ubuntu-20.04 bash
root@7f8cd6dbd214:/# mkdir -p /raw
root@7f8cd6dbd214:/# guestmount -a /my.vmdk -m /dev/sda1 --ro raw/

fusermount: mount failed: Operation not permitted
libguestfs: error: fuse_mount failed: raw/, see error messages above

With the linux-image-generic package installed in the docker image, it appears kvm does work without needing --device /dev/kvm. The inability to mount a fuse filesystem is really a docker issue: https://github.com/docker/for-linux/issues/321

ctalledo commented 2 years ago

Re-reading this issue:

> Does sysbox have the ability to run a hypervisor in a container without needing to bind mount the host's hypervisor device

No, the host's KVM device would always need to be mounted into the Sysbox container; otherwise the container won't have the means to run the VM inside of it.

However, once the KVM device is exposed inside the container then it should be possible to launch a VM from within a Sysbox container (inverting the traditional way of running infrastructure).
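As a small sketch of what "exposed inside the container" could be checked against before launching a VM, the following tests that the device node exists as a character device and is readable and writable by the current user. The function name `kvm_usable` is hypothetical, and the check is only a precondition: it does not prove that a VM will actually start.

```shell
# Hypothetical pre-flight check, meant to run inside the container.
# Succeeds only if the given path (default /dev/kvm) is a character device
# that the current user can both read and write.
kvm_usable() {
    dev=${1:-/dev/kvm}
    [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}

# Example:
#   kvm_usable || echo "/dev/kvm is missing or not accessible in this container" >&2
```

Given the permission issue discussed earlier in this thread (devices appearing as nobody:nogroup under the user namespace), this check may fail in a Sysbox container even when the device was passed with --device.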

Closing this issue since it's not possible to run the VM inside the Sysbox container without exposing the host's KVM device (or similar) into it.