Can I be pointed in the right direction for getting Docker running on nilrt?

Greg-Freeman commented 11 months ago

I have gone through all the steps to rebuild the kernel, make sure all the correct options that Docker requires are enabled, mounted the cgroups but when I try to start up Docker I get the following error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to apply cgroup configuration: openat2 /sys/fs/cgroup/cpuset/docker/cpuset.cpus: no such file or directory: unknown. ERRO[0001] error waiting for container:

This is eerily similar to issues I've seen with running docker on Fedora when searching around. But all the solutions for this require setting systemd options such as systemd.unified_cgroup_hierarchy which doesn't exist on the Rio and can't be installed using opkg as far as I can tell (and may not even be supported on the rio anyways).

It looks like prefixes are missing on the files that docker expects and I haven't been able to figure out how to get those back. I don't know at what level this is handled when mounting, so I can't even begin to guess what I'd need to do to change it.

I know this is probably a bit out of scope of this repo, but I also assume others may want to go this direction for simulating rios. So anything that can be done to point me in the right direction would be helpful.

I am on branch 23.5/5.15

gratian commented 11 months ago

Summary

The 23.5 kernel should have all the required Docker options enabled and should not require rebuilding, see PR #115
You are running into an incompatibility between LabVIEW RT, using the old style cgroup v1 mount points without prefixes, and Docker which assumes new style cgroup mount points (see Docker issue #689).
We are working on making Docker an officially supported feature. It is available for testing starting with the 23.8 release if you use programming environments other than LabVIEW RT.
Work to upgrade LabVIEW RT to cgroups v2 (and resolve the incompatibility) is on the schedule for the end of 2024. In the meantime there are some workarounds listed below that may work for you depending on your use case.

Details

LabVIEW RT uses cgroups v1, specifically the cpuset and cpuacct cgroup controllers, to implement some of the RT SMP CPU Utilities VIs: RT Get CPU Loads.vi, RT Set CPU Pool.vi, RT Set CPU Pool Assignments.vi, and RT Set CPU Pool Sizes.vi. It is also used when assigning timed structures to a CPU pool or if you want to manually set the CPU affinity for the ScanEngine I/O_Scan thread.

The init scripts used to mount and create the cgroup hierarchies are:

/etc/rc5.d/S01nicreatecpusets (symbolic link to /etc/init.d/nicreatecpusets)
/etc/rc5.d/S02nicreatecpuacctgroups (symbolic link to /etc/init.d/nicreatecpuacctgroups)

The current implementation is based on cgroups v1 and uses mount points with the noprefix option set. For the cpuset cgroup v1 controller mounted by nicreatecpusets this creates an incompatibility with Docker (among other things).

We are working on making Docker an officially supported feature for NI Linux RT:

Required kernel options have been enabled by PR #115 and shipped in 23.5.
For use-cases where LabVIEW RT is not installed (i.e. you select "Other (Select to use C, C++, Python, etc.)" at install time) support for Docker has been enabled starting with 23.8, see PR #612.
Work has been scheduled to resolve the incompatibility and upgrade LabVIEW RT to cgroups v2 in 2024 (requires more involved LabVIEW source code changes hence the timeline).

Workarounds

For versions 23.5 and earlier if you do not use the RT SMP CPU Utility VIs you can remove the /etc/rc5.d/S01nicreatecpusets file on target. This will leave the cpuset cgroup controller under OS control. LabVIEW RT will print some warnings but for the most part will work OK (if you don't use those SMP VIs).
For versions starting with 23.8:
- If LabVIEW RT is not installed Docker should be installable from package feeds and work out of the box.
- If LabVIEW RT is installed you can force cgroups to be under OS control by setting LVRT_CGROUP_VERSION=0 in /etc/default/lvrt-cgroup. Same caveats apply as for 1.

Greg-Freeman commented 11 months ago

Thanks for the incredibly thorough response. It's really appreciated.

So, you are right. There is no need to rebuild the Kernel. I was doing that when I was working with LabVIEW 2018 but realized I had since bumped it up to 23.1. Right now I reverted back to the state where LVRT is installed and it prints "Welcome to LabVIEW Real-Time 23.1f272.

I went ahead and renamed the S01nicreatecpusets file, but one thing I noticed is that I still have the issue with docker and the docker error is referencing /sys/fs/cgroup/cpuset/docker/cpuset.cpus.

But it looks like everything related to the S01nicreatecpusets file mounts in /dev/cgroup/cpusets.

Admittedly I don't know enough to know what overlaps may exist between /sys/fs/cgroup and /dev/cgroup but unfortunately moving the S01nicreatecpusets file did not resolve the issue.

gratian commented 11 months ago

Can you try installing cgroup-lite from package feeds and reboot? (i.e. opkg update and opkg install cgroup-lite). That package should have the init scripts required to mount the /sys/fs/cgroup hierarchy. It should have been installed as a dependency to the docker package but maybe something went wrong during install.

Greg-Freeman commented 11 months ago

I goofed. I wasn't thinking. Instead of removing the symbolic link I just renamed it to .bak. But of course this doesn't help since the link still exists pointing at the real file that's still there. Deleting the link worked, as did renaming the raw file in etc/init.d

gratian commented 11 months ago

Yeah, adding a suffix like .bak won't work because SysV init looks for files starting with S or K and doesn't care about the extension. You can rename the symlink to skip it by prefixing it with an underscore or something (really anything other than S or K).

If things are working now, do you mind closing this issue?

Greg-Freeman commented 11 months ago

If things are working now, do you mind closing this issue?

Sure thing.

For what it's worth, for what we're attempting to do, I needed to use MACVLAN. We have layer2 communication that's being used from 30 or so years ago that hasn't been refactored out yet. So that did require a rebuild of the kernel with that option enabled, but it was pretty straight forward with the instructions provided in the other repo..

Once I had that, tshark saw all the layer2 data coming though no problem.

Not that it much matters to you guys, but I am putting it here for completeness sake.

Greg-Freeman commented 11 months ago

After getting this to work on a high level, I'm curious, is it doable to run lvrt inside a docker container? Or is that a fools errand? I tried my best to create a docker image from a fresh LVRT install on a rio, then kick off some of the init scripts. It kinda-sorta gets started. But I don't know if it's useable in any meaningful way.

To provide some context, we have a system that interfaces with multiple different software applications and has up to 8 rios. We'd love for developers to be able to simulate the full system for their independent development without requiring a full hardware setup. We are able to create a VM from the RT recovery ISO on a hyper-v VM. But standing up 8 docker containers would be way more manageable than 8 VMs.

gratian commented 11 months ago

@Greg-Freeman having LV RT inside a docker container is a customer requested feature we are looking into.

At this time I would say it is going to be complicated to hack together. We are still working on getting bridge networking to work correctly and out of the box with docker on NI Linux RT and you're going to need that in order to re-map the network ports used by LV RT (in order to run more than one LVRT instance).

Greg-Freeman commented 11 months ago

@Greg-Freeman having LV RT inside a docker container is a customer requested feature we are looking into.

At this time I would say it is going to be complicated to hack together. We are still working on getting bridge networking to work correctly and out of the box with docker on NI Linux RT and you're going to need that in order to re-map the network ports used by LV RT (in order to run more than one LVRT instance).

Understood. I actually got it working and I am able to get the network to bridge with my wifi and DHCP as eth0, but if I configure a static IP through NI MAX things go a bit haywire. I'll play around with it a bit more, but since we are reading layer 2 data, using a bridged network to our VM + macvlan, some of the traditional docker "best practices" have certainly been thrown out the window. I'd agree, it's definitely hacked together, but I'm basically at the point where if it works for our purposes that's good enough for me. Right now I have spun up 3 docker containers, each running an RT EXE that prints to the console window. The deployment is definitely a hack job, but it may be good enough to get us over the hump for the next year.

All that said, the best way to describe this is "we'll see"

gratian commented 11 months ago

Awesome.

ni / linux