Good catch @dalehamel - I think that the solution you are proposing is how we want to address this issue. Another possibility could be to release the headers in the image or make a specialized image for google container os users.
Actually, after a bit more digging, I think a better method would be to grab the sources for the specified build id directly from Google storage, as sysdig appears to. You can read the build id from /etc/os-release and craft the URL, e.g.:
```
wget https://storage.googleapis.com/cos-tools/$BUILD_ID/kernel-src.tar.gz
```
This, and setting BPFTRACE_KERNEL_HEADERS appropriately, should do the trick. Unfortunately the sources are quite large (~800 MB decompressed), so maybe a clever exclude filter to tar could strip out unnecessary sources, leaving only headers?
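To illustrate, a rough sketch of what that filter could look like (the destination directory and the tar patterns here are assumptions, not a tested recipe for the COS tarball layout):

```sh
# Read the build id from the running OS and fetch the matching kernel sources.
BUILD_ID="$(. /etc/os-release && echo "$BUILD_ID")"
wget "https://storage.googleapis.com/cos-tools/${BUILD_ID}/kernel-src.tar.gz"

# Extract only header-ish files and build infrastructure, dropping the .c sources
# that make up most of the ~800 MB (patterns are illustrative only).
mkdir -p /usr/src/cos-kernel-headers
tar -xzf kernel-src.tar.gz -C /usr/src/cos-kernel-headers \
    --wildcards '*include*' '*Makefile*' '*Kconfig*' \
    --exclude '*.c'
```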
Another possibility could be to release the headers in the image
The trick would be matching the sources to the image, which is on a rolling release window and not a fixed release schedule like other major distros.
make a specialized image for google container os users.
This probably makes more sense as it's a bit of a niche case. I'll take a look at where the container image is selected and see if I can figure out a sensible way to automatically override the image if container os is detected.
Great @dalehamel - I suggest you take a look at the tracejob and version packages. Since there are rolling releases, I think we probably need to build the specialized images and then have a way to figure out the right headers at runtime and download them. However, covering all the possible cases can become complex. Another thing that would minimize this a bit is to always include headers for minor versions and select them based on the kernel; we won't always have the most accurate headers, but that's better than nothing for the cases where we haven't done a specific implementation, like we are doing now for Google Container OS.
So:
Yeah, generally speaking they shouldn't have to match exactly, as the header APIs should be pretty stable. Targeting specific kernel versions would be enough, yeah. But I think the issue is less that, and more the gamut of trying to support all of the possible versions.
I agree with your fall-through logic; generally speaking, I think the only time you should need to do special handling for Google is if the headers are detected to be missing and the OS is detected to be Google's Container OS.
I think it would probably make sense to have an option to look for a tarball containing acceptable headers from an external source, and to either bake this into the image that's used, or to try to pull it down when booting the tracing container (perhaps both). This allows for caching (the user only needs to maintain docker images that have the supported headers for the OS images they are targeting) while still being flexible (if matching headers aren't available, try to download them instead of failing hard).
Anyways, I'll get some code together, hopefully next week.
Side note... I wish that they would just bundle a tarball with the kernel headers like every other distro, but I can't figure out who to talk to in order to make this happen. I can understand why this is difficult, as the OS is basically meant to be immutable and this is probably an unanticipated use-case.
I 100% agree with your thoughts, looking forward to some code from you. Thanks for offering some effort to help get this done.
I suggest you take a look at the tracejob and version packages.
Thanks for the tips, I see what's going on here now. It looks like the image is hardcoded:
https://github.com/iovisor/kubectl-trace/blob/3e8de867ab6a3a13b9f08dca0469d331253470cb/pkg/tracejob/job.go#L254
https://github.com/iovisor/kubectl-trace/blob/9e635f2c904c18552ccc79f163bad78d81367f84/pkg/version/version.go#L27-L37
So a mechanism to override that from the defaults (I don't see anything that would allow this right now) would allow for this use case to be served with minimal code changes (provided the user can point this to their own vendored image containing the headers). I think I'll try and muck around with that locally, as a basic proof of concept first so that I can at least get this working with a custom image.
Since this logic is run from the user's machine, it's going to be a bit difficult to dynamically determine what container should be run :thinking: Perhaps there can be some hints if we inspect the Kubernetes node object, but I'm not sure how reliable that would be.
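For what it's worth, the node object does expose some hints in its status; a quick example of what the client could read (node name is a placeholder):

```sh
# nodeInfo reports the OS image and kernel version, which could hint that the
# node is running Container-Optimized OS.
kubectl get node <node-name> \
  -o jsonpath='{.status.nodeInfo.osImage}{"\n"}{.status.nodeInfo.kernelVersion}{"\n"}'
```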
Aside from that, an on-boot script in the image itself should have everything it would need to know; it would just need to add /etc/os-release as a read-only mount.
It looks like kubectl-trace currently assumes that your headers are good to go at /lib/modules, but perhaps a more clever way to approach this would be to put them somewhere else and use BPFTRACE_KERNEL_HEADERS to override the location within the container. This seems like an appropriate use case for an initContainer, which could boot up quickly to check if the headers are good to go, and otherwise download / mount them somewhere that'll be accessible to the actual trace container.
Before I go too wild on speculation though, I'll tinker a bit and get an MVP working, and worry about the right way to do this later.
Thanks for pinging me on this issue. I wasn't aware that this project is now part of iovisor. Glad to see this effort is moving along!
Having the headers (not necessarily the full source tree, as it is large) baked into special GKE images would be a great user experience: no additional start-up time waiting for headers/source to download. Some complications with this are:

- GKE doesn't require the use of container-os, you can specify alternative VM images to use
- You don't really know what version of container-os you are running until you are running on the Node.
One idea I had was to bake the headers for all the container-os kernel versions supported by GKE into a single image and then just have an entrypoint script set the right value for BPFTRACE_KERNEL_SOURCE based on uname. This is a moving target, so it would have to be automated in a CI system somewhere to stay current with the images available for GKE. Also, this puts an additional cost on the user, who now has to download headers for kernel versions they are not using. Hopefully this would be compressed efficiently by docker so it wouldn't be much of an issue. Also, this image would be cached on the node, so you would only have to download the image once per node.
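A minimal sketch of that entrypoint idea (the baked-in header layout at /usr/src/kernels/&lt;release&gt; is an assumption):

```sh
#!/bin/sh
# Pick the pre-baked headers that match the running kernel, if present, and
# point bpftrace at them before handing off to the real command.
KERNEL_RELEASE="$(uname -r)"
if [ -d "/usr/src/kernels/${KERNEL_RELEASE}" ]; then
    export BPFTRACE_KERNEL_SOURCE="/usr/src/kernels/${KERNEL_RELEASE}"
fi
exec "$@"
```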
An initContainer that downloads the headers is another solution I've considered. It has the downside of forcing the user to download these headers every time the Pod is scheduled. I feel this would be a fairly high cost to the user experience.
Another solution would be to provide a DaemonSet that would run a pod on all nodes to do the work of downloading the headers once for each node. Then the pods that are created for the actual jobs would just volume mount the kernel headers from the host. This might be tricky as, from what I remember, most of the host's filesystem is mounted readonly. Optionally, you could provide an initContainer that blocks until the headers are downloaded and ready.
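Such a blocking initContainer could be as simple as the following sketch (the marker path is just an example; the DaemonSet would be responsible for creating it once the headers are in place):

```sh
# Wait for the header-downloading DaemonSet to signal that it is finished.
until [ -e /mnt/host/kernel-headers/.ready ]; do
  echo "waiting for kernel headers to be prepared by the DaemonSet..."
  sleep 5
done
```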
bake the headers for all the container-os kernel versions supported by GKE into a single image
That could work; depending on the size, the first pull would be slow, but after that it shouldn't be a problem and should have decent performance. We'd just need a way of tracking possible GKE releases for the image build / maintaining a list of targets.
GKE doesn't require the use of container-os, you can specify alternative VM images to use
Yeah, there is an ubuntu flavor as well. I imagine the headers are much easier to fetch for that. You raise a good point that the solution for GKE should be generalized enough to handle whatever flavors of OS image are available (my most recent knowledge was an optimized ubuntu, as well as container OS).
I'm going to focus on container OS, as that's the problem it is more practical for my use-case to implement a fix for here, and I think the ubuntu implementation will be pretty easy to add, as it should just be able to mount in the headers from the host via the /lib/modules hostmount (it may well work already).
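For reference, on the Ubuntu flavor the host should already expose headers through the usual symlink, so the existing hostmount may be enough. A quick check, assuming standard Ubuntu kernel packaging:

```sh
# Ubuntu's linux-headers package installs under /usr/src, and
# /lib/modules/$(uname -r)/build points at it, so mounting /lib/modules from
# the host should give bpftrace what it needs.
ls -l "/lib/modules/$(uname -r)/build"
```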
You don't really know what version of container-os you are running until you are running on the Node.
I can think of ways to get it from where kubectl is running, but none that are very elegant. One way would be to get this into the node metadata, such as with an on-boot daemonset that writes the build id after reading it from /etc/os-release, e.g.:
```
$ cat /etc/os-release
BUILD_ID=10895.91.0
NAME="Container-Optimized OS"
KERNEL_COMMIT_ID=ff03fe06c0fc35868c5ada1306e9471a48bec9c3
...
```
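A hypothetical sketch of what such a daemonset pod could run (assuming kubectl is available in the pod, the host's /etc/os-release is mounted at /host/etc/os-release, and NODE_NAME is injected via the downward API):

```sh
# Read the COS build id from the host and record it as a label on this node.
BUILD_ID="$(. /host/etc/os-release && echo "$BUILD_ID")"
kubectl label node "$NODE_NAME" --overwrite "cos-build-id=${BUILD_ID}"
```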
However, I'm not certain that the client (probably dev on their workstation with kubectl) actually needs to have this information. It's probably best to keep the client logic as simple as possible, and push this out to the tracing container, where it's trivial to fetch the build id.
An initContainer that downloads the headers is another solution I've considered. It has the downside of forcing the user to download these headers every time the Pod is scheduled. I feel this would be a fairly high cost to the user experience.
I think that an initContainer solution is probably the simplest, and it can be optimized so that it's still a good experience. In the environment I am working in, I already have a CI pipeline for chromiumos that generates kernel headers from the gentoo ebuild that google uses for container os, so I can just fetch a small archive with the stripped headers from a private GCS bucket.
That's not really feasible for open source, though. I think a more reasonable approach is to do something similar to what sysdig does in an init container: detect the build ID and download the sources on the fly. See https://github.com/draios/sysdig/issues/1061, which gives a decent overview of how this can be done.
This might be tricky as, from what I remember, most of the host's filesystem is mounted readonly.
At least in our environment, this is pretty easy. There is a writeable partition at /mnt/stateful_partition, where these could reside. Just picking a standardized folder, such as:
/mnt/stateful_partition/kernel_headers/$BUILD_ID
The initContainer would then only need to do this once per machine, with all other uses of the tracer on the same node being able to use these cached headers by sharing this read-only mount. The initContainer could just bail quickly (and use the same image as the target container) if the sources have already been downloaded and installed correctly.
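A minimal sketch of that bail-early check (the marker-file name is an assumption, and /etc/os-release here is assumed to be the host's, mounted into the initContainer):

```sh
BUILD_ID="$(. /etc/os-release && echo "$BUILD_ID")"
HEADERS_DIR="/mnt/stateful_partition/kernel_headers/${BUILD_ID}"

# If a previous run already installed headers for this build, bail out immediately.
if [ -e "${HEADERS_DIR}/.installed" ]; then
  echo "kernel headers for build ${BUILD_ID} already present, nothing to do"
  exit 0
fi

# ...otherwise fetch kernel-src.tar.gz for this build, extract the headers into
# ${HEADERS_DIR}, and create ${HEADERS_DIR}/.installed on success.
```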
If this approach seems sane to y'all I'll start implementing the script for an initContainer, and inject it into the spec. The initContainer could probably be generally useful as an entrypoint for ensuring that the correct headers are installed, and extended to help fix quirks for other platforms if necessary.
Based on the sysdig work and @jasonkeene's script, I've cooked up this script, which will ensure that the headers are installed on container OS:
https://gist.github.com/dalehamel/15efcf02e6fd7999b5a151cd2b5702e8
On the first run, it takes 24 seconds to fetch, extract, and generate the necessary headers to the correct location. After this, if the script is run again it returns nearly instantly thanks to the dotfile checks.
So basically, this (well, a cleaned-up version with a switch statement to handle different distros, probably) would serve as the initContainer's main script, which would just need to be configured with a path to install the headers to, and a mount for /etc/lsb-release from the host. The first time a machine attempts to trace something, this will install the headers on the machine, to be shared by all future tracers.
All the trace container needs to do is mount in the directory prepared by the initContainer and everything should just work :tm:
@fntlnz do you have any thoughts on the initContainer approach? IMO this is the cleanest separation of concerns. The impacts to kubectl-trace at large are: BCC_KERNEL_SOURCE and BPFTRACE_KERNEL_SOURCE env vars for where the headers should be loaded from, with the default being what is done presently. This might actually knock out two issues at once, as this is quite similar to https://github.com/iovisor/kubectl-trace/issues/7

I'm going to go ahead, assuming the following:
Here's my work in progress PR https://github.com/iovisor/kubectl-trace/pull/48 that covers the gist of the initContainer approach I discussed above.
I haven't got it working end-to-end yet, as I've run into #7 being a problem, so I'll need to figure out a way to prove that this works by hackily overriding the header location; then we can figure out what the correct way to override the header location should be before the PR can be merged.
@dalehamel I love the initContainer approach, it is very well thought out. I looked at your PR, and having to find the headers doesn't seem like a big problem since they are cached after the first time.
Also, do you think it would be a good idea to ignore the initContainer if the user explicitly sets BPFTRACE_KERNEL_SOURCE, after implementing #7?
All looks good to me as of now. I see there's still something to do, so let me know if I can help with anything, and thanks for finding and implementing a solution.
Also, do you think it would be a good idea to ignore the initContainer if the user explicitly sets BPFTRACE_KERNEL_SOURCE, after implementing #7?
Yeah, I think depending on if / how #7 is implemented, we can probably ignore the initContainer by just not adding the necessary stanzas to the job spec. However, I think that's something we should take to a discussion in #7, to avoid making the existing PR #48 any more complex.
All looks good to me as of now. I see there's still something to do, so let me know if I can help with anything, and thanks for finding and implementing a solution.
As I mentioned in the PR, just some QA would be great to make sure I haven't broken functionality for users who already had working headers mountable from the host. I haven't been able to test that codepath yet, but it should just work :tm:
Aside from that, this should be pretty much ready to be merged; it works as I expect it to, and I think it is general enough that the initContainer can be used for other systems that might have problems finding/loading headers in the future.
It looks like the image used by the tracer doesn't support GKE, as it doesn't have the necessary headers. I get the following output when I try to run the trace:
Looking at the image build, it's not surprising that this is the result. @jasonkeene's towel tool, which is quite a similar tool, solves the problem of fetching google's headers in this way: https://github.com/jasonkeene/towel/blob/master/docker/download-chromium-os-kernel-source
Perhaps some logic could be added to check the host OS version and, if it is google's container os, download the headers using a similar methodology?