Good catch @dalehamel - I think that the solution you are proposing is how we want to address this issue. Another possibility could be to release the headers in the image or make a specialized image for google container os users.
Actually, after a bit more digging, I think a better method would be to grab the sources for the specified build id directly from Google storage, as sysdig appears to. You can read the build id from /etc/os-release and craft the URL, e.g.:
```
wget https://storage.googleapis.com/cos-tools/$BUILD_ID/kernel-src.tar.gz
```
This, and setting BPFTRACE_KERNEL_HEADERS appropriately, should do the trick. Unfortunately the sources are quite large (~800 MB decompressed), so maybe a clever exclude filter to tar could strip out unnecessary sources, leaving only headers?
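To illustrate, a rough sketch of what that filter could look like (the destination directory and the tar patterns here are assumptions, not a tested recipe for the COS tarball layout):

```sh
# Read the build id from the running OS and fetch the matching kernel sources.
BUILD_ID="$(. /etc/os-release && echo "$BUILD_ID")"
wget "https://storage.googleapis.com/cos-tools/${BUILD_ID}/kernel-src.tar.gz"

# Extract only header-ish files and build infrastructure, dropping the .c sources
# that make up most of the ~800 MB (patterns are illustrative only).
mkdir -p /usr/src/cos-kernel-headers
tar -xzf kernel-src.tar.gz -C /usr/src/cos-kernel-headers \
    --wildcards '*include*' '*Makefile*' '*Kconfig*' \
    --exclude '*.c'
```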
Another possibility could be to release the headers in the image
The trick would be matching the sources to the image, which is on a rolling release window and not a fixed release schedule like other major distros.
make a specialized image for google container os users.
This probably makes more sense as it's a bit of a niche case. I'll take a look at where the container image is selected and see if I can figure out a sensible way to automatically override the image if container os is detected.
Great @dalehamel - I suggest you take a look at the tracejob and version packages. Since there are rolling releases, I think we probably need to build the specialized images and then have a way to figure out the right headers at runtime and download them. However, covering all the possible cases can become complex. Another thing that would minimize this a bit is to always include headers for minor versions and select them based on the kernel; we won't always have the most accurate headers, but that's better than nothing for the cases where we haven't done a specific implementation, like we are doing now for Google Container OS.
So:
Yeah, generally speaking they shouldn't have to match exactly, as the header APIs should be pretty stable. Targeting specific kernel versions would be enough, yeah. But I think the issue is less that, and more the gamut of trying to support all of the possible versions.
I agree with your fall-through logic; generally speaking, I think the only time you should need to do special handling for Google is if the headers are detected to be missing and the OS is detected to be Google's Container OS.
I think it would probably make sense to have an option to look for a tarball containing acceptable headers from an external source, and to either bake this into the image that's used, or to try to pull it down when booting the tracing container (perhaps both). This allows for caching (the user only needs to maintain docker images that have the supported headers for the OS images they are targeting) while still being flexible (if matching headers aren't available, try to download them instead of failing hard).
Anyways, I'll get some code together, hopefully next week.
Side note... I wish that they would just bundle a tarball with the kernel headers like every other distro, but I can't figure out who to talk to in order to make this happen. I can understand why this is difficult, as the OS is basically meant to be immutable and this is probably an unanticipated use-case.
I 100% agree with your thoughts, looking forward to some code from you. Thanks for offering some effort to help get this done.
I suggest you take a look at the tracejob and version packages.
Thanks for the tips, I see what's going on here now. It looks like the image is hardcoded:
https://github.com/iovisor/kubectl-trace/blob/3e8de867ab6a3a13b9f08dca0469d331253470cb/pkg/tracejob/job.go#L254
https://github.com/iovisor/kubectl-trace/blob/9e635f2c904c18552ccc79f163bad78d81367f84/pkg/version/version.go#L27-L37
So a mechanism to override that from the defaults (I don't see anything that would allow this right now) would allow for this use case to be served with minimal code changes (provided the user can point this to their own vendored image containing the headers). I think I'll try and muck around with that locally, as a basic proof of concept first so that I can at least get this working with a custom image.
Since this logic is run from the user's machine, it's going to be a bit difficult to dynamically determine what container should be run :thinking: Perhaps there can be some hints if we inspect the Kubernetes node object, but I'm not sure how reliable that would be.
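For what it's worth, the node object does expose some hints in its status; a quick example of what the client could read (node name is a placeholder):

```sh
# nodeInfo reports the OS image and kernel version, which could hint that the
# node is running Container-Optimized OS.
kubectl get node <node-name> \
  -o jsonpath='{.status.nodeInfo.osImage}{"\n"}{.status.nodeInfo.kernelVersion}{"\n"}'
```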
Aside from that, an on-boot script in the image itself should have everything it would need to know; it would just need to add /etc/os-release as a read-only mount.
It looks like kubectl-trace currently assumes that your headers are good to go at /lib/modules, but perhaps a more clever way to approach this would be to put them somewhere else and use BPFTRACE_KERNEL_HEADERS to override the location within the container. This seems like an appropriate use case for an initContainer, which could boot up quickly to check if the headers are good to go, and otherwise download / mount them somewhere that'll be accessible to the actual trace container.
Before I go too wild on speculation though, I'll tinker a bit and get an MVP working, and worry about the right way to do this later.
Thanks for pinging me on this issue. I wasn't aware that this project is now part of iovisor. Glad to see this effort is moving along!
Having the headers (not necessarily the full source tree, as it is large) baked into special GKE images would be a great user experience: no additional start-up time waiting for headers/source to download. Some complications with this are:

- GKE doesn't require the use of container-os, you can specify alternative VM images to use
- You don't really know what version of container-os you are running until you are running on the Node.
One idea I had was to bake the headers for all the container-os kernel versions supported by GKE into a single image and then just have an entrypoint script set the right value for BPFTRACE_KERNEL_SOURCE based on uname. This is a moving target, so it would have to be automated in a CI system somewhere to stay current with the images available for GKE. Also, this puts an additional cost on the user, who now has to download headers for kernel versions they are not using. Hopefully this would be compressed efficiently by docker so it wouldn't be much of an issue. Also, this image would be cached on the node, so you would only have to download the image once per node.
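A minimal sketch of that entrypoint idea (the baked-in header layout at /usr/src/kernels/&lt;release&gt; is an assumption):

```sh
#!/bin/sh
# Pick the pre-baked headers that match the running kernel, if present, and
# point bpftrace at them before handing off to the real command.
KERNEL_RELEASE="$(uname -r)"
if [ -d "/usr/src/kernels/${KERNEL_RELEASE}" ]; then
    export BPFTRACE_KERNEL_SOURCE="/usr/src/kernels/${KERNEL_RELEASE}"
fi
exec "$@"
```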
An initContainer that downloads the headers is another solution I've considered. It has the downside of forcing the user to download these headers every time the Pod is scheduled. I feel this would be a fairly high cost to the user experience.
Another solution would be to provide a DaemonSet that would run a pod on all nodes to do the work of downloading the headers once for each node. Then the pods that are created for the actual jobs would just volume mount the kernel headers from the host. This might be tricky as, from what I remember, most of the host's filesystem is mounted readonly. Optionally, you could provide an initContainer that blocks until the headers are downloaded and ready.
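Such a blocking initContainer could be as simple as the following sketch (the marker path is just an example; the DaemonSet would be responsible for creating it once the headers are in place):

```sh
# Wait for the header-downloading DaemonSet to signal that it is finished.
until [ -e /mnt/host/kernel-headers/.ready ]; do
  echo "waiting for kernel headers to be prepared by the DaemonSet..."
  sleep 5
done
```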
bake the headers for all the container-os kernel versions supported by GKE into a single image
That could work; depending on the size, the first pull would be slow, but after that it shouldn't be a problem and should have decent performance. We'd just need a way of tracking possible GKE releases for the image build / maintaining a list of targets.
GKE doesn't require the use of container-os, you can specify alternative VM images to use
Yeah, there is an ubuntu flavor as well. I imagine the headers are much easier to fetch for that. You raise a good point that the solution for GKE should be generalized enough to handle whatever flavors of OS image are available (my most recent knowledge was an optimized ubuntu, as well as container OS).
I'm going to focus on container OS, as that's the problem it is more practical for my use-case to implement a fix for here, and I think the ubuntu implementation will be pretty easy to add, as it should just be able to mount in the headers from the host via the /lib/modules hostmount (it may well work already).
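For reference, on the Ubuntu flavor the host should already expose headers through the usual symlink, so the existing hostmount may be enough. A quick check, assuming standard Ubuntu kernel packaging:

```sh
# Ubuntu's linux-headers package installs under /usr/src, and
# /lib/modules/$(uname -r)/build points at it, so mounting /lib/modules from
# the host should give bpftrace what it needs.
ls -l "/lib/modules/$(uname -r)/build"
```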
You don't really know what version of container-os you are running until you are running on the Node.
I can think of ways to get it from where kubectl is running, but none that are very elegant. One way would be to get this into the node metadata, such as with an on-boot daemonset that writes the build id after reading it from /etc/os-release, e.g.:
```
$ cat /etc/os-release
BUILD_ID=10895.91.0
NAME="Container-Optimized OS"
KERNEL_COMMIT_ID=ff03fe06c0fc35868c5ada1306e9471a48bec9c3
...
```
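A hypothetical sketch of what such a daemonset pod could run (assuming kubectl is available in the pod, the host's /etc/os-release is mounted at /host/etc/os-release, and NODE_NAME is injected via the downward API):

```sh
# Read the COS build id from the host and record it as a label on this node.
BUILD_ID="$(. /host/etc/os-release && echo "$BUILD_ID")"
kubectl label node "$NODE_NAME" --overwrite "cos-build-id=${BUILD_ID}"
```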
However, I'm not certain that the client (probably dev on their workstation with kubectl) actually needs to have this information. It's probably best to keep the client logic as simple as possible, and push this out to the tracing container, where it's trivial to fetch the build id.
An initContainer that downloads the headers is another solution I've considered. It has the downside of forcing the user to download these headers every time the Pod is scheduled. I feel this would be a fairly high cost to the user experience.
I think that an initContainer solution is probably the simplest, and it can be optimized so that it's still a good experience. In the environment I am working in, I already have a CI pipeline for chromiumos that generates kernel headers from the gentoo ebuild that google uses for container os, so I can just fetch a small archive with the stripped headers from a private GCS bucket.
That's not really feasible for open source, though. I think a more reasonable approach is to do something similar to what sysdig does in an init container: detect the build ID and download the sources on the fly. See https://github.com/draios/sysdig/issues/1061, which gives a decent overview of how this can be done.
This might be tricky as, from what I remember, most of the host's filesystem is mounted readonly.
At least in our environment, this is pretty easy. There is a writeable partition at /mnt/stateful_partition, where these could reside. Just picking a standardized folder, such as:
/mnt/stateful_partition/kernel_headers/$BUILD_ID
The initContainer would then only need to do this once per machine, with all other uses of the tracer on the same node being able to use these cached headers by sharing this read-only mount. The initContainer could just bail quickly (and use the same image as the target container) if the sources have already been downloaded and installed correctly.
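A minimal sketch of that bail-early check (the marker-file name is an assumption, and /etc/os-release here is assumed to be the host's, mounted into the initContainer):

```sh
BUILD_ID="$(. /etc/os-release && echo "$BUILD_ID")"
HEADERS_DIR="/mnt/stateful_partition/kernel_headers/${BUILD_ID}"

# If a previous run already installed headers for this build, bail out immediately.
if [ -e "${HEADERS_DIR}/.installed" ]; then
  echo "kernel headers for build ${BUILD_ID} already present, nothing to do"
  exit 0
fi

# ...otherwise fetch kernel-src.tar.gz for this build, extract the headers into
# ${HEADERS_DIR}, and create ${HEADERS_DIR}/.installed on success.
```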
If this approach seems sane to y'all I'll start implementing the script for an initContainer, and inject it into the spec. The initContainer could probably be generally useful as an entrypoint for ensuring that the correct headers are installed, and extended to help fix quirks for other platforms if necessary.
Based on the sysdig work and @jasonkeene's script, I've cooked up this script, which will ensure that the headers are installed on container OS:
https://gist.github.com/dalehamel/15efcf02e6fd7999b5a151cd2b5702e8
On the first run, it takes 24 seconds to fetch, extract, and generate the necessary headers to the correct location. After this, if the script is run again it returns nearly instantly thanks to the dotfile checks.
So basically, this (well, a cleaned-up version with a switch statement to handle different distros, probably) would serve as the initContainer's main script, which would just need to be configured with a path to install the headers to, and a mount for /etc/lsb-release from the host. The first time a machine attempts to trace something, this will install the headers on the machine, to be shared by all future tracers.
All the trace container needs to do is mount in the directory prepared by the initContainer and everything should just work :tm:
@fntlnz do you have any thoughts on the initContainer approach? IMO this is the cleanest separation of concerns. The impacts to kubectl-trace at large are: BCC_KERNEL_SOURCE and BPFTRACE_KERNEL_SOURCE env vars for where the headers should be loaded from, with the default being what is done presently. This might actually knock out two issues at once, as this is quite similar to https://github.com/iovisor/kubectl-trace/issues/7

I'm going to go ahead, assuming the following:
Here's my work in progress PR https://github.com/iovisor/kubectl-trace/pull/48 that covers the gist of the initContainer approach I discussed above.
I haven't got it working end-to-end yet, as I've run into #7 being a problem, so I'll need to figure out a way to prove that this works by hackily overriding the header location; then we can figure out what the correct way to override the header location should be before the PR can be merged.
@dalehamel I love the initContainer approach, it is very well thought out. I looked at your PR, and having to find the headers doesn't seem like a big problem since they are cached after the first time.
Also, do you think it would be a good idea to ignore the initContainer if the user explicitly sets BPFTRACE_KERNEL_SOURCE, after implementing #7?
All looks good to me as of now. I see there's still something to do, so let me know if I can help with anything, and thanks for finding and implementing a solution.
Also, do you think it would be a good idea to ignore the initContainer if the user explicitly sets BPFTRACE_KERNEL_SOURCE, after implementing #7?
Yeah, I think depending on if / how #7 is implemented, we can probably ignore the initContainer by just not adding the necessary stanzas to the job spec. However, I think that's something we should take to a discussion in #7, to avoid making the existing PR #48 any more complex.
All looks good to me as of now. I see there's still something to do, so let me know if I can help with anything, and thanks for finding and implementing a solution.
As I mentioned in the PR, just some QA would be great to make sure I haven't broken functionality for users who already had working headers mountable from the host. I haven't been able to test that codepath yet, but it should just work :tm:
Aside from that, this should be pretty much ready to be merged; it works as I expect it to, and I think it is general enough that the initContainer can be used for other systems that might have problems finding/loading headers in the future.
It looks like the image used by the tracer doesn't support GKE, as it doesn't have the necessary headers. I get the following output when I try to run the trace:
Looking at the image build, it's not surprising that this is the result. @jasonkeene's towel tool, which is quite a similar tool, solves the problem of fetching google's headers in this way: https://github.com/jasonkeene/towel/blob/master/docker/download-chromium-os-kernel-source
Perhaps some logic could be added to check the host OS version and, if it is google's container os, download the headers using a similar methodology?