jsturtevant opened this issue 1 year ago
As noted in the meeting today, a PR clarifying/fixing this would definitely be useful (even if only to have something slightly more concrete to discuss around :bow:).
Is there a specific reason to prefer `null` instead of, say, an empty list? (thus still requiring the field, but moving from a minimum of one required item to zero)
I'm separately curious: how does `layerFolders` end up empty in this case? Shouldn't wherever the base image is mounted be in that list, and perhaps a scratch directory after that? Even without a base layer (e.g. in a root-is-a-host-mount setup) I'd have expected that to be a bind mount to a scratch folder location, so `layerFolders` would have a single entry.
It's possible I'm misunderstanding something, as I'm not recently-familiar with host process containers so I am unaware of any particular mounting-behaviour differences.
Edit: Trying to educate myself through GitHub's web interface, I found in https://github.com/containerd/containerd/pull/6618#issuecomment-1059017171 that `layerFolders` was logged as `nil` in a non-host-process execution. Poking around in hcsshim, I think it'll reject host-process containers with fewer than two entries in `layerFolders`, like it does for WCOW process isolation. It's the same code path: it mounts the layers the same way, and host-process containers are then also bind-mounted into the host.
So is this `nil` actually an intermediate state? I didn't dig too far into it, but (Edit: lol. lmao.) I suspect that the hcsshim runtime shim gets the structure with the `nil`, and then populates `layerFolders` from the runtime v2 Create call before it uses it, but at some point it's serialised to disk before that happens (presumably the serialisation that discovered this problem is literally what containerd sent through, not what hcsshim actually passed down into the OCI->HCS converter).
containerd doesn't appear to have populated this field since the runhcs runtime was dropped in favour of the v2 runtime.
So right now, it appears containerd sends an out-of-spec OCI container config, and always has done in runtime v2. The hcsshim v2 shim is a willing participant in this, as it doesn't try to use that field.
My mild concern with making that field nullable or empty-list-okay is that such a container config will be rejected by the rest of the hcsshim machinery if you pass it in by something other than the runtime v2 shim's Create; conversely, there's no way for containerd to populate that field usefully, because the runtime shim's processing of the `Create` call is what currently makes the names that go in that field exist.
So perhaps a workaround is that the field is made nullable, but the spec notes that in this case, the receiver may reject the spec if it cannot populate that field from pre-existing/extrinsic knowledge. There are already a few cases in this same flow where the spec and containerd can agree, but hcsshim then rejects or locally fixes the provided configuration for non-spec-visible reasons.
Anyway, it doesn't look like this is actually related to host-process containers to me, but it might be easier to trigger there?
Note that moby/moby does mount and populate layerFolders, which is also how the runhcs runtime in containerd used to work. It's possible (but not certain) that when the containerd image store work comes to Windows (which would make the containerd backend mandatory), then this will stop being the case, if the containerd backend even still supports this caller-managed-mount flow.
> So is this `nil` actually an intermediate state? I didn't dig too far into it, but (Edit: lol. lmao.)
🤣 Thanks as always for your attention to detail
> Anyway, it doesn't look like this is actually related to host-process containers to me, but it might be easier to trigger there?
I created a process-isolated container via CRI to double check, and found that it is also writing a `config.json` file with a blank config:

    k apply -f https://raw.githubusercontent.com/jsturtevant/windows-k8s-playground/master/deployments/whoami-windows.yaml

    C:\ cat ProgramData\containerd\state\io.containerd.runtime.v2.task\k8s.io\a7e1ae0ba90b8209c0d85ea1560d8c25353a8ae3b46e8a77ac62e193b6ce08a9\config.json
    --- snip ---
    "io.kubernetes.cri.sandbox-namespace":"kube-system","io.kubernetes.cri.sandbox-uid":"1a11517f-b759-473e-9337-d6be68218a12","microsoft.com/hostprocess-container":"true"},
    "windows":{"layerFolders":null,"resources":{"memory":{}},"network":{}}}
> So right now, it appears containerd sends an out-of-spec OCI container config, and always has done in runtime v2. The hcsshim v2 shim is a willing participant in this, as it doesn't try to use that field.
What do you think about fixing this in containerd? Would that be the right approach? I am working on a different shim and I can certainly make it work this way but it feels wrong to work around it.
I'm not sure we can fix this in containerd, because the correct values don't actually exist when containerd passes the container config down to the shim, if it's using the `Rootfs` field of the runtime v2 shim's Create message (which appears to be the current state on Windows).
We'd either need to revert containerd to do all the layer-folder mounting itself before calling the runtime shim, or change the runtime shim protocol so that `layerFolders` contains dummy paths that are overwritten by the Create message's paths. This would require changes in the shim implementation in hcsshim (an ABI break!) since it currently (incorrectly?? Semantically that makes no sense) appends to `layerFolders`.

Very late edit: hcsshim 0.12 fixes this to reject a create request with both a `RootFS` in the containerd request and `Windows.LayerFolders` in the OCI spec.
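Illustratively, that reject-the-ambiguity check looks something like this (a sketch only, not hcsshim's actual code; `requestLayers` stands in for whatever the shim derives from the request's rootfs mounts):

```go
package shim

import (
	"errors"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// rejectAmbiguousLayers refuses a create request that carries layer information in
// both the OCI spec (windows.layerFolders) and the runtime v2 request's rootfs
// mounts, since there's no sensible way to merge the two (appending certainly isn't it).
func rejectAmbiguousLayers(spec *specs.Spec, requestLayers []string) error {
	specHasLayers := spec.Windows != nil && len(spec.Windows.LayerFolders) > 0
	if specHasLayers && len(requestLayers) > 0 {
		return errors.New("layer folders supplied in both windows.layerFolders and the create request's rootfs")
	}
	return nil
}
```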
I think making `layerFolders` optional with a note that "receiver needs to handle or reject this situation" as I described above would best resolve this.
Longer-term, it might be feasible to subsequently deprecate layerFolders entirely, as apart from the existing calls in Docker (which will hopefully be replaced by containerd anyway), it's currently only ever populated internally by hcsshim long enough to flow into other parts of hcsshim and be converted out again.
I noticed while poking at this that the spec file in the original post has another invalid aspect: it has

    "root": {
        "path": ""
    },

and for a process-isolated Windows container, that string is a mandatory volume path, which again, containerd doesn't have before it calls the shim as it's not doing the mounting itself.
This does provide an example of a 'formally-optional' field that has semantically-defined requiredness, which is ignored when passing from containerd into the runtime shim.
A final option is to say that this config.json is not a valid OCI spec, but merely a part of one being passed around as part of the containerd runtime v2 shim API. Since you're implementing a shim, this seems like the worst possible outcome for you, as it punts a bunch of extra parsing/handling logic for the potential differences onto you.
Given the redundancy between the OCI container spec and the runtime v2 API it probably should have been originally documented that way.
Thanks for the analysis. Took me a while to digest it all as this area is new for me. I did a deep dive and see what you are saying. This issue of the runtime config not being correct is the same for both HostProcess and process-isolated containers.
A few comments.
> and for a process-isolated Windows container, that string is a mandatory volume path, which again, containerd doesn't have before it calls the shim as it's not doing the mounting itself.
The root path is generated by hcsshim during the "create" and is then passed to the system call that creates the container (see the task create call and the spec set during create). This happens because the file system needs to be turned into a volume mount, which containerd isn't doing at this time, as you point out.
I am not too sure what to do about this one. It seems like it is no longer really required in the containerd flow and I am not sure what scenarios it would be useful in. I am likely not going to need it either (due to the same layering situation).
> This would require changes in the shim implementation in hcsshim (an ABI break!) since it currently (incorrectly?? Semantically that makes no sense) appends to `layerFolders`.

> I think making `layerFolders` optional with a note that "receiver needs to handle or reject this situation" as I described above would best resolve this.
I was looking into this and ctr generated these by calling get snapshot and generating mounts. The shim then just re-orders these into the correct order for the layers field. We could do the same when generating the runtime config and the result would be the same.
But as you point out, hcsshim appends the layers from the rootfs request, so it would require changes in hcsshim either way. I am not sure I understand why changing hcsshim to not append would be a breaking change though... I believe the result would be the same, i.e. if `layerFolders` is present then skip it, otherwise if empty take the layers from the request field and do the current logic. What am I missing?
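Roughly what I'm picturing (an illustrative Go sketch, not the actual hcsshim code; `requestLayers` is whatever the shim already derives from the `CreateTaskRequest` rootfs mounts):

```go
package shim

import specs "github.com/opencontainers/runtime-spec/specs-go"

// fillLayerFolders leaves windows.layerFolders alone when the caller already populated
// it, and otherwise fills it from the layer paths derived from the CreateTaskRequest's
// rootfs mounts (already re-ordered as needed), instead of unconditionally appending.
func fillLayerFolders(spec *specs.Spec, requestLayers []string) {
	if spec.Windows == nil {
		spec.Windows = &specs.Windows{}
	}
	if len(spec.Windows.LayerFolders) > 0 {
		return // already present: skip, don't append or overwrite
	}
	spec.Windows.LayerFolders = requestLayers
}
```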
> A final option is to say that this config.json is not a valid OCI spec, but merely a part of one being passed around as part of the containerd runtime v2 shim API.
It seems that if we were able to fill in the `layerFolders` field, this would make the runtime config.json valid. In other words, another runtime could use that config.json to do all the work required (since mounting the root and getting its path could be an optional step).
> I was looking into this and ctr generated these by calling get snapshot and generating mounts. The shim then just re-orders these into the correct order for the layers field. We could do the same when generating the runtime config and the result would be the same.
True, but then we're just duplicating work (and implementation) on both sides, and introducing more Windows-specific code into containerd's cross-platform code (rather than hiding it in the WCOW snapshotter as we do now with the parentLayerFolders JSON sidechannel). It also duplicates the information in the containerd runtime V2 shim use-case, which is likely to be the only use-case once Docker migrates to containerd on Windows.
Since there's only currently one consumer of this field (`LayerFolders`) and it clearly doesn't care about it being empty when it otherwise knows what goes there, I'd suggest making that the specified behaviour. It also allows potentially-layerless containers, such as host-process containers without filesystem isolation. (That's probably the only use-case for layerless containers, though.)
> So it would require changes in hcsshim either way. I am not sure I understand why changing hcsshim to not append would be a breaking change though...
Because newer containerd with older hcsshim would end up with a double-list of `layerFolders`, and fail elsewhere (I suspect the failure would only happen inside HCS, as nothing in hcsshim is going to check that the layer folders are a valid chain of layers from base to scratch; it just translates from OCI format to HCS format).
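To illustrate with hypothetical paths: if a newer containerd started filling the field in, an older hcsshim that still appends would produce:

```go
package main

import "fmt"

func main() {
	// Hypothetically populated by a newer containerd in config.json...
	layerFolders := []string{`C:\layers\base`, `C:\layers\scratch`}
	// ...and derived again by an older hcsshim from the Create request's rootfs mounts.
	requestLayers := []string{`C:\layers\base`, `C:\layers\scratch`}

	layerFolders = append(layerFolders, requestLayers...) // the current append behaviour
	fmt.Println(layerFolders)
	// [C:\layers\base C:\layers\scratch C:\layers\base C:\layers\scratch]
	// That's no longer a valid layer chain, so presumably HCS itself would reject it.
}
```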
I personally think a forced hcsshim/containerd lockstep upgrade needs more justification than "sending extra information to hcsshim that is just going to be immediately overwritten with the existing information".
Anyway, I think the approach of changing containerd and hcsshim's behaviour here would be better discussed on one of their bug-trackers.
As I mentioned before, I do agree that appending the Mounts to the LayerFolders (rather than overwriting or validating that they are the same) is probably a mistake (I can't think of a situation where this is actually going to be valid... prepending, maybe), but the maintainers of those stacks are going to be better placed to look into the blast radius, and the potential upgrade problems for users, of changing that.
And fixing the spec to match the real-world behaviour (or simply declaring that this config.json is not required to be a valid OCI container) seems like a simpler thing to change in the meantime.
Honestly, I'd love to get rid of `layerFolders` entirely, and make `root.path` better specified for mounting-is-done-by-the-runtime use-cases, e.g., kata-containers would have the same issue, I expect. But I don't see a good path forward for that, considering value and cost/effort.
> True, but then we're just duplicating work (and implementation) on both sides, and introducing more Windows-specific code into containerd's cross-platform code (rather than hiding it in the WCOW snapshotter as we do now with the parentLayerFolders JSON sidechannel). It also duplicates the information in the containerd runtime V2 shim use-case, which is likely to be the only use-case once Docker migrates to containerd on Windows.
Is the intent of the config.json and runtime spec that a runtime should be able to configure a container from the config.json? The layers that make up a container are required to properly configure a container. Using containerd's V2 flow means that other runtimes would need to know and adapt to what containerd's runtime does, and that the config.json doesn't have enough info to create a container.
Even though this is Windows-specific code, it is similar to what containerd has to do to generate a compliant runtime config.json for Linux-specific config.
> Anyway, I think the approach of changing containerd and hcsshim's behavior here would be better discussed on one of their bug-trackers.
Agree, though what is decided in the spec should drive that behavior.
> Honestly, I'd love to get rid of `layerFolders` entirely, and make `root.path` better specified for mounting-is-done-by-the-runtime use-cases, e.g., kata-containers would have the same issue, I expect.
This does seem like a good approach, but I don't know if I have enough context to know where this should go.
First up: I checked, and per the README, the bundle directory is documented to be an OCI bundle.
That said,
> Is the intent of the config.json and runtime spec that a runtime should be able to configure a container from the config.json?
Yes, that is the intent of the OCI Bundle and config.json, but
> Using containerd's V2 flow means that other runtimes would need to know and adapt to what containerd's runtime does, and that the config.json doesn't have enough info to create a container.
we're talking about the config.json generated by containerd to pass to implementations of the containerd v2 runtime shim API; adapting to that is already the core requirement of those implementations. They can't ignore the RPCs from containerd and just take the bundle directory as a ready-to-start OCI Bundle, particularly because they'd have no rootfs. They can (and do) ignore pieces of config.json that either don't apply or conflict with the RPCs from containerd, such as the rootfs path.
We're not talking about random config.json found lying around in the world.
> what is decided in the spec should drive that behavior.
I think this is backwards. The spec should follow implementation experience, which is why I suggest discussing this with the implementors.
My current feeling is that requiring that the CWD of the runtime shim be a valid OCI Bundle is probably a misspecification, as apart from the runc case where the shim just passes config.json through unchanged, nothing else cares if this is a valid and complete OCI Bundle, as not all fields are relevant to how all runtimes function. (Also, as a nit-pick, if `root.path` is set, then it's not a valid OCI Bundle, as that would require the rootfs to be already mounted, but the runtime v2 shim spec explicitly requires the shim to take care of any mounting.)
Even for the runc shim, I feel like this is a risky design, as there are two paths carrying putatively the same information (`CreateTaskRequest` and config.json) and right now the runc shim does not read the config.json, so there's no validation that they are actually the same. It probably should be updating the config.json to match the rootfs, terminal state, etc., passed in via `CreateTaskRequest`.
kata-containers, for example, loads the config.json and then proceeds to overwrite pieces, and cherry-pick the things it needs from it, ignoring other pieces in favour of `CreateTaskRequest`. I didn't trace all the way through, so I'm not sure if it does actually use the read-in spec data directly at any point, but since it runs things inside a hypervisor which may not be able to access host directories, I can't see how it could expect to use that config.json unmodified.
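That load-then-override pattern looks roughly like this (a generic sketch, not kata's actual code; the overridden field is just an example):

```go
package shim

import (
	"encoding/json"
	"os"
	"path/filepath"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// loadSpec reads the bundle's config.json, then lets values carried in the runtime v2
// CreateTaskRequest win where the two overlap (here, just the rootfs path).
func loadSpec(bundlePath, rootfsFromRequest string) (*specs.Spec, error) {
	data, err := os.ReadFile(filepath.Join(bundlePath, "config.json"))
	if err != nil {
		return nil, err
	}
	var spec specs.Spec
	if err := json.Unmarshal(data, &spec); err != nil {
		return nil, err
	}
	if rootfsFromRequest != "" {
		if spec.Root == nil {
			spec.Root = &specs.Root{}
		}
		spec.Root.Path = rootfsFromRequest // the request is authoritative on conflict
	}
	return &spec, nil
}
```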
I'd suggest that maybe the config.json in the containerd v2 runtime shim be respecified as "OCI container spec except the following fields are optional because they also exist in the `CreateTaskRequest` (the latter is authoritative in case of conflict)" and also not require the rootfs to be present (since it's not now, by design of the runtime v2 shim).
On the other hand, maybe the containerd maintainers will feel it's worth trying to get more of the config.json populated with information that also appears in the containerd v2 runtime shim protocol, sufficient to let a spec-compliant syntax parser work. The semantics and platform-varying requirements still would not be checked (in some cases because they cannot be validly populated, and in some cases because very few JSON parsers can actually validate fields whose valid states depend on other fields), and they would not actually match the behaviour of the shim (i.e. config.json would remain unreliable for introspection). It would also mean requiring a specific minimum version of the Windows v2 shim, because the existing back/forwards ABI compatibility is already broken by the Windows shim (faultily) depending on receiving a spec-invalid config.json.
Perhaps another, better solution will be discovered?
I can't see any further value being derived from just poking at the spec doc at this point, myself, since any changes made to the spec for this use-case are simply carving out exceptions for how containerd uses config.json in its own API, and in turn make OCI Bundles less reliably runnable for their actual use-case: being a complete and self-contained runnable container spec. (Making `layerFolders` optional feels mostly like this, although there's an independent case to be made that the current `root.path` spec for Windows means `layerFolders` should have been optional when the root is passed in as a volume name, as the `layerFolders` work has already been done by the caller to create that volume. But I'd have to check hcsshim to be sure how it resolves that conflict today, in the non-containerd case.)
An option would be to take the time to review the config.json spec in the face of the world having gotten more complicated than runc, crun, and (briefly) runhcs, which is a much wider task and could lead to cleanups as I mentioned above around removing `LayerFolders` and generalising `root` to be able to describe nearly-arbitrary rootfs sources.
Thanks for the details, I've learned a ton from this thread.
That all makes sense to me. It seems as though there has been some drift between the implementation and the spec (specifically the runtime bundle) and how it is used.
> An option would be to take the time to review the config.json spec in the face of the world having gotten more complicated than runc, crun, and (briefly) runhcs, which is a much wider task and could lead to cleanups as I mentioned above around removing `LayerFolders` and generalising `root` to be able to describe nearly-arbitrary rootfs sources.
I don't know what the community's appetite for this would be or where to best begin the conversation.
In the meantime, for my case, I can use a similar approach and build the layers I need from the containerd V2 `CreateTaskRequest`, given I can make the changes in https://github.com/containers/oci-spec-rs/issues/126 to accept empty or nil.
Is it worth opening a PR here to make those fields we discussed (`root.path` and `windows.layerFolders`) optional?
Opening such a PR might be worth it, as it'll make the proposed changes concretely visible, which is probably a good basis for discussion. I wouldn't wait for the resolution though; getting the Rust spec reader to not panic when a non-optional field is missing seems reasonable.
The outcome might be the weakening of the field from "minimum 2 entries" to "minimum empty list", and then fixing the places (like `ctr`, mentioned in https://github.com/containers/oci-spec-rs/issues/126) that are putting `nil` there. (Or fixing the output in that case, since inside Go, `nil` is deliberately interchangeable with an empty slice in slice operations, but, as you noted, the built-in JSON serialiser doesn't treat the two the same.)
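For reference, the encoding/json behaviour in question is easy to demonstrate: a nil slice marshals as `null`, while a non-nil empty slice marshals as `[]`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type windows struct {
	LayerFolders []string `json:"layerFolders"`
}

func main() {
	a, _ := json.Marshal(windows{LayerFolders: nil})
	b, _ := json.Marshal(windows{LayerFolders: []string{}})
	fmt.Println(string(a)) // {"layerFolders":null}
	fmt.Println(string(b)) // {"layerFolders":[]}
}
```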
Propose to update the Runtime spec to allow for null or empty in the `Windows.layerFolders` field. I found an issue where this wasn't being parsed correctly: https://github.com/containers/oci-spec-rs/issues/126. It works for HostProcess containers in the Go implementations due to the way Go serializes lists.

The schema and spec for Windows state the `layerFolders` field should be a minimum of 1 item:

There was recently work to enable Host Process containers for Windows, and a scratch image was created for it. When running that image, the runtime config doesn't have a layer folder: