Some documentation and related work on that: `curl` and `sha256sum` (from coreutils), sadly not stating a license, and not extracting the layers yet: https://gist.github.com/mickep76/1ca682f258d4ab43569b5b550bc0e66e

Let me know if I should give it a shot (but I'm unsure how fast I will be). My approach would be to use bash + curl + sha256sum only (+ tar of course), but one also needs a temporary directory, or better yet a persistent, configurable cache directory for layer storage (to make subsequent pulls faster). Configuration could be done via an environment variable or an optional command line argument to the new tool (`ch-dockerhub2dir`?), and if unset, default to a non-persistent `mktemp -d` which is cleaned up afterwards.
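A minimal sketch of that approach, assuming Docker Hub's token endpoint and registry v2 API (the `CH_LAYER_CACHE` variable, the grep-based JSON parsing, and the hard-coded image are illustrative only, not a proposed interface):

```bash
#!/bin/bash
# Sketch: pull library/alpine:latest using only curl, sha256sum, and tar.
set -e

image=library/alpine
tag=latest
cache=${CH_LAYER_CACHE:-$(mktemp -d)}  # hypothetical cache variable; temp dir if unset
dest=./alpine.rootfs
mkdir -p "$dest"

# 1. Get an anonymous pull token.
token=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${image}:pull" \
        | grep -o '"token":"[^"]*"' | cut -d'"' -f4)

# 2. Fetch the v2 manifest and pick out the layer digests. Crude parsing:
#    the first digest in a schema 2 manifest is the image config, so skip it.
digests=$(curl -s -H "Authorization: Bearer $token" \
               -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
               "https://registry-1.docker.io/v2/${image}/manifests/${tag}" \
          | grep -oE 'sha256:[0-9a-f]{64}' | tail -n +2)

# 3. Download (or reuse from cache), verify, and extract each layer in order.
#    The digest is the sha256 of the compressed blob, so sha256sum can check it.
for d in $digests; do
    blob=$cache/${d#sha256:}.tar.gz
    [ -f "$blob" ] || curl -sL -H "Authorization: Bearer $token" \
                           -o "$blob" "https://registry-1.docker.io/v2/${image}/blobs/${d}"
    echo "${d#sha256:}  $blob" | sha256sum -c -
    tar -xzf "$blob" -C "$dest"
done
```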
I think the way to demonstrate this would be to create an image test that uses a `Build` script to pull down an image using the above techniques. This would be orthogonal to whether there's a first-class script (which should probably produce a tarball, not a directory).
Currently we assume the `wget` command line utility is available at `make test-build` time, but not `curl`. I'd like to keep just one or the other, but I'm open to changing if that's appropriate. There is a comment to this effect somewhere.
I want to make sure not to use or re-implement any Singularity code because the license situation is unclear. While it's advertised as BSD 3-clause by GitHub, there's a second `LICENSE-LBNL.md` in the repo that's different. At best the situation is fuzzy, and at worst Singularity is not open source (the latter license hasn't been evaluated by anyone other than LBNL). I've asked Greg about this but not gotten any clarification I was satisfied with, and at the time only the latter file was in the code.
> I think the way to demonstrate this would be to create an image test that uses a `Build` script to pull down an image using the above techniques [...]
Ok, so just to clarify, you mean writing a new `build-pull-from-docker.bats`, which pulls an image (such as Alpine) using `wget` (should be possible; I can give it a spin) and can be used alternatively / additionally to `build.bats` (so tests also work if Docker is not there), and then later the code can be migrated to / incorporated in a first-class script? I can give that a try (I'm new to bats, but it looks rather useful and is mostly just bash).
> [...] (which should probably produce a tarball, not a directory).
Agreed. My choice of "dir" was two-fold here:
> I want to make sure not to use or re-implement any Singularity code

I also fully agree. I only put the link to show the general approach for comparison, but I don't think that's the way to go. The API docs are clear enough, and it's possible to do everything in a few bash commands, without requiring a Python class in addition.
> Ok, so just to clarify, you mean writing a new `build-pull-from-docker.bats`, which pulls an image (such as Alpine) using `wget` (should be possible; I can give it a spin) and can be used alternatively / additionally to `build.bats` (so tests also work if Docker is not there), and then
If you create an executable file named exactly `Build` or `Build.foo`, the test suite will pick it up as another test image. The details are described in `test/README`; if it's not clear, that's a bug.
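For illustration, a skeleton of such a script (the positional parameters here are placeholders, as is the helper `pull.sh`; the authoritative calling convention is whatever `test/README` specifies):

```bash
#!/bin/bash
# Hypothetical test/Build.pull-docker skeleton. $1/$2/$3 are assumptions,
# not the documented interface; consult test/README for the real one.
set -e
srcdir=$1    # directory containing this script (assumption)
tarball=$2   # output tarball the test suite expects (assumption)
workdir=$3   # scratch space (assumption)

cd "$workdir"
"$srcdir/pull.sh" library/alpine latest ./rootfs  # e.g., the pull sketch above
tar -czf "$tarball" -C ./rootfs .
```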
> later the code can be migrated to / incorporated in a first-class script? I can give that a try (I'm new to bats, but it looks rather useful and is mostly just bash).
I think it's orthogonal to whether there's a first-class script, and whether to make such a script right now depends on the level of demand for it. I'm not familiar with this use case of Docker, so I don't know the answer. This issue is a suggestion from @louisvernon.
> It might be useful to actually directly keep that directory (in case you want to use the container on the very same machine).
My vote would be to remove the temporary directory by default because then the script would have a single well-defined product. How about a command line option to leave it?
We could also document what's expected in Charliecloud tarballs and directories (example: I don't recall where we make `/mnt/[0-9]`), but that's probably a different issue. @jlowellwofford is working on a validator script for image directories; maybe that's where we document it.
I personally would expect this (pulling an image from a Docker registry) to be a common usage model for prospective Charliecloud users.

I have some simple scripts, including a modified version of the @mickep76 gist linked above, that download the layers, extract them in the order defined in the manifest, and create `oldroot`. The resulting output appears to work correctly with Charliecloud.

I can submit a PR with this refactored in the form of the proposed `ch-dockerhub2dir` command. Some thoughts:
> The details are described in `test/README`; [...]
I missed the obvious `README` and tried to deduce behaviour from the code - I think that's a bug in my head, but thanks for resolving that; the `README` is very instructive!
> [...] whether to make such a script right now depends on the level of demand for it.
I would personally also be interested in it. The idea could be that a "build workflow" including publishing to a Docker registry already exists, and Charliecloud is "only" used as the runtime in an environment in which neither a shared filesystem exists nor Docker is installed (e.g. in a public cloud). Using `wget` and basic tooling only, Charliecloud could then still fetch and run images from Docker Hub (or another registry) without the need to have Docker installed.
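For example, the token and manifest calls from the curl sketch above translate directly to wget (same illustrative caveats as before):

```bash
# Sketch: the same Docker Hub registry calls using wget instead of curl.
token=$(wget -qO- "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/alpine:pull" \
        | grep -o '"token":"[^"]*"' | cut -d'"' -f4)
wget -qO manifest.json \
     --header="Authorization: Bearer $token" \
     --header="Accept: application/vnd.docker.distribution.manifest.v2+json" \
     "https://registry-1.docker.io/v2/library/alpine/manifests/latest"
```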
@louisvernon: Sounds like you are already there, so I'll leave the PR to you, of course ;-).
An additional comment: `/oldroot` is not required anymore as of https://github.com/hpc/charliecloud/commit/052e7f1424e95261e45925450c4226338283e8ad
Sounds good. I'd say feel free to proceed on a PR if you like, especially if @louisvernon doesn't speak up within a few days. From talking with him (he's also at LANL), he's pretty busy and hasn't laid any particular claim to a PR beyond asking for the feature.
Please ignore previous comment about @louisvernon; not paying attention; my apologies.
I learnt via Singularity that there's a further issue. If a layer deletes something from a previous layer, it seems this is done via "whiteout files" which declare which files should be deleted.
This is also not (yet) handled in Singularity; there are PRs here:
I'm not sure how complicated it needs to be; potentially one could also think about externalizing this, i.e. developing an independent tool doing just that. Maybe it already exists; I have not researched this yet.
If you use tar to extract the layers, you want to make sure to use `--overwrite`; this avoids corner cases where tar might not overwrite an existing file.
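A sketch of one way to combine both points during extraction, assuming the AUFS-style whiteout convention (`.wh.<name>` deletes `<name>` from lower layers; `.wh..wh..opq` marks an opaque directory); the function name and structure are illustrative:

```bash
# Extract one layer into $dest, applying its whiteouts first.
# Call once per layer, in the order given by the manifest.
extract_layer() {
    local blob=$1 dest=$2

    # 1. Apply whiteouts: for every .wh. entry in the layer, delete the
    #    corresponding path that earlier layers created.
    tar -tzf "$blob" | grep '\.wh\.' | while read -r entry; do
        dir=$(dirname "$entry")
        base=$(basename "$entry")
        if [ "$base" = ".wh..wh..opq" ]; then
            rm -rf "${dest:?}/$dir"/*              # opaque dir: drop lower contents
        else
            rm -rf "${dest:?}/$dir/${base#.wh.}"   # ordinary whiteout
        fi
    done

    # 2. Unpack, skipping the whiteout markers themselves. --overwrite avoids
    #    the corner cases mentioned above. (The crude exclude pattern would
    #    also skip legitimate files whose names contain ".wh.".)
    tar -xzf "$blob" -C "$dest" --overwrite --exclude='*.wh.*'
}
```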
@louisvernon, are you still planning to submit a PR on this?
Hi, in related news: I have started a project here: https://github.com/olifre/dockerhub2oci which focuses on downloading from Docker Hub and creating an extracted OCI image. It's still in very early stages; for example, it does not handle whiteout files yet, which appears to be not so easy if one plans to do it in a safe way.

Skopeo (https://github.com/projectatomic/skopeo) also seems able to do this, but it has quite a few dependencies and a lot of other features, which is why I'm trying to do something simpler and shell-based.
A job change and house move put this on the back burner. Whiteouts weren't on my radar, so it looks like @olifre is way ahead on this effort. If someone else wants to take this, be my guest. I did find this more actively maintained script to pull in the layers: https://github.com/moby/moby/blob/master/contrib/download-frozen-image-v2.sh
@louisvernon The script you linked also has explicit support for multi-arch manifests (but does not appear to handle whiteouts / the actual extraction I think).
Seeing all the special cases that can occur, I am no longer sure whether it's actually feasible / a good idea to integrate this into Charliecloud (as an extra script), or whether it's better to point to the existing external tools (skopeo, the script from moby, or my script once it's a bit more advanced) and potentially use them for the testing in case Docker is not available.
Additional options (both on PyPI too):
It looks like the basic procedure is: get an auth token, fetch the image manifest, download each layer blob, and extract the layers in the order given by the manifest.
Oh, this is totally doable! Ping me if you need help, because the call to get the manifests is exactly the same, but depending on the Accept header you get different versions back. This threw me off for months (at one point I was asking for a list), and I literally just figured out the different calls for each of version 1, version 2, config, and list recently --> https://github.com/singularityhub/sregistry-cli/blob/master/sregistry/main/docker/api.py#L139 Will be super cool to see this!
and the docs are much better now too, so I think this will be a lot quicker!
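Concretely, the manifest endpoint stays the same and only the Accept header selects what comes back (these media types are the standard Docker registry ones; the image and token are illustrative):

```bash
url=https://registry-1.docker.io/v2/library/alpine/manifests/latest
auth="Authorization: Bearer $token"

# Schema 1 manifest (typically what you get with no Accept header):
curl -s -H "$auth" -H 'Accept: application/vnd.docker.distribution.manifest.v1+json' "$url"
# Schema 2 manifest (layer digests suitable for blob pulls):
curl -s -H "$auth" -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' "$url"
# Manifest list (the multi-architecture "fat" manifest):
curl -s -H "$auth" -H 'Accept: application/vnd.docker.distribution.manifest.list.v2+json' "$url"
```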
Thanks for the tips, Vanessa.
I'll just add for anyone contemplating a PR on this to be careful of licensing. The policy of Charliecloud is to only take code from projects with well-defined licensing that's compatible with our Apache 2 license. In this case the `sregistry-cli` project looks to be Affero GPL v3, which AFAICT is not compatible. Please take advantage of @vsoch's advice, but don't look at the linked code.
Yes very good point! The AGPL is primarily done to keep things open source and publicly available, so for example, someone couldn't take it, privatize it, and then try to make profits off of it. They CAN take it, change it to their liking, and just release under the same conditions. The "wrong thing" would be akin to copy pasting the code, line for line, which I didn't mean to suggest, just checking out the general flow. And I would be happy to help (with my words!) to talk about how to implement something, or improve within sregistry too :)
I'm still not sure whether a solution "within" Charliecloud or a separate tool would be best.
In the end, Charliecloud only requires a "flattened" Docker Hub container, and one should ensure some directories are present (such as the one for the `pivot_root` magic).
This is similar for runc, railcar or Singularity, though Singularity in addition adds and expects some environment files inside the container directory (and appears to be focusing more and more on its special image file format).
So potentially, one tool could accommodate several container runtimes. It would also allow installing only the "pull and tar it up" tool separately, for example on a machine that pulls Docker containers and extracts them to a shared filesystem / CVMFS, while the actual container runtime (e.g. Charliecloud) would only be installed on the worker nodes, which could be kept separate.
This was my main intent behind producing https://github.com/olifre/dockerhub2oci separately. It's just a first effort and for sure not perfect, and maybe a better tool will pop up, so I am mainly questioning if the general direction should be an "integrated" tool or rather something external.
If you prefer something integrated, I could think about relicensing `dockerhub2oci` to Apache 2; I have only one contributor right now and could ask whether he is fine with that. My idea in choosing a GPL license was mainly to ensure copyleft, but I could give that up for this small piece of code.
Of course, if somebody knows of a more advanced project, this would be even better.
@olifre I have thought about this quite a bit too. If you account for all the future APIs / functions related to moving things and general inspection (search, inspect), there are seemingly infinite endpoints that a user could be interested in as places to obtain either layers or entire images. I was thinking of this in the context of Singularity, and so this was the rationale behind the Global Client (sregistry). While it would be infeasible to expect the core container software (Charliecloud, Singularity, etc.) to implement every possible endpoint-place-to-get-stuff, it's not such a crazy idea to have a small, modular client that provides the very basic functionality of retrieving and then dumping, handling all the API calls, authentication, and whatnot. Then (arguably) the user could use the client to first get the layers and build the container from a folder, or, if it's possible, the client could act as a plugin for the container software so it appears to happen seamlessly.
I'm not sure which is the better approach (separate with the option for integration, or just separate), but I see this need pretty strongly and would definitely be interested in helping develop it. For the Singularity Global Client I've navigated a list of the contender APIs (e.g., Docker Hub, Nvidia Cloud (which is the same really), Dropbox, Google Drive, Storage, Singularity Hub / Registry), so we would want to accomplish the same, but basically just remove the Singularity-specific commands to build the image. What do you think? And do you think this can all be accomplished with bash, versus something a bit higher-level that can more easily handle requests / string parsing and whatnot? I would love to help out on this, if there is some consensus!
I agree, that's a legitimate question and I don't know the answer either yet.
@olifre, either way, I'd love for `dockerhub2oci` to be relicensed to Apache 2 (or BSD/MIT/other compatible license). Even if we didn't incorporate the code, I'd love to be able to read it.
@vsoch Nice writeup! Yes, indeed, I also believe there is a strong need for such a tool. My main reason for deciding to do something in pure bash with a few external helpers was to have something very lightweight, without any dedicated scripting language, that even very security-oriented admins may run on e.g. their CVMFS server machine (which is actually what I'd like to use it for myself one day). So my reasoning to start with that was mainly a lack of time for something larger, plus an existing need ;-). Of course, to really support the several APIs, including authentication for private registries, multi-architecture containers and things like this, a higher-level language might be more appropriate, I agree.
@reidpr I have asked my only contributor in the linked issue whether he would be fine with relicensing to Apache 2.0; in case he accepts, I will change the license (and try my best to also convince GitHub to show this license, which I've never tried before).
@olifre: Great, thanks.
FWIW, Charliecloud's Apache license does not show, but it sure seems like it should (#122).
Hey, did you guys see https://github.com/jessfraz/img from @jessfraz? And human documentation is here. It's a daemon-less, OCI-compliant builder, and it seems to address much of what we are talking about here (but I haven't tried it!).
@reidpr `dockerhub2oci` (at https://github.com/olifre/dockerhub2oci) is now relicensed to Apache 2.0 after all contributors (me and one other) agreed.
Also, I managed to convince GitHub of the change by first deleting the old license in a commit, and then using their web interface ( https://help.github.com/articles/adding-a-license-to-a-repository/ ) to add the new license file, afterwards squashing the commits together.
However, the project pointed out by @vsoch looks far more advanced than my efforts! I missed it in my research at the beginning of January 2018, likely since it was almost non-existent at that point in time. Right now, however, it seems this project indeed addresses most of the points discussed here, and it's under the MIT license.
I believe my `dockerhub2oci` is still a valid approach for a small, almost-no-dependencies tool to pull Docker containers, but for anything more advanced, https://github.com/jessfraz/img appears to be the better choice. I have not tested it yet, though.
@olifre this is great news, and there is no reason to drop your work just because another codebase exists. There are many google (oops, Freudian slip?) examples of an "underdog" turning out to be very important. I will continue to support your effort.
Update: We're now using skopeo and umoci to pull images. The obvious scripts (e.g., `ch-pull2dir`) haven't been updated yet, but `ch-grow`, which is written in Python 3, uses them.
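For reference, a sketch of that sequence with stock skopeo and umoci (the image name and paths are illustrative, and this is not necessarily exactly how `ch-grow` invokes them):

```bash
# Copy the image from Docker Hub into a local OCI layout ...
skopeo copy docker://alpine:latest oci:./alpine-oci:latest
# ... then flatten the layers (whiteouts and all) into a root filesystem.
umoci unpack --rootless --image ./alpine-oci:latest ./alpine-bundle
ls ./alpine-bundle/rootfs
```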
If the latter becomes a long-term thing (currently it's still experimental), it would be nice to use Python libraries rather than calling out to executables, if they work well enough.
I heard a rumor that one can pull an image from Docker Hub directly with REST calls, without needing Docker installed. Investigate.