moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0
8.19k stars 1.16k forks source link

[RFC] ADD/COPY scripting improvements #2917

Open DYefimov opened 2 years ago

DYefimov commented 2 years ago

related;

Abstract

Currently the following construct is impossible:

# syntax=docker/dockerfile:1.4.1
FROM scratch
ARG SOURCES="./src/file1 ./src/dir1 ./src/dir2"
COPY $SOURCES ./

Therefore severely limiting metaprogramming capabilities of buildkit and dockerfiles.

The most valuable application of such a construct is dockerfile target reusability, e.g. you can build lots of similar images that differ only by filesystem content without code duplication or any external tools like templating engines.

You can check the real world use case in more detail at #2893 where this RFC originates from.

To achieve such functionality without introducing any breaking change while retaining backward compatibility, two optional flags --split-sources and --preserve-top-dir for ADD and COPY instructions are proposed.

Implementation details

To make the proposed syntax possible, two separate changes have to be made.

--preserve-top-dir

Due to historical reasons ADD and COPY instructions behave differently from the well known nix cp utility when the sources list contains a directory. COPY ./src_dir ./ will copy only the contents of src_dir which is analogous to `$ cp -R ./src_dir/ ./`.

There is no direct way to copy the directory itself in buildkit/dockerfile right now (similarly to $ cp -R ./src_dir ./) . When you pair this issue with variable expansion (e.g. COPY $SOURCES ./) or when there is a need to copy multiple directories with one instruction while excluding some other sibling directories, the flexibility/transparency of the code is lost and many programming patterns rendered impossible.

This issue is known since 2015: moby/moby#15858

The following change is proposed: add --preserve-top-dir optional flag to ADD and COPY commands, to preserve the top level directory on copying, otherwise if flag is not set ADD/COPY will behave as usual. COPY ./src_dir ./ <-> $ cp -R ./src_dir/* ./ COPY --preserve-top-dir ./src_dir ./ <-> $ cp -R ./src_dir ./

--split-sources

Historically variable expansion in buildkit happens after word splitting in the dockerfile parsing logic. Hence it is impossible now to supply multiple sources to ADD/COPY via a single variable. In POSIX compliant shells the order of operations is reversed.

Swapping the order in buildkit would be harmful and dangerous. Therefore the following change is proposed: add --split-sources optional flag to ADD and COPY commands, that when set will split every element in sources array after variables are expanded the second time.

Similar approach is used by the env GNU coreutils utility. Historically in linux kernel users were able to pass only one optional argument via shebang. Starting from version 8.30 of env utility -S option is available:

process and split S into separate arguments; used to pass multiple arguments on shebang lines

POC

Compiled frontend @hub.docker.com: dyefimov/dockerfile:addcopy-scripting-improvements-labs Feature branch @github.com: DYefimov/buildkit/tree/addcopy-scripting-improvements

POC test script:

#!/usr/bin/env sh
set -e

TAG=addcopy_scripting_test
CONTAINER=$TAG
TEST_ROOT=./$TAG
TEST_SRC=$TEST_ROOT/src

rm -rf $TEST_ROOT
mkdir -p \
    $TEST_ROOT \
    $TEST_SRC \
    $TEST_SRC/subdir_to_copy \
    $TEST_SRC/subdir_to_copy_empty \
    $TEST_SRC/subdir_to_skip

touch \
    $TEST_SRC/file1 \
    $TEST_SRC/subdir_to_copy/file2 \
    $TEST_SRC/subdir_to_copy/file3 \
    $TEST_SRC/subdir_to_skip/file4 \
    $TEST_SRC/subdir_to_skip/file5

docker buildx build --load --no-cache -t $TAG -f- $TEST_ROOT << DOCKERFILE
# syntax=dyefimov/dockerfile:addcopy-scripting-improvements-labs

FROM scratch
WORKDIR /src

ARG SOURCES="./src/file1 ./src/subdir_to_copy ./src/subdir_to_copy_empty"

# fails with: failed to compute cache key: "/src/file1 ./src/subdir_to_copy ./src/subdir_to_copy_empty" not found: not found
# COPY \$SOURCES ./

COPY --split-sources --preserve-top-dir \$SOURCES ./

CMD ["sleep","3600"]
DOCKERFILE

docker container rm -f $CONTAINER
docker container create --name=$CONTAINER $TAG
docker container export $CONTAINER | tar tf - | grep src/ | sort

POC test script output:

$ ./proposal_poc.sh 
[+] Building 1.1s (10/10) FINISHED                                                                                                                                                    
 => [internal] load build definition from Dockerfile                                                                                                                             0.0s
 => => transferring dockerfile: 426B                                                                                                                                             0.0s
 => [internal] load .dockerignore                                                                                                                                                0.0s
 => => transferring context: 2B                                                                                                                                                  0.0s
 => resolve image config for docker.io/dyefimov/dockerfile:addcopy-scripting-improvements-labs                                                                                   0.0s
 => CACHED docker-image://docker.io/dyefimov/dockerfile:addcopy-scripting-improvements-labs                                                                                      0.0s
 => [internal] load build definition from Dockerfile                                                                                                                             0.0s
 => [internal] load .dockerignore                                                                                                                                                0.0s
 => [internal] load build context                                                                                                                                                0.0s
 => => transferring context: 227B                                                                                                                                                0.0s
 => CACHED [1/2] WORKDIR /src                                                                                                                                                    0.0s
 => [2/2] COPY --split-sources --preserve-top-dir ./src/file1 ./src/subdir_to_copy ./src/subdir_to_copy_empty ./                                                                 0.2s
 => exporting to image                                                                                                                                                           0.1s
 => => exporting layers                                                                                                                                                          0.1s
 => => writing image sha256:c0b6d9d84739a92c51c15a746a7122405050bfde604651f7dc648109c597ae5c                                                                                     0.0s
 => => naming to docker.io/library/addcopy_scripting_test                                                                                                                        0.0s
addcopy_scripting_test
196477307b4d1e9fd14ccf38c33f4f2082767086290846fd7f9ff86756343d33
src/
src/file1
src/subdir_to_copy/
src/subdir_to_copy_empty/
src/subdir_to_copy/file2
src/subdir_to_copy/file3
jedevc commented 2 years ago

Have you seen https://github.com/moby/moby/issues/35639 (there's also a longer discussion in https://github.com/moby/moby/issues/15858)? The issue proposes a --parents flag, similar to the cp --parents flag, which I think seems similar in functionality to your --preserve-top-dir? Personally, I prefer the name and the functionality of --parents, but tbf, I can't tell if there's a use case that it doesn't cover that --preserve-top-dir does? Regardless, I think a flag like this would be really useful :tada: CC @thaJeztah, we were talking about this yesterday :smile:

--split-sources - the surface problem I can see is that this doesn't actually handle filenames with spaces in them, and no matter what separator you pick, a file could potentially have those kind of characters in it. It feels like something that would really benefit if we had proper array support or something like that - allowing passing arrays as args, etc, and having some commands able to understand them - but that feels like a large syntax change to make, with some heavy implications across the parser.

thaJeztah commented 2 years ago

Yes, I think the --parents option could solve this; or at least for what I understand it's behavior would be.

Wouldn't hurt to read up on how a regular cp --parents works and if that's indeed what we're looking for of course

DYefimov commented 2 years ago

Using --preserve-top-dir and --split-sources myself for a couple of weeks in my real-world project, and indeed those capabilities proved to be very useful.

But, you are precisely correct - apparently, I'm lacking cp --parents like functionality to make my personal usecase feature-complete, and was considering of adding third --parents flag to this proposal.

The problem with the above/original proposal is that without --parents it's impossible to preserve directory structure (aka not "flatten"), which is required to selectively copy the whole directory tree (otherwise, "filtered directory tree") for the sake of adding more metaprogramming flexibility (and the main reason behind bundling two flags into this very proposal).

Side note - there is another method to consider - include/exclude dirs, e.g. you copy the root directory with --recursive flag while providing --include="*.go"

From what I can see now, --parents is a more general solution to a problem and will solve two issues with one shot.

Let me look into it - I'll try to swap --preserve-top-dir for --parents and run a testcase, to confirm --parents alone is sufficient, and we can drop --preserve-top-dir completely.

Minimalistic "flattening mode" tescase (btw, note how cp complains on conflicts in a flattening mode, while buildkit is silent):

Test script source code ```sh #!/usr/bin/env bash TAG=addcopy_scripting_test2 CONTAINER=$TAG TEST_ROOT=$(mktemp -d) cd $TEST_ROOT pwd mkdir -p ./src/sub{1,2} ./dst touch ./src/sub{1,2}/f{1,2} cat << DOCKERFILE > dockerfile # syntax=dyefimov/dockerfile:addcopy-scripting-improvements-labs FROM scratch WORKDIR /dst ARG SOURCES COPY --preserve-top-dir --split-sources \$SOURCES . CMD ["sleep","3600"] DOCKERFILE echo "Source tree:" find ./src | sort test() { echo "====================================" echo "Testing arg: $1" 2>/dev/null 1>&2 docker container rm -f $CONTAINER 2>/dev/null 1>&2 docker buildx build --load --no-cache -t $TAG --build-arg=SOURCES="$1" $TEST_ROOT 2>/dev/null 1>&2 docker container create --name=$CONTAINER $TAG echo "Container tree: " docker container export $CONTAINER | tar tf - | grep dst/ | sort 2>/dev/null 1>&2 docker container rm -f $CONTAINER 2>/dev/null 1>&2 docker image rm -f $TAG echo "cp tree: " rm -rf ./dst/* cp $1 ./dst/ find ./dst | sort echo "cp --parents tree: " rm -rf ./dst/* cp --parents $1 ./dst/ find ./dst | sort } test "src/sub1/f1" test "src/sub2/f1" test "src/sub*/f1 src/sub*/f2" ```

And test script output:

$ ./proposal_poc2.sh 
/tmp/tmp.lr18fzJ9vb
Source tree:
./src
./src/sub1
./src/sub1/f1
./src/sub1/f2
./src/sub2
./src/sub2/f1
./src/sub2/f2
====================================
Testing arg: src/sub1/f1
Container tree: 
dst/
dst/f1
cp tree: 
./dst
./dst/f1
cp --parents tree: 
./dst
./dst/src
./dst/src/sub1
./dst/src/sub1/f1
====================================
Testing arg: src/sub2/f1
Container tree: 
dst/
dst/f1
cp tree: 
./dst
./dst/f1
cp --parents tree: 
./dst
./dst/src
./dst/src/sub2
./dst/src/sub2/f1
====================================
Testing arg: src/sub*/f1 src/sub*/f2
Container tree: 
dst/
dst/f1
dst/f2
cp tree: 
cp: will not overwrite just-created './dst/f1' with 'src/sub2/f1'
cp: will not overwrite just-created './dst/f2' with 'src/sub2/f2'
./dst
./dst/f1
./dst/f2
cp --parents tree: 
./dst
./dst/src
./dst/src/sub1
./dst/src/sub1/f1
./dst/src/sub1/f2
./dst/src/sub2
./dst/src/sub2/f1
./dst/src/sub2/f2
DYefimov commented 2 years ago

Have you seen moby/moby#35639 (there's also a longer discussion in moby/moby#15858)?

Yes, I found around 5-7 issues, some of them like 5+ years old, all discussing the same or close functionality. Maybe we should consolidate all of them here (point to this very proposal via comment) to bring more opinions, and hopefully distill this into something meaningful and finally fix this annoyance?

--split-sources - the surface problem I can see is that this doesn't actually handle filenames with spaces in them, and no matter what separator you pick, a file could potentially have those kind of characters in it. It feels like something that would really benefit if we had proper array support or something like that - allowing passing arrays as args, etc, and having some commands able to understand them - but that feels like a large syntax change to make, with some heavy implications across the parser.

That sounds like a complex task and a whole new layer of complexity (source of bugs). Can't we keep it simple but yet flexible? What do you think if we just extend the proposed method with a proper quotation support? e.g. split by separator, while quoted paths not splitted but unquoted.

jedevc commented 2 years ago

Let me look into it - I'll try to swap --preserve-top-dir for --parents and run a testcase, to confirm --parents alone is sufficient, and we can drop --preserve-top-dir completely.

Excellent :heart:

there is another method to consider - include/exclude dirs

I'm sure this one has seen some discussion before as well, https://github.com/moby/buildkit/issues/2853, https://github.com/moby/moby/issues/15771, it just looks to me like the discussion has stalled around a lack of people wanting to/able to put in the time to put together a contribution.

Maybe we should consolidate all of them here

Could be worth making sure that similar issues that would be solved by similar solutions are linked. I think though it's probably worth discussing individual flags separately from each other, since I think they're mostly independent.

while quoted paths not splitted but unquoted.

This is exactly the thing that scares me :smile: Dockerfiles don't implement a fully sh-compatible quote parser, for some fairly good reasons - it gets quite tricky to handle the small details, and the existing structure of the parser makes it quite challenging to do well. Someone else might have some better thoughts than me here, I'm not as aware of the historical context for some of this stuff as others.

Slightly related - one thought that I did have - if we're adding more flags to more places (other candidates for this include RUN), we're starting to get to the point of having a single flag repeated over and over, already you can see this with --link. We can't change default behavior to reduce the noise, but could we potentially allow a directive/additional syntax to allow setting the default flags for a command? I can especially see this --link and --parents which if you were writing a Dockerfile from scratch, you'd probably just want to set those defaults yourself.

errordeveloper commented 2 years ago

I am just wondering whether this limitation is documented? Are there other cases where build arugment substitution is not working right now? I recall that initially usage was fairly limited, but I had an impression that it improved over time, but it's not like I can recall of concrete changes where substitutions became possible.

errordeveloper commented 2 years ago

The most valuable application of such a construct is dockerfile target reusability...

I don't disssagree with the proposal being useful, but I do believe that currently very similar idea can be implemented by specifying alternative build context, or pehrpaps (somewhat more hack-y) context preparation, which could involve generated .dockeringnore files... Alternatively, I also wonder if something like RUN --mount=$SOURCE is currently possible? Which could be another way to achieve the same goal, perhaps?

jedevc commented 2 years ago

I am just wondering whether this limitation is documented?

I don't think it's documented anywhere. We do argument substitution pretty much everywhere, the case where it's not working atm is in doing word splitting, like how POSIX sh does - i.e. turning x="123 456"; cp $x into cp 123 456 - we don't do anything like that at the moment, which I think is the core of what the --split-sources option would do. We don't advertise the current lack of the feature, as far as I'm aware. Hope I'm reading the proposal right, I think that should be correct.

specifying alternative build context, or pehrpaps (somewhat more hack-y) context preparation

Not quite sure what this means - I can't easily see how this would help with something like a --parents flag. I think we also probably want to avoid encouraging hacky pre-dockerfile steps as much as possible, ideally a build process should be as straightforward as possible.

`RUN --mount=$SOURCE

I remember some discussion on this internally. The summary as I remember it: it's a good work-around, but it has a couple issues - ExecOp caches much more poorly than the corresponding FileOp, and doesn't work with constructs like --link, and requires that you have cp and a normal sh installed to run the command.

DYefimov commented 2 years ago

I don't disssagree with the proposal being useful, but I do believe that currently very similar idea can be implemented by specifying alternative build context, or pehrpaps (somewhat more hack-y) context preparation, which could involve generated .dockeringnore files... Alternatively, I also wonder if something like RUN --mount=$SOURCE is currently possible? Which could be another way to achieve the same goal, perhaps?

In fact, that's exactly what I'm trying to avoid here. Of cause, everything in software engineering is a trade of, and there are many ways to achieve the same goal. I'm pushing here on "Linux philosophy", "Simplicity", and "Flexibility". You might be interested to look at #2893 where I posted a more detailed real-world use case that brought me to this proposal.

We definitely can do the above the other "hacky" way, but IMHO the change here is very cheep for the benefits we gain.

I don't think it's documented anywhere. We do argument substitution pretty much everywhere, the case where it's not working atm is in doing word splitting, like how POSIX sh does - i.e. turning x="123 456"; cp $x into cp 123 456 - we don't do anything like that at the moment, which I think is the core of what the --split-sources option would do. We don't advertise the current lack of the feature, as far as I'm aware. Hope I'm reading the proposal right, I think that should be correct.

The root cause is the reversed order of operations in buildkit compared to POSIX shell (POSIX way: parameter expansion, then word splitting). Swapping the order in buildkit seems to be a horrible idea. Proof: frontend/dockerfile/dockerfile2llb/convert.go

RUN --mount=$SOURCE

That's what I tried first - it's possible - but I was incapable of avoiding cache-misses for file changes not relevant for a particular RUN command. Tbh not sure if I explored all the capabilities of the RUN --mount. Some more details here: #2893

thaJeztah commented 2 years ago

Minimalistic "flattening mode" tescase (btw, note how cp complains on conflicts in a flattening mode, while buildkit is silent):

There's also a test-case/example (but in relation to permissions for parent directories which could be fixed with the --parents options) I wrote in; https://github.com/moby/moby/issues/38710#issuecomment-462734118

RUN --mount=$SOURCE That's what I tried first - it's possible - but I was incapable of avoiding cache-misses for file changes not relevant for a particular RUN command. Tbh not sure if I explored all the capabilities of the RUN --mount.

Yes, that's a tricky one indeed; when discussing some things internally, I also suggested someone to use RUN --mount ... cp <complex matching> instead of COPY (with complex matching rules), but then didn't consider that the mount would invalidate the RUN whenever any file in the build-context changes (unlike --mount=type=cache), so RUN --mount is not a good solution if the intent is to "only get some files from the build-context" (but ignore other files). I don't think that'd be easy (or even possible) to resolve in BuildKit, unless buildkit would understand what files the RUN needed in order to determine if the cache can be used.

... which is what led the internal discussion to the --parents proposal 😂

jedevc commented 2 years ago

@DYefimov sorry for the ping, was wondering if you'd managed to get any further with your PoC? If you've got a working --parents PoC, think it would be nice to discuss over in a PR?

DYefimov commented 2 years ago

@DYefimov sorry for the ping, was wondering if you'd managed to get any further with your PoC? If you've got a working --parents PoC, think it would be nice to discuss over in a PR?

@jedevc No worries. It appeared to be a more sophisticated change than others I've done so far for the buildkit, as it modifies not just the frontend alone. So I will definitely require your assistance down the road to polish it the proper way. Anyways, I got the barebone working. Please, let me play with it for several days more, and PR by the end of the week, as you propose.

DYefimov commented 2 years ago

Not sure why all of a sudden vscode decided to reset my sign-off all setting... sorry for this blot.

@jedevc @thaJeztah Here is the PR to entice further discussion. --parents supersedes --preserve-top-dir as expected. Unfortunately I can't reliably run the test suite, due to poor GSM uplink atm. I'm coming back on 15th and will provide the test suite for the PR then.

jedevc commented 2 years ago

No worries, thanks @DYefimov!

Will continue the conversation over there :smile: No worries about the tests, that's what CI is for :tada:

DYefimov commented 2 years ago

@tonistiigi @crazy-max @thaJeztah Can I kindly ping you and draw your attention back to this very issue. The PR stays ready for your reviews/suggestions/approval for three weeks now.

And what are your thoughts on the second proposition here, the --split-sources flag? This one is way less intrusive, but IMHO expands the flexibility of the Buildkit immensely. Can I fire second PR to cover it?

PAStheLoD commented 1 year ago

Hello,

One more thing that would be amazing, is to allow filtering out directories that get ADD/COPY-ed. (My use case is that I don't want to pollute the top level of a monorepo, and there's no support for nested .dockerignores.)

Basically something like COPY --ignore '**/node_modules' ./my-precious ./app.

Thanks for considering!

thaJeztah commented 1 year ago

@PAStheLoD related, but tracked in https://github.com/moby/buildkit/issues/2853