Open DYefimov opened 2 years ago
Have you seen https://github.com/moby/moby/issues/35639 (there's also a longer discussion in https://github.com/moby/moby/issues/15858)? The issue proposes a --parents
flag, similar to the cp --parents
flag, which I think seems similar in functionality to your --preserve-top-dir
? Personally, I prefer the name and the functionality of --parents
, but tbf, I can't tell if there's a use case that it doesn't cover that --preserve-top-dir
does? Regardless, I think a flag like this would be really useful :tada: CC @thaJeztah, we were talking about this yesterday :smile:
--split-sources
- the surface problem I can see is that this doesn't actually handle filenames with spaces in them, and no matter what separator you pick, a file could potentially have those kind of characters in it. It feels like something that would really benefit if we had proper array support or something like that - allowing passing arrays as args, etc, and having some commands able to understand them - but that feels like a large syntax change to make, with some heavy implications across the parser.
Yes, I think the --parents
option could solve this; or at least for what I understand it's behavior would be.
Wouldn't hurt to read up on how a regular cp --parents
works and if that's indeed what we're looking for of course
Using --preserve-top-dir
and --split-sources
myself for a couple of weeks in my real-world project, and indeed those capabilities proved to be very useful.
But, you are precisely correct - apparently, I'm lacking cp --parents
like functionality to make my personal usecase feature-complete, and was considering of adding third --parents
flag to this proposal.
The problem with the above/original proposal is that without --parents
it's impossible to preserve directory structure (aka not "flatten"), which is required to selectively copy the whole directory tree (otherwise, "filtered directory tree") for the sake of adding more metaprogramming flexibility (and the main reason behind bundling two flags into this very proposal).
Side note - there is another method to consider - include/exclude dirs, e.g. you copy the root directory with --recursive
flag while providing --include="*.go"
From what I can see now, --parents
is a more general solution to a problem and will solve two issues with one shot.
Let me look into it - I'll try to swap --preserve-top-dir
for --parents
and run a testcase, to confirm --parents
alone is sufficient, and we can drop --preserve-top-dir
completely.
Minimalistic "flattening mode" tescase (btw, note how cp complains on conflicts in a flattening mode, while buildkit is silent):
And test script output:
$ ./proposal_poc2.sh
/tmp/tmp.lr18fzJ9vb
Source tree:
./src
./src/sub1
./src/sub1/f1
./src/sub1/f2
./src/sub2
./src/sub2/f1
./src/sub2/f2
====================================
Testing arg: src/sub1/f1
Container tree:
dst/
dst/f1
cp tree:
./dst
./dst/f1
cp --parents tree:
./dst
./dst/src
./dst/src/sub1
./dst/src/sub1/f1
====================================
Testing arg: src/sub2/f1
Container tree:
dst/
dst/f1
cp tree:
./dst
./dst/f1
cp --parents tree:
./dst
./dst/src
./dst/src/sub2
./dst/src/sub2/f1
====================================
Testing arg: src/sub*/f1 src/sub*/f2
Container tree:
dst/
dst/f1
dst/f2
cp tree:
cp: will not overwrite just-created './dst/f1' with 'src/sub2/f1'
cp: will not overwrite just-created './dst/f2' with 'src/sub2/f2'
./dst
./dst/f1
./dst/f2
cp --parents tree:
./dst
./dst/src
./dst/src/sub1
./dst/src/sub1/f1
./dst/src/sub1/f2
./dst/src/sub2
./dst/src/sub2/f1
./dst/src/sub2/f2
Have you seen moby/moby#35639 (there's also a longer discussion in moby/moby#15858)?
Yes, I found around 5-7 issues, some of them like 5+ years old, all discussing the same or close functionality. Maybe we should consolidate all of them here (point to this very proposal via comment) to bring more opinions, and hopefully distill this into something meaningful and finally fix this annoyance?
--split-sources
- the surface problem I can see is that this doesn't actually handle filenames with spaces in them, and no matter what separator you pick, a file could potentially have those kind of characters in it. It feels like something that would really benefit if we had proper array support or something like that - allowing passing arrays as args, etc, and having some commands able to understand them - but that feels like a large syntax change to make, with some heavy implications across the parser.
That sounds like a complex task and a whole new layer of complexity (source of bugs). Can't we keep it simple but yet flexible? What do you think if we just extend the proposed method with a proper quotation support? e.g. split by separator, while quoted paths not splitted but unquoted.
Let me look into it - I'll try to swap --preserve-top-dir for --parents and run a testcase, to confirm --parents alone is sufficient, and we can drop --preserve-top-dir completely.
Excellent :heart:
there is another method to consider - include/exclude dirs
I'm sure this one has seen some discussion before as well, https://github.com/moby/buildkit/issues/2853, https://github.com/moby/moby/issues/15771, it just looks to me like the discussion has stalled around a lack of people wanting to/able to put in the time to put together a contribution.
Maybe we should consolidate all of them here
Could be worth making sure that similar issues that would be solved by similar solutions are linked. I think though it's probably worth discussing individual flags separately from each other, since I think they're mostly independent.
while quoted paths not splitted but unquoted.
This is exactly the thing that scares me :smile: Dockerfiles don't implement a fully sh-compatible quote parser, for some fairly good reasons - it gets quite tricky to handle the small details, and the existing structure of the parser makes it quite challenging to do well. Someone else might have some better thoughts than me here, I'm not as aware of the historical context for some of this stuff as others.
Slightly related - one thought that I did have - if we're adding more flags to more places (other candidates for this include RUN
), we're starting to get to the point of having a single flag repeated over and over, already you can see this with --link
. We can't change default behavior to reduce the noise, but could we potentially allow a directive/additional syntax to allow setting the default flags for a command? I can especially see this --link
and --parents
which if you were writing a Dockerfile from scratch, you'd probably just want to set those defaults yourself.
I am just wondering whether this limitation is documented? Are there other cases where build arugment substitution is not working right now? I recall that initially usage was fairly limited, but I had an impression that it improved over time, but it's not like I can recall of concrete changes where substitutions became possible.
The most valuable application of such a construct is dockerfile target reusability...
I don't disssagree with the proposal being useful, but I do believe that currently very similar idea can be implemented by specifying alternative build context, or pehrpaps (somewhat more hack-y) context preparation, which could involve generated .dockeringnore
files...
Alternatively, I also wonder if something like RUN --mount=$SOURCE
is currently possible? Which could be another way to achieve the same goal, perhaps?
I am just wondering whether this limitation is documented?
I don't think it's documented anywhere. We do argument substitution pretty much everywhere, the case where it's not working atm is in doing word splitting, like how POSIX sh does - i.e. turning x="123 456"; cp $x
into cp 123 456
- we don't do anything like that at the moment, which I think is the core of what the --split-sources
option would do. We don't advertise the current lack of the feature, as far as I'm aware. Hope I'm reading the proposal right, I think that should be correct.
specifying alternative build context, or pehrpaps (somewhat more hack-y) context preparation
Not quite sure what this means - I can't easily see how this would help with something like a --parents
flag. I think we also probably want to avoid encouraging hacky pre-dockerfile steps as much as possible, ideally a build process should be as straightforward as possible.
`RUN --mount=$SOURCE
I remember some discussion on this internally. The summary as I remember it: it's a good work-around, but it has a couple issues - ExecOp caches much more poorly than the corresponding FileOp, and doesn't work with constructs like --link
, and requires that you have cp
and a normal sh
installed to run the command.
I don't disssagree with the proposal being useful, but I do believe that currently very similar idea can be implemented by specifying alternative build context, or pehrpaps (somewhat more hack-y) context preparation, which could involve generated
.dockeringnore
files... Alternatively, I also wonder if something likeRUN --mount=$SOURCE
is currently possible? Which could be another way to achieve the same goal, perhaps?
In fact, that's exactly what I'm trying to avoid here. Of cause, everything in software engineering is a trade of, and there are many ways to achieve the same goal. I'm pushing here on "Linux philosophy", "Simplicity", and "Flexibility". You might be interested to look at #2893 where I posted a more detailed real-world use case that brought me to this proposal.
We definitely can do the above the other "hacky" way, but IMHO the change here is very cheep for the benefits we gain.
I don't think it's documented anywhere. We do argument substitution pretty much everywhere, the case where it's not working atm is in doing word splitting, like how POSIX sh does - i.e. turning x="123 456"; cp $x into cp 123 456 - we don't do anything like that at the moment, which I think is the core of what the --split-sources option would do. We don't advertise the current lack of the feature, as far as I'm aware. Hope I'm reading the proposal right, I think that should be correct.
The root cause is the reversed order of operations in buildkit compared to POSIX shell (POSIX way: parameter expansion, then word splitting). Swapping the order in buildkit seems to be a horrible idea. Proof: frontend/dockerfile/dockerfile2llb/convert.go
RUN --mount=$SOURCE
That's what I tried first - it's possible - but I was incapable of avoiding cache-misses for file changes not relevant for a particular RUN command. Tbh not sure if I explored all the capabilities of the RUN --mount. Some more details here: #2893
Minimalistic "flattening mode" tescase (btw, note how cp complains on conflicts in a flattening mode, while buildkit is silent):
There's also a test-case/example (but in relation to permissions for parent directories which could be fixed with the --parents
options) I wrote in; https://github.com/moby/moby/issues/38710#issuecomment-462734118
RUN --mount=$SOURCE That's what I tried first - it's possible - but I was incapable of avoiding cache-misses for file changes not relevant for a particular RUN command. Tbh not sure if I explored all the capabilities of the RUN --mount.
Yes, that's a tricky one indeed; when discussing some things internally, I also suggested someone to use RUN --mount ... cp <complex matching>
instead of COPY
(with complex matching rules), but then didn't consider that the mount would invalidate the RUN
whenever any file in the build-context changes (unlike --mount=type=cache
), so RUN --mount
is not a good solution if the intent is to "only get some files from the build-context" (but ignore other files). I don't think that'd be easy (or even possible) to resolve in BuildKit, unless buildkit would understand what files the RUN
needed in order to determine if the cache can be used.
... which is what led the internal discussion to the --parents
proposal 😂
@DYefimov sorry for the ping, was wondering if you'd managed to get any further with your PoC? If you've got a working --parents
PoC, think it would be nice to discuss over in a PR?
@DYefimov sorry for the ping, was wondering if you'd managed to get any further with your PoC? If you've got a working
--parents
PoC, think it would be nice to discuss over in a PR?
@jedevc No worries. It appeared to be a more sophisticated change than others I've done so far for the buildkit, as it modifies not just the frontend alone. So I will definitely require your assistance down the road to polish it the proper way. Anyways, I got the barebone working. Please, let me play with it for several days more, and PR by the end of the week, as you propose.
Not sure why all of a sudden vscode decided to reset my sign-off all
setting... sorry for this blot.
@jedevc @thaJeztah
Here is the PR to entice further discussion.
--parents
supersedes --preserve-top-dir
as expected.
Unfortunately I can't reliably run the test suite, due to poor GSM uplink atm. I'm coming back on 15th and will provide the test suite for the PR then.
No worries, thanks @DYefimov!
Will continue the conversation over there :smile: No worries about the tests, that's what CI is for :tada:
@tonistiigi @crazy-max @thaJeztah Can I kindly ping you and draw your attention back to this very issue. The PR stays ready for your reviews/suggestions/approval for three weeks now.
And what are your thoughts on the second proposition here, the --split-sources
flag?
This one is way less intrusive, but IMHO expands the flexibility of the Buildkit immensely.
Can I fire second PR to cover it?
Hello,
One more thing that would be amazing, is to allow filtering out directories that get ADD/COPY-ed. (My use case is that I don't want to pollute the top level of a monorepo, and there's no support for nested .dockerignores.)
Basically something like COPY --ignore '**/node_modules' ./my-precious ./app
.
Thanks for considering!
@PAStheLoD related, but tracked in https://github.com/moby/buildkit/issues/2853
related;
Abstract
Currently the following construct is impossible:
Therefore severely limiting metaprogramming capabilities of buildkit and dockerfiles.
The most valuable application of such a construct is dockerfile target reusability, e.g. you can build lots of similar images that differ only by filesystem content without code duplication or any external tools like templating engines.
You can check the real world use case in more detail at #2893 where this RFC originates from.
To achieve such functionality without introducing any breaking change while retaining backward compatibility, two optional flags
--split-sources
and--preserve-top-dir
forADD
andCOPY
instructions are proposed.Implementation details
To make the proposed syntax possible, two separate changes have to be made.
--preserve-top-dir
Due to historical reasons
ADD
andCOPY
instructions behave differently from the well known nixcp
utility when the sources list contains a directory.COPY ./src_dir ./
will copy only the contents ofsrc_dir
which is analogous to `$ cp -R ./src_dir/ ./`.There is no direct way to copy the directory itself in buildkit/dockerfile right now (similarly to
$ cp -R ./src_dir ./
) . When you pair this issue with variable expansion (e.g.COPY $SOURCES ./
) or when there is a need to copy multiple directories with one instruction while excluding some other sibling directories, the flexibility/transparency of the code is lost and many programming patterns rendered impossible.This issue is known since 2015: moby/moby#15858
The following change is proposed: add
--preserve-top-dir
optional flag toADD
andCOPY
commands, to preserve the top level directory on copying, otherwise if flag is not setADD
/COPY
will behave as usual.COPY ./src_dir ./
<->$ cp -R ./src_dir/* ./
COPY --preserve-top-dir ./src_dir ./
<->$ cp -R ./src_dir ./
--split-sources
Historically variable expansion in buildkit happens after word splitting in the dockerfile parsing logic. Hence it is impossible now to supply multiple sources to
ADD
/COPY
via a single variable. In POSIX compliant shells the order of operations is reversed.Swapping the order in buildkit would be harmful and dangerous. Therefore the following change is proposed: add
--split-sources
optional flag toADD
andCOPY
commands, that when set will split every element in sources array after variables are expanded the second time.Similar approach is used by the
env
GNU coreutils utility. Historically in linux kernel users were able to pass only one optional argument via shebang. Starting from version 8.30 ofenv
utility-S
option is available:POC
Compiled frontend @hub.docker.com: dyefimov/dockerfile:addcopy-scripting-improvements-labs Feature branch @github.com: DYefimov/buildkit/tree/addcopy-scripting-improvements
POC test script:
POC test script output: