moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

COPY with excluded files is not possible #15771

Open bronger opened 9 years ago

bronger commented 9 years ago

I need to COPY a part of a context directory to the container (the other part is subject to another COPY). Unfortunately, the current possibilities for this are suboptimal:

  1. COPY and prune. I could remove the unwanted material after an unlimited COPY. The problem is that the unwanted material may have changed, so the cache is invalidated.
  2. COPY every file in a COPY instruction of its own. This adds a lot of unnecessary layers to the image.
  3. Writing a wrapper around the "docker build" call that prepares the context in some way so that the Dockerfile can comfortably copy the wanted material. Cumbersome and difficult to maintain.
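
The third option could look roughly like the sketch below. It is only an illustration, assuming a POSIX shell and a tar that understands --exclude (GNU tar and bsdtar do); the helper name stage_context and its arguments are hypothetical and not something Docker provides.

```shell
# stage_context SRC DST EXCLUDED
# Copy SRC into DST, leaving out the top-level entry EXCLUDED, so that a
# plain "COPY . /some/where/" in the Dockerfile only sees the wanted material.
# (Hypothetical helper; not part of Docker.)
stage_context() {
    src=$1 dst=$2 excluded=$3
    mkdir -p "$dst" || return 1
    # Stream the tree through tar so that permissions and mtimes are kept.
    tar -cf - -C "$src" --exclude="./$excluded" . | tar -xf - -C "$dst"
}
```

A build would then run something like `stage_context . /tmp/ctx B && docker build /tmp/ctx`. The drawback named above remains: the wrapper has to be maintained alongside the Dockerfile.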

cpuguy83 commented 9 years ago

See https://docs.docker.com/reference/builder/#dockerignore-file You can add entries to a .dockerignore file in the root of the project.

bronger commented 9 years ago

.dockerignore does not solve this issue. As I wrote, "the other part is subject to another COPY".

cpuguy83 commented 9 years ago

So you want to conditionally copy based on some other copy?

bronger commented 9 years ago

The context contains a lot of directories A1...A10 and a directory B. A1...A10 have one destination, B has another:

COPY A1 /some/where/A1/
COPY A2 /some/where/A2/
...
COPY A10 /some/where/A10/
COPY B /some/where/else/B/

And this is awkward.

cpuguy83 commented 9 years ago

What part of it is awkward? Listing them all individually?

COPY A* /some/where/
COPY B /some/where/else/

Does this work?

bronger commented 9 years ago

The names A1..A10, B were fake. Besides, COPY A* ... throws together the contents of the directories.

There are a couple of options, I admit, but I think that all of them are awkward. I mentioned three in my original posting. A fourth option is to rearrange my source code permanently so that A1..A10 are moved into a new directory A. I was hoping that this would not be necessary, because an additional nesting level is not something to wish for, and my current tools would then need to special-case my dockerised projects.

(BTW, #6094 (following symlinks) would help in this case. But apparently, this is no option either.)

cpuguy83 commented 9 years ago

@bronger if COPY behaved exactly like cp, would that solve your use-case?

I'm not sure I 100% understand. Maybe @duglin can have a look.

duglin commented 9 years ago

@bronger I think @cpuguy83 asked the right question, how would you solve this if you were using 'cp' ? I looked and didn't notice some kind of excludes option on 'cp' so I'm not sure how you would solve this outside of a 'docker build' either.

bronger commented 9 years ago

With cp behaviour, I could ameliorate the situation by saying

COPY ["A1", ... "A10", "/some/where/"]

It's still a mild maintenance problem because I would have to think of that line if I added an "A11" directory. But that would be acceptable.

Besides, cp does not need excludes, because copying everything and removing the unwanted parts has almost no performance impact beyond the copying itself. With docker's COPY, it means wrongly invalidated cache every time B is changed, and bigger images.

duglin commented 9 years ago

@bronger you can do:

COPY a b c d /some/where

just like you were suggesting.

As for doing a RUN rm ... after the COPY ..., yes you'll have one extra layer, but you still should be able to use the cache. If you see a cache miss due to it let me know, I don't think you should.

bronger commented 9 years ago

But

COPY a b c d /some/where/

copies the contents of the directories a b c d together, instead of creating the directories /some/where/{a,b,c,d}. It works like rsync with a slash appended to the src directory. Therefore, the four instructions

COPY a /some/where/a/
COPY b /some/where/b/
COPY c /some/where/c/
COPY d /some/where/d/

are needed.
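
The difference can be reproduced outside Docker. The sketch below is an emulation with plain cp, not Docker code: copy_like_cp keeps each source directory as a subdirectory of the destination, while copy_like_docker merges the directories' contents, which is the COPY behaviour described here. Both function names are made up for this illustration.

```shell
# copy_like_cp DEST SRC...
# Result: DEST/a, DEST/b, ... (what "cp -R a b DEST/" does).
copy_like_cp() {
    dest=$1; shift
    mkdir -p "$dest" && cp -R "$@" "$dest"/
}

# copy_like_docker DEST SRC...
# Result: the contents of every SRC merged directly into DEST, emulating
# "COPY a b /some/where/" as described in this thread.
copy_like_docker() {
    dest=$1; shift
    mkdir -p "$dest" || return 1
    for src in "$@"; do
        # "SRC/." copies what is inside SRC, not SRC itself.
        cp -R "$src"/. "$dest"/ || return 1
    done
}
```

With directories a and b containing x and y respectively, the first helper yields out/a/x and out/b/y, the second yields out/x and out/y.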

As for the cache ... if I say

COPY . /some/where/
RUN rm -Rf /some/where/e

then the cache is not used if e changes, although e is effectively not part of the result.
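
A rough model of why the cache misses: Docker's documentation describes the cache lookup for ADD and COPY as based on checksums of the copied files' contents. The sketch below is not Docker's algorithm, only an imitation using cksum; context_digest is a hypothetical helper and the optional second argument stands in for an exclude feature that COPY does not have.

```shell
# context_digest DIR [EXCLUDED]
# Print one checksum over every regular file below DIR, optionally ignoring
# the top-level directory EXCLUDED. Hypothetical helper that imitates (it
# does not reproduce) a content-based cache key.
context_digest() {
    dir=$1 excluded=${2:-}
    ( cd "$dir" &&
      find . -type f ! -path "./$excluded/*" | LC_ALL=C sort |
      while IFS= read -r f; do cksum "$f"; done | cksum )
}
```

With `COPY . /some/where/`, the whole context plays the role of `context_digest .`, so a change inside e alters it even though e is removed in the next step. An exclude option would correspond to `context_digest . e`, which stays the same when only e changes.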

duglin commented 9 years ago

@bronger yep, sadly you're correct. I guess we could add a --exclude zzz type of flag, but per https://github.com/docker/docker/blob/master/ROADMAP.md#22-dockerfile-syntax it may not get a lot of traction right now.

bronger commented 9 years ago

Fair enough. Then I will use a COPY+rm for the time being and add a FixMe comment. Thank you for your time!

pwaller commented 9 years ago

Just to :+1: this issue. I regularly regret that COPY doesn't mirror rsync's trailing slash semantics. It means you can't COPY multiple directories in a single statement, leading to layer proliferation.

I regularly encounter a case where I want to copy many directories except for one (which will be copied later, because I want it to have different layer-invalidation effects), so --exclude would be useful, as well.

Also, from man rsync:

       A trailing slash on the source changes this behavior to avoid  creating
       an  additional  directory level at the destination.  You can think of a
       trailing / on a source as meaning "copy the contents of this directory"
       as  opposed  to  "copy  the  directory  by name", but in both cases the
       attributes of the containing directory are transferred to the containing
       directory on the destination.  In other words, each of the following
       commands copies the files in the same way, including their setting
       of the attributes of /dest/foo:

              rsync -av /src/foo /dest
              rsync -av /src/foo/ /dest/foo

I guess it can't be changed now without breaking a lot of wild Dockerfiles.

pwaller commented 9 years ago

As a concrete example, let's say I have a directory looking like this:

/vendor
/part1
/part2
/part3
/...
/partN

I want something that looks like:

COPY /vendor /docker/vendor
RUN /vendor/build
COPY /part1 /part2 ... /partN /docker/ # copy directories part1-N to /docker/part{1..N}/
RUN /docker/build1-N.sh

So that part1-N doesn't invalidate building of /vendor. (since /vendor is rarely updated compared to part1-N).

I have previously worked around this by putting part1-N in their own directory, so:

/vendor
/src/part1-N

But I have also encountered this problem in projects that I am not at liberty to rearrange quite so easily.

antoineco commented 8 years ago

@pwaller good example, we're facing the exact same issue. The main problem is that Go's filepath.Match doesn't allow much creativity compared to regular expressions (i.e. there is no negation pattern).

jason-kane commented 8 years ago

I just came up with a somewhat crack-brained workaround for this. COPY can't exclude directories, but ADD can expand tgz.

It's one extra build step:

tar --exclude='./deferred_copy' -czf all_but_deferred.tgz .
docker build ...

Then in your Dockerfile:

ADD ./all_but_deferred.tgz /application_dir/
.. stuff in the rarely changing layers ..
ADD . /application_dir/
.. stuff in the often changing layers

That gives the full syntax of tar for including/excluding/whatever without gobs of wasted layers trying to include/exclude.

mikeknep commented 8 years ago

@jason-kane This is a nice trick, thanks for sharing. One small point: it looks like you can't add the z (gzip) flag to tar, because it changes the sha256 checksum value, which invalidates the Docker cache. Otherwise this approach works great for me.
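
Combining the tar workaround with the remark about gzip gives roughly the following pre-build step. It is a sketch under assumptions: make_stable_tar and the file name all_but_deferred.tar are placeholders, and the only thing it demonstrates is that an uncompressed archive of unchanged files should come out identical on repeated runs. How Docker derives its cache key for ADD is not shown here.

```shell
# make_stable_tar CONTEXT OUT EXCLUDED
# Write CONTEXT/OUT as an uncompressed tar of CONTEXT, skipping the top-level
# entry EXCLUDED and the archive itself. No -z: per the comment above, the
# gzip-compressed archive changed its checksum between builds.
make_stable_tar() {
    context=$1 out=$2 excluded=$3
    ( cd "$context" &&
      tar -cf "$out" --exclude="./$excluded" --exclude="./$out" . )
}
```

For the layout from the workaround this would be `make_stable_tar . all_but_deferred.tar deferred_copy`, followed by `docker build` and an ADD of the .tar file instead of the .tgz.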

matthewmueller commented 8 years ago

+1 for this issue, I think it could be supported in the same way a lot of glob libraries support it:

Here's a proposal to copy everything except node_modules

COPY . /app -node_modules/

duypm commented 8 years ago

I've come across the same problem as well, and it's kind of painful for me: my Java webapp is about 900MB, but almost 80% of that rarely changes. My application is at an early stage and the folder structure is somewhat stable, so I don't mind adding 6-7 COPY layers to be able to use the cache, but it will surely hurt in the long term as more and more files and directories are added.

jfroffice commented 8 years ago

👍

kkozmic-seek commented 8 years ago

I have the same problem although with docker cp, I want to copy all files from a folder except for one

oaxlin commented 8 years ago

Exact same issue here. I want to copy a git repo and exclude the .git directory.

antoineco commented 8 years ago

@oaxlin you could use the .dockerignore file for that.

kkozmic-seek commented 8 years ago

@antoineco are you sure that will work? It's been a while since I tried but I'm pretty sure .dockerignore didn't work with docker cp, at least at the time

antoineco commented 8 years ago

@kkozmic-seek absolutely sure :) But the docker cp CLI subcommand you mentioned is different from the COPY statement found in the Dockerfile, which is the scope of this issue.

docker cp indeed has nothing to do with the Dockerfile and .dockerignore, but on the other hand it's not used for building images.

maresja1 commented 8 years ago

Would really like this as well - to speed up build I could copy some folder in earlier parts of the build and then cache would help me out ...

olalonde commented 7 years ago

I'm not sure I understand what the use case is but wouldn't just touching the files to exclude before COPY solve the problem?

RUN touch /app/node_modules
COPY . /app
RUN rm /app/node_modules

AFAIK COPY doesn't overwrite files, which is why I think this might work.

olalonde commented 7 years ago

Oops, never mind that, looks like COPY actually overwrites files. I'm now a bit puzzled by https://nodejs.org/en/docs/guides/nodejs-docker-webapp/ which npm installs and then does a COPY . /usr/src/app. I guess it assumes that node_modules is docker ignored? On the other hand, having a COPY_NO_OVERWRITE (better name needed) command could be one way to achieve ignoring files during copy (you'd have to create empty files/dirs for stuff you want to ignore).

bronger commented 7 years ago

FWIW, I find this very ugly.

adresdvila commented 7 years ago

I found another hack solution:

Example project structure:

app/
config/
script/
spec/
static/
...

We want:

  1. Copy static/
  2. Copy other files
  3. Copy app/

Hack solution:

ADD ./static /home/app
ADD ["./[^s^a]*", "./s[^t]*", "/home/app/"]
ADD ./app /home/app

The second ADD is equivalent to: copy everything except "./st*" and "./a*". Any ideas for improvements?
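
To see which top-level names the two patterns of the second ADD select, the check can be reproduced in a shell. This is an approximation: Dockerfile patterns are matched with Go's filepath.Match, where [^...] negates a character class, whereas POSIX shell patterns spell the negation [!...]. The function name and the sample names are only illustrative.

```shell
# second_add_selects NAME
# Succeeds if NAME matches either pattern of the second ADD, "[^s^a]*" or
# "s[^t]*", rewritten here with the POSIX negation character "!".
second_add_selects() {
    case $1 in
        [!s^a]*|s[!t]*) return 0 ;;
        *) return 1 ;;
    esac
}

for name in app config script spec static; do
    if second_add_selects "$name"; then
        echo "copied:  $name"
    else
        echo "skipped: $name"
    fi
done
```

This reports config, script and spec as copied and app and static as skipped. The patterns also skip every other name that begins with "a" or "st" (a hypothetical assets/ directory, say), which is a limitation of the hack.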

brunocascio commented 7 years ago

What is the status of this issue?

loretoparisi commented 7 years ago

👍

navgarcha commented 7 years ago

👍

jhagege commented 7 years ago

👍

broilogabriel commented 7 years ago

👍

mirestrepo commented 7 years ago

What about having a .dockerignore file in the same fashion as .gitignore?

bronger commented 7 years ago

@mirestrepo See the first two follow-ups to this issue.

joelharkes commented 7 years ago

Currently this is a mega perf nerf for C# / dotnet development.

What I want:

Now it seems this is not (easily) possible because I cannot copy "everything except".

So either DLLs are copied twice, which increases the Docker image size, or everything is copied in one layer. The latter is a mega nerf because external DLLs are copied every time instead of being cached.

@adresdvila thanks for the solution, I was able to split it up into:

COPY ["[^M][^y]*","/app/"] 
COPY ./My* /app/

Although this still leaves the problem that .json files are copied by the first command.

oaxlin commented 7 years ago

Just chiming in to say thanks to @antoineco my problem is solved. I no longer copy the .git directory into my docker images.

This dramatically improved the image size, and makes my image much more friendly to the docker caching system.

engrut commented 7 years ago

I have the same problem. I have a big file which I want to copy before the rest of files so any change in the context does not repeat it as it takes a lot of time to copy (7 GB bin file). Are there any new workarounds?

mitar commented 7 years ago

The issue with the COPY and prune approach is that the layer created before pruning still contains all the data.

Nowaker commented 6 years ago

COPY . --exclude=a --exclude=b would be extremely useful. What do you think, @cpuguy83?

cpuguy83 commented 6 years ago

@Nowaker I like it. Seems in line with tar and rsync anyway. I guess this should support the same format as dockerignore?

@tonistiigi @dnephin

dnephin commented 6 years ago

This case would be handled by #32507 I think.

Nowaker commented 6 years ago

@cpuguy83 Yeah. Most notably, in line with COPY --chown=uid:gid

@dnephin RUN --mount sounds like a totally different use case, centered around generating something based on data we don't need after the output has been generated. (E.g. compiling with Go, generating HTMLs from Markdown file, etc). RUN --mount is dope and I'd definitely use it in the project I'm currently working on (generating API docs using Sphinx).

COPY somedir --exclude=excludeddir1 --exclude=excludeddir2 is centered around copying data that has to end up in the image, but spread across multiple COPY statements instead of just one. The goal is to avoid an explicit COPY first second third .... eleventh destination/ when a project has a lot of directories in its root and that list is subject to change or growth.

In my case, I want to first copy most of the files, leaving out the non-essential ones, so that the cache is used if the source files didn't change. Then compile/generate, again using the cache if the copied files didn't change (yay). At the very end, copy the files I excluded previously, which might have changed since the previous build but whose changes don't affect the compile/generate step. Obviously, I have a ton of files and directories in . that I want to COPY first, and only a couple that I want to COPY somewhere at the end.

dnephin commented 6 years ago

The idea is that RUN --mount is able to solve a lot of problems. COPY --exclude solves only a single problem.

I'd rather add something that solves a lot of problems than add a bunch of syntax to solve individual problems. You would use RUN --mount... rsync --exclude ... (or some script that copies individual things) and it would be the equivalent to COPY --exclude.
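
As a sketch of what that could look like: the helper below writes a Dockerfile that uses the RUN --mount=type=bind form as it later became available in the BuildKit Dockerfile frontend (at this point in the thread it is only a proposal). The helper name, the base image and the paths /ctx and /app are assumptions for illustration; the two excluded directory names are taken from the comment above.

```shell
# write_mount_dockerfile PATH
# Emit a Dockerfile that bind-mounts the build context for a single RUN step
# and copies it with rsync, leaving out two directories.
# Assumes BuildKit; /ctx, /app and the base image are placeholders.
write_mount_dockerfile() {
    cat > "$1" <<'EOF'
# syntax=docker/dockerfile:1
FROM alpine
RUN apk add --no-cache rsync
RUN --mount=type=bind,source=.,target=/ctx \
    rsync -a --exclude=excludeddir1 --exclude=excludeddir2 /ctx/ /app/
EOF
}
```

A build could then be started with `write_mount_dockerfile Dockerfile && DOCKER_BUILDKIT=1 docker build .`. The excluded directories would be brought in by a later step, so that changes to them only affect the layers after it.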

Nowaker commented 6 years ago

@dnephin Oh, I didn't think of RUN --mount rsync! Excellent! 👍

antoineco commented 6 years ago

That's excellent indeed. However you won't be able to leverage caching efficiently @Nowaker, because the cache will be invalidated if anything changes in the mounted directory, not only what you want to rsync.

tonistiigi commented 6 years ago

If you use the output of that rsync as an input for something else and no files actually changed in there, the cache will pick up again. If you are really up for it, you can do this currently with something like https://gist.github.com/tonistiigi/38ead7a4ed60565996d207a7d589d9c4#file-gistfile1-txt-L130-L140 . The only change with RUN --mount (or LLB in buildkit) is that you don't have to actually copy files between stages but can access them directly, so it is much faster.