pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.31k stars 635 forks source link

`./pants package` Rebuilding Issue #16101

Closed kwrobert closed 2 years ago

kwrobert commented 2 years ago

Describe the bug

When rebuilding a child docker_image target that depends on an upstream parent image as a base, running ./pants package path/to:child_target does not appear to correctly rebuild the parent image before building the child image.

Pants version

$ ./pants --version
13:05:07.85 [INFO] Initializing scheduler...
13:05:08.16 [INFO] Scheduler initialized.
2.11.0

OS

Currently only on Mac OS X, haven't tested on Linux yet

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.7
BuildVersion:   19H1922

Additional info

Here is the process I went through. First, here is the directory layout from the top of the repo:

$ tree infra/docker
infra/docker
├── build_runner
│   ├── BUILD
│   └── Dockerfile
├── build_runner_child
│   ├── BUILD
│   └── Dockerfile

And the contents of the BUILD files and Dockerfiles:

$ cat infra/docker/build_runner/BUILD
docker_image(
    name="build_runner",
    image_tags=["build_runner_pants"],
)
$ cat infra/docker/build_runner/Dockerfile
FROM cimg/python:3.8

# hadolint ignore=DL3008,DL3004
RUN sudo apt-get update -y \
    && sudo apt-get install  --no-install-recommends -y python3-dev python3-distutils \
    amazon-ecr-credential-helper \
    awscli \
    build-essential \
    libssl-dev \
    libffi-dev \
    && sudo apt-get clean \
    && sudo rm -rf /var/lib/apt/lists/*

# hadolint ignore=DL3013
RUN pip install --no-cache-dir credstash
$ cat infra/docker/build_runner_child/BUILD
docker_image(
    name="build_runner_child",
    image_tags=["build_runner_child"]
)
$ cat infra/docker/build_runner_child/Dockerfile
ARG BASE=infra/docker/build_runner:build_runner
FROM $BASE

RUN echo "testing" >> test_file.txt

If I tried to package the child, it correctly builds the parent first:

$ ./pants package infra/docker/build_runner_child:build_runner_child
13:13:36.34 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_pants
13:13:36.82 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
13:13:36.82 [INFO] Built docker image: 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
Docker image ID: sha256:7a88256e7eda29455dfe4338c279c1b4f6d177ce63c7268d244bfeecfbcc848c
./pants package infra/docker/build_runner_child:build_runner_child  0.87s user 0.58s system 16% cpu 8.893 total
$ docker run -it 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child /bin/bash
circleci@453c208aee5d:~/project$ ls
test_file.txt
circleci@453c208aee5d:~/project$

But if I make a modification to the Dockerfile of the parent by introducing a new layer after this first build via this modification:

$ cat infra/docker/build_runner/Dockerfile
FROM cimg/python:3.8

# hadolint ignore=DL3008,DL3004
RUN sudo apt-get update -y \
    && sudo apt-get install  --no-install-recommends -y python3-dev python3-distutils \
    amazon-ecr-credential-helper \
    awscli \
    build-essential \
    libssl-dev \
    libffi-dev \
    && sudo apt-get clean \
    && sudo rm -rf /var/lib/apt/lists/*

# hadolint ignore=DL3013
RUN pip install --no-cache-dir credstash

RUN echo "testing" >> test_file_parent.txt # here is the new layer

And then rebuild the child image, the file added in the parent image doesn't appear in the child image:

$ ./pants package infra/docker/build_runner_child:build_runner_child
13:17:29.25 [INFO] Canceled: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_pants
13:17:30.67 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
13:17:30.68 [INFO] Canceled: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_pants
13:17:33.68 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_pants
13:17:33.68 [INFO] Built docker image: 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
Docker image ID: sha256:7a88256e7eda29455dfe4338c279c1b4f6d177ce63c7268d244bfeecfbcc848c

$ docker run -it 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child /bin/bash
circleci@da7ae5dc0fad:~/project$ ls
test_file.txt
circleci@da7ae5dc0fad:~/project$

However if I run the ./pants package command a second time, then the file does appear:

$ ./pants package infra/docker/build_runner_child:build_runner_child
13:18:35.26 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
13:18:35.27 [INFO] Canceled: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_pants
13:18:39.77 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_pants
13:18:39.77 [INFO] Built docker image: 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
Docker image ID: sha256:c214db88700b60d3b0ec33d9e3bc1e43b1f5f032cb5073ac9d57d1c01ee4ad93

$ docker run -it 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child /bin/bash
circleci@1f4db35663de:~/project$ ls
test_file_parent.txt  test_file.txt
circleci@1f4db35663de:~/project$

I found this behavior highly unusual and unexpected. I'm not sure if this is an issue with my local setup, or a bug with pants, but I thought I would report it here to get some help investigating.

jsirois commented 2 years ago

Looking at 2.11.1 release notes: https://pypi.org/project/pantsbuild.pants/2.11.1/

I see https://github.com/pantsbuild/pants/pull/15300

Can you try an upgrade to 2.11.1 and see if that fixes?

jsirois commented 2 years ago

Ah, nmm! This has 0 python_distribution targets involved.

kwrobert commented 2 years ago

Interesting new development, if I instead use the instructions argument of the docker_image target to provide my image building instructions to pants, everything works as expected! Also, the logs from the ./pants package command look different, and notably do not contain any indication of commands being cancelled. Here are my new BUILD files:

$ cat infra/docker/build_runner/BUILD
docker_image(
    name="build_runner",
    image_tags=["build_runner_pants"],
    instructions=[
        "FROM cimg/python:3.8",
        "# hadolint ignore=DL3008,DL3004",
        "RUN sudo apt-get update -y \\",
        "    && sudo apt-get install  --no-install-recommends -y python3-dev python3-distutils \\",
        "    amazon-ecr-credential-helper \\",
        "    awscli \\",
        "    build-essential \\",
        "    libssl-dev \\",
        "    libffi-dev \\",
        "    && sudo apt-get clean \\",
        "    && sudo rm -rf /var/lib/apt/lists/*",
        "# hadolint ignore=DL3013",
        "RUN pip install --no-cache-dir credstash",
        "RUN echo 'testing' >> test_file_parent.txt",
        "RUN echo 'testing' >> test_file_parent2.txt",
        "RUN echo 'testing' >> test_file_parent3.txt",
        "RUN echo 'testing' >> test_file_parent4.txt",
    ],
    source="",
)
$ cat infra/docker/build_runner_child/BUILD
docker_image(
    name="build_runner_child",
    image_tags=["build_runner_child"],
    source="",
    instructions=[
        "ARG BASE=infra/docker/build_runner:build_runner",
        "FROM $BASE",
        "RUN echo 'testing' >> test_file.txt"
    ]
)

And the results of packaging the child:

$ ./pants package infra/docker/build_runner_child:build_runner_child
09:28:44.75 [INFO] Initializing scheduler...
09:28:45.00 [INFO] Scheduler initialized.
09:28:49.90 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_pants
09:28:50.67 [INFO] Completed: Building docker image 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
09:28:50.67 [INFO] Built docker image: 137296740171.dkr.ecr.us-west-2.amazonaws.com/build-system-demo-pants:build_runner_child
Docker image ID: sha256:af4613f1ebb8933e097140f7e9e7d4d6e112da453d5a5fabaaa4b569ce1323c5

And the files appear as expected in the child image after a single invocation of the packaging command. Notice how there is no cancellation of build steps in the logs there (in contrast to when I have external Dockerfiles defined outside the build file)?

This makes me think there must be some sort of race condition in the way pants parses Dockerfiles? I know pants will cancel a build task (as it does when using external Dockerfiles) if it detects the files consumed by a build target are modified on disk while it is executing a goal that builds that target. Maybe something like that is happening here, i.e pants is somehow indicating to itself that one of the Dockerfiles was modified by parsing the file.

kaos commented 2 years ago

fyi @pantsbuild/maintainers I'll not be able to investigate this currently.

benjyw commented 2 years ago

Related to https://github.com/pantsbuild/pants/issues/16356 ?

thejcannon commented 2 years ago

I slightly suspect it's a timing issue, due to the instructions aspect. The daemon might not be receiving the file mod notification before the command is issued?

Doesn't seem related to the other issue to me, because the second run works. Unless the second run happened to restart the daemon

benjyw commented 2 years ago

Hi @kwrobert thanks for the bug report, and sorry it's taken a while to get to triaging it.

I reproduce this locally on HEAD of the main branch, so this is definitely a Pants bug of some kind. Will dig further.

benjyw commented 2 years ago

Interesting distinction between runs that do update the parent image and those that don't:

When the parent image is NOT updated, the console logging shows:

statler:[~/src/pantsdockertest][main]$ ./pants_from_sources --pantsd package infra/docker/build_runner_child:build_runner_child
16:49:35.46 [INFO] Canceled: Building docker image build_runner:build_runner_pants
16:49:36.39 [INFO] Completed: Building docker image build_runner_child:build_runner_child
16:49:36.39 [INFO] Canceled: Building docker image build_runner:build_runner_pants
16:49:38.25 [INFO] Completed: Building docker image build_runner:build_runner_pants
16:49:38.25 [INFO] Built docker image: build_runner_child:build_runner_child

Then the subsequent run that does update the parent image has this:

statler:[~/src/pantsdockertest][main]$ ./pants_from_sources --pantsd package infra/docker/build_runner_child:build_runner_child
16:50:05.53 [INFO] Completed: Building docker image build_runner_child:build_runner_child
16:50:05.53 [INFO] Canceled: Building docker image build_runner:build_runner_pants
16:50:06.27 [INFO] Completed: Building docker image build_runner:build_runner_pants
16:50:06.27 [INFO] Built docker image: build_runner_child:build_runner_child

Note that the former has one extra Canceled: Building docker image build_runner:build_runner_pants line. Not sure what this means.

Related, it's weird that we show "Completed" for processes that we "Canceled".

benjyw commented 2 years ago

See here for a repo and README that fully reproduces this: https://github.com/benjyw/dockerbug

benjyw commented 2 years ago

I think what's happening is that the two processes are speculatively re-run on their old inputs. Then the parent gets canceled and re-run because its inputs have changed, but the child isn't because no explicit input to it has changed: the BuiltPackage for the parent image doesn't change because we don't actually put anything in the output digest...

benjyw commented 2 years ago

Although we do update the image id in the artifact names, so maybe that's not working properly somehow

benjyw commented 2 years ago

No, that's not it. We do update the imageid correctly in the BuiltPackage.

benjyw commented 2 years ago

OK, yep, the problem is that the actual input to the child image build process is a DockerBuildContext, and that does not reflect the changes to the parent image. The creation of DockerBuildContext regenerates the parent image as a byproduct, but its identity doesn't affect DockerBuildContext's hash...

benjyw commented 2 years ago

Or more precisely, its tendrils into the child building Process.

benjyw commented 2 years ago

Fixed by #16419

benjyw commented 2 years ago

Thanks for the bug report and for your patience, @kwrobert!

kwrobert commented 2 years ago

Thanks for banging out this bug fix! What's the release plan for getting this into mainline? Would love to pull it down and deploy it into my CI systems.

Eric-Arellano commented 2 years ago

Once it's merged, you can use PANTS_SHA: https://www.pantsbuild.org/docs/manual-installation#running-pants-from-unreleased-builds. It will go out in the next 2.14 dev release, and will also be cherry-picked for the next 2.13 rc :)