shipwright-io / build

Shipwright - a framework for building container images on Kubernetes
https://shipwright.io
Apache License 2.0

Build controller handling of service account can lead to failing build runs and other artifacts #337


SaschaSchwarze0 commented 4 years ago

In a build run spec, there is a section about the service account. This section can be (see the sketch after this list):

  1. set to a specific name and generate = false to make use of the specified service account
  2. set to generate = true to generate a service account specific to the build run
  3. omitted to get the default behavior, which is to use the pipeline service account or, if that does not exist, the default service account
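
To make these rules concrete, here is a minimal Go sketch of how the resolution could look, assuming a controller-runtime client; the saSpec type, the function name, and the "-sa" suffix are illustrative, not the project's actual API:

    package buildrun

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/types"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // saSpec is an illustrative stand-in for the serviceAccount section of a
    // build run spec; it is not the project's actual API type.
    type saSpec struct {
        Name     *string // case 1: use this named service account
        Generate bool    // case 2: generate a build-run-specific one
    }

    // resolveServiceAccountName applies the three rules listed above.
    func resolveServiceAccountName(ctx context.Context, c client.Client, ns, buildRunName string, sa *saSpec) string {
        switch {
        case sa != nil && sa.Generate:
            return buildRunName + "-sa" // generated SA, created elsewhere
        case sa != nil && sa.Name != nil:
            return *sa.Name
        default:
            // case 3: prefer the pipeline service account, fall back to default
            var pipeline corev1.ServiceAccount
            if err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: "pipeline"}, &pipeline); err == nil {
                return "pipeline"
            }
            return "default"
        }
    }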

When the generated service account is not used, our task run generation logic adds the secrets referenced in the build (for example, for the source repository and the target container registry) to the specified service account if they are not already contained.

If the user now deletes one of those secrets, the service account is broken, and our controller will not repair it. If the user then creates a build run that uses such a service account, the build run fails with this reason: failed to create task run pod "new-build-run-kwkl5-j9b27": translating TaskSpec to Pod: secrets "something" not found. Maybe invalid TaskSpec. As a service account can be used for other purposes as well (especially if scenario 3 applies and a fallback to default happens), those other things (jobs, deployments) might also fail. (I should mention that I did not try how they behave with such a service account.)
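
A rough sketch of that append-if-missing behavior, assuming a controller-runtime client (the function name is illustrative; the real logic lives in the task run generation code):

    package buildrun

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // appendBuildSecrets adds each secret referenced by the build to the
    // service account unless it is already listed, then persists the change.
    func appendBuildSecrets(ctx context.Context, c client.Client, sa *corev1.ServiceAccount, secretNames []string) error {
        for _, name := range secretNames {
            found := false
            for _, ref := range sa.Secrets {
                if ref.Name == name {
                    found = true
                    break
                }
            }
            if !found {
                sa.Secrets = append(sa.Secrets, corev1.ObjectReference{Name: name})
            }
        }
        // If one of these secrets is deleted later, the stale reference stays
        // in the service account, which causes the failure described above.
        return c.Update(ctx, sa)
    }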

Here are some options that I see:

(A) Watch secret deletions and remove references from the service account(s). This would probably be possible if we only supported either a hard-coded service account or a generated one. But given that the user can use a named service account for their build runs, this could mean that we need to manage a lot of service accounts.

(B) At the time we assign a named service account to a build run, we check whether all secrets really exist and if one does not exist we fail the build run with an error message that is nicer than the one coming from Tekton.

(C) At the time we assign a named service account to a build run, we check whether all secrets really exist and if one does not exist we remove it from the service account.

(D) We do nothing and only document the current behavior.

In addition to those options, we may also consider a configuration setting for the controller that allows forcing a build run to use a generated service account, ignoring the intent the user specified in the build run.

SaschaSchwarze0 commented 4 years ago

My preference:

I'd rule out (B) and (D) on their own because they require the user to fix something. (B) or (D) would only be okay if the addition (a config setting to force a generated service account) is also implemented.

Imo (A) would be quite complex: it would not just mean watching secret deletions; for a full solution, one would also need to handle build controller restarts, where we would then need to validate all service accounts. The nice thing about this option, on the other hand, is that it does not affect performance when a new build run arrives and we have to create the task run, because we can then assume that all service accounts are okay. But imo that is not worth it from an effort-value perspective.

My preference is (C) because it still means that the user does not need to fix anything manually. And I hope the impact on task run creation performance is not too high.

zhangtbj commented 4 years ago

Taking (C) as an example:

Do you mean using a named service account such as buildrun-sa instead of the pipeline or default SA? We use the pipeline SA because OpenShift uses it.

I may prefer switching the default behavior to generate = true and documenting the problem you may run into if you set generate = false.

SaschaSchwarze0 commented 4 years ago

Do you mean using a named service account such as buildrun-sa instead of the pipeline or default SA? We use the pipeline SA because OpenShift uses it.

No, with (C) I mean the following: we do not change the behavior of which service account is taken; that logic would stay (though the hard-coded pipeline name is worth an environment variable, but anyway). What would be done in addition, maybe in this function, https://github.com/redhat-developer/build/blob/08b80ac29535a922d75d64c1f4f212591c7a4c44/pkg/controller/buildrun/credentials.go#L14-L37, or maybe rather in a separate function, is to look at the overall set of secrets in the service account and verify that they exist (either with a client.Get for each one, or maybe more efficiently with a client.List in the namespace). Those that do not exist would be removed from the service account.
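
A minimal sketch of that check, assuming a controller-runtime client (the function name is illustrative, not existing code):

    package buildrun

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // pruneMissingSecrets lists all secrets in the namespace once (cheaper
    // than one Get per reference) and removes service account references to
    // secrets that no longer exist.
    func pruneMissingSecrets(ctx context.Context, c client.Client, sa *corev1.ServiceAccount) error {
        var secrets corev1.SecretList
        if err := c.List(ctx, &secrets, client.InNamespace(sa.Namespace)); err != nil {
            return err
        }
        existing := make(map[string]bool, len(secrets.Items))
        for _, s := range secrets.Items {
            existing[s.Name] = true
        }
        kept := sa.Secrets[:0]
        for _, ref := range sa.Secrets {
            if existing[ref.Name] {
                kept = append(kept, ref)
            }
        }
        if len(kept) != len(sa.Secrets) {
            sa.Secrets = kept
            return c.Update(ctx, sa)
        }
        return nil
    }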

adambkaplan commented 4 years ago

When the generated service account is not used, our task run generation logic adds the secrets referenced in the build (for example, for the source repository and the target container registry) to the specified service account if they are not already contained.

Is this something that we need to do? For a very long time now, the default k8s behavior has been that if a secret exists in a namespace, it is readable by any service account in the namespace. The only secrets directly linked to service accounts are image pull secrets and service account tokens, the latter of which are automatically generated by k8s. For image pull secrets, we only need to link the pull secrets needed to pull pod images, not images referenced in Dockerfiles and such.

SaschaSchwarze0 commented 4 years ago

When the generated service account is not used, our task run generation logic adds the secrets referenced in the build (for example, for the source repository and the target container registry) to the specified service account if they are not already contained.

Is this something that we need to do? For a very long time now, the default k8s behavior has been that if a secret exists in a namespace, it is readable by any service account in the namespace. The only secrets directly linked to service accounts are image pull secrets and service account tokens, the latter of which are automatically generated by k8s. For image pull secrets, we only need to link the pull secrets needed to pull pod images, not images referenced in Dockerfiles and such.

Hi @adambkaplan, we use secrets as part of a build run for two purposes that k8s does not handle itself:

  1. To clone a git source that requires authentication. This is handled by Tekton, and the documentation states that a service account is necessary: Configuring ssh-auth authentication for Git.
  2. As the image pull secret for the FROM image when running Kaniko, and as the image push secret to push the final image to a container registry. I think we also rely on Tekton here to make the secrets available to our steps. Tekton describes docker secrets, including the service account usage, in Configuring docker* authentication for Docker.

The magic that creates the necessary files within the task's home directory (/home/tekton, an emptyDir volume) happens inside the creds-init init container. It is set up by credsInit, which only considers the secrets that are associated with the service account.

In short: yes, we have to do this.

qu1queee commented 4 years ago

@SaschaSchwarze0 sorry, I'm digesting the msgs, will provide my 2 cents soonish

qu1queee commented 4 years ago

My opinion is that we do not need to do anything, because of the following:

  1. This use case is only relevant when a new BuildRun is created while the referenced Build already exists. If the end user applies the Build first while the secret does not exist, then they will see the following:

    $ k -n test-build get build
    NAME                     REGISTERED   REASON                             BUILDSTRATEGYKIND      BUILDSTRATEGYNAME   CREATIONTIME
    buildpack-nodejs-build   False        secret icr-docker does not exist   ClusterBuildStrategy   buildpacks-v3       4s

    plus the BuildRun will not start due to the False Registered status.

  2. Tekton is already doing the validation for us and returning an error. As you mentioned:

The magic that creates the necessary files within the task's home directory (/home/tekton, an emptyDir volume) happens inside the creds-init init container. It is set up by credsInit, which only considers the secrets that are associated with the service account.

In sequential order, what Tekton does is:

so I'm not seeing the value in doing what Tekton already does.

  1. The Tekton error is intuitive to me; it says secrets "something" not found.

  2. If it fails, the BuildRun will properly show the error coming from Tekton.

  3. A pod is never created, no resources are wasted.

My proposal is that this is not a bug and that we should not change anything in code, but delegate the validation to Tekton (which is already being done). We can add documentation to this project around the use case of this issue and what the error reported by Tekton means in simple words.

SaschaSchwarze0 commented 4 years ago

Thank you for your opinion @qu1queee. In the meantime, I have come to think we should go with another option. I am considering these two:

(E) Remove the fallback to the pipeline and default service accounts. Instead, we require the user to provide a service account name that exists (they can specify default if they want). If the service account section in the build run is not specified, then the new default behavior will be to use a generated service account.

(F) Remove support for named service accounts entirely, remove the service account section from the build run, and always use a generated service account.

The reason for this is that I learned in the meantime (in a KubeCon talk) that a task run will fail if two secrets in the service account are for the same resource (same container registry, same git endpoint). Tekton's creds-init step cannot handle this, and the reason is obvious: the Docker secrets all need to go into one file, the Docker config.json. In that config.json, the container registry host is the key of the object, which does not allow two credentials to be passed for the same registry.
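
The host-as-key constraint can be illustrated with a few lines of Go; this only mimics the structure of a config.json with made-up values, it is not Tekton's code:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // authEntry mimics one entry in the "auths" object of a Docker config.json.
    type authEntry struct {
        Auth string `json:"auth"` // base64("user:password")
    }

    func main() {
        auths := map[string]authEntry{}
        auths["registry.example.com"] = authEntry{Auth: "dXNlcjE6cGFzczE="} // from the first secret
        auths["registry.example.com"] = authEntry{Auth: "dXNlcjI6cGFzczI="} // second secret for the same host replaces it
        out, _ := json.MarshalIndent(map[string]map[string]authEntry{"auths": auths}, "", "  ")
        fmt.Println(string(out)) // only one credential per registry host survives
    }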

I'm always in favor of providing a solution that works for the user without much for them to take care of. The generated service account seems to be the best choice in my opinion. One could go the all-or-nothing way and completely remove support for named service accounts (F), which would also simplify the CRD and code since we would not need to implement both options. Or one could still allow named service account usage as an opt-in (E).

qu1queee commented 4 years ago

Remove the fallback to the pipeline and default service accounts. Instead, we require the user to provide a service account name that exists (they can specify default if they want). If the service account section in the build run is not specified, then the new default behavior will be to use a generated service account.

This is not ideal for OpenShift. My understanding is that in OpenShift the pipeline SA already has authentication to the OpenShift registry, so it is good to be able to fall back to that SA when users don't specify anything. This use case also means we cannot do (F). Regarding the default SA, because this SA exists in all k8s namespaces, I also see value in leaving it there.

The reason for this is that I learned in the meantime (in a KubeCon talk) that a task run will fail if two secrets in the service account are for the same resource (same container registry, same git endpoint). Tekton's creds-init step cannot handle this, and the reason is obvious: the Docker secrets all need to go into one file, the Docker config.json. In that config.json, the container registry host is the key of the object, which does not allow two credentials to be passed for the same registry.

I'm not sure about the above, so I tried to validate this use case. I basically have two secrets with the same content (for Docker Hub) but different names, and I can see that the credential-initializer init container has no problem with them:

    initContainers:
    - args:
      - -docker-config=icr-knbuild
      - -docker-config=icr-docker
      - -docker-config=icr-docker2
      command:
      - /ko-app/creds-init
      env:
      - name: HOME
        value: /tekton/home
      image: gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/creds-init:v0.14.2@sha256:64a511c0e68f611f630ccbe3744c97432f69c300940f4a32e748457581394dd6
      imagePullPolicy: IfNotPresent
      name: credential-initializer
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /workspace
        name: tekton-internal-workspace
      - mountPath: /tekton/home
        name: tekton-internal-home
      - mountPath: /tekton/results
        name: tekton-internal-results
      - mountPath: /tekton/creds-secrets/icr-knbuild
        name: tekton-internal-secret-volume-icr-knbuild-gptgx
      - mountPath: /tekton/creds-secrets/icr-docker
        name: tekton-internal-secret-volume-icr-docker-tsqlb
      - mountPath: /tekton/creds-secrets/icr-docker2
        name: tekton-internal-secret-volume-icr-docker2-bbsll
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-4rl9h
        readOnly: true

@SaschaSchwarze0 can you validate the above?

I think what I'm saying is that we should leave the code as it is.

SaschaSchwarze0 commented 4 years ago

@SaschaSchwarze0 can you validate the above?

Yeah, later, but the error should happen here: https://github.com/tektoncd/pipeline/blob/v0.14.2/pkg/credentials/dockercreds/creds.go#L77-L79

SaschaSchwarze0 commented 4 years ago

@SaschaSchwarze0 can you validate the above?

Yeah, later, but the error should happen here: https://github.com/tektoncd/pipeline/blob/v0.14.2/pkg/credentials/dockercreds/creds.go#L77-L79

I agree with your observation @qu1queee and consider this a Tekton bug. Tekton allows container registry credentials in these formats (see Configuring authentication for Docker; both are sketched below):

  1. with a .dockerconfigjson field in the data that contains the base64-encoded config.json file
  2. with username and password fields and a tekton.dev/docker-0 annotation for the registry host

We typically (or I actually always) take the first option. But the code that I pointed to is for the second option, and only that option prevents duplicate credentials for the same registry. The other one is missing those checks here and here. That'll become my first Tekton PR. :-)
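
For reference, the two formats could be constructed like this in Go (names and credentials are made up for illustration; the secret types and the tekton.dev/docker-0 annotation are the ones from the Kubernetes and Tekton documentation):

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        // Format 1: a kubernetes.io/dockerconfigjson secret carrying a whole config.json.
        configJSON := &corev1.Secret{
            ObjectMeta: metav1.ObjectMeta{Name: "push-secret-configjson"},
            Type:       corev1.SecretTypeDockerConfigJson,
            StringData: map[string]string{
                corev1.DockerConfigJsonKey: `{"auths":{"registry.example.com":{"auth":"dXNlcjpwYXNz"}}}`,
            },
        }

        // Format 2: a basic-auth secret annotated with the registry host.
        basicAuth := &corev1.Secret{
            ObjectMeta: metav1.ObjectMeta{
                Name:        "push-secret-basic",
                Annotations: map[string]string{"tekton.dev/docker-0": "https://registry.example.com"},
            },
            Type: corev1.SecretTypeBasicAuth,
            StringData: map[string]string{
                "username": "user",
                "password": "pass",
            },
        }

        fmt.Println(configJSON.Name, basicAuth.Name)
    }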

zhangtbj commented 4 years ago

I also agree this is not a bug, but we can improve the behavior or document the potential problem.

We keep the pipeline SA as the default because of the OpenShift use case. Sascha, you can confirm with Shoubhik or Otávio; if OpenShift doesn't require the default pipeline SA, I think we are all fine with removing the dedicated SA and just using the generated SA.

So that is why I suggest changing the default to generate = true and documenting the problem if the end user switches to generate = false.

And we tested multiple container registries in one SA before. We can have different container registries in one SA, such as one for docker.io and another for icr.io; that works. But if the two secrets are both for docker.io, the build will fail because Tekton doesn't know which one it should use, so we should avoid this case or document it somewhere.

qu1queee commented 4 years ago

So that is why I suggest changing the default to generate = true and documenting the problem if the end user switches to generate = false.

@zhangtbj yes, I think we should do this as you said. Can you open a new issue for that, and work on it if you have time or ask someone in CDL? :)

@SaschaSchwarze0 can we then close this one in favor of the new upcoming issue from @zhangtbj ?

SaschaSchwarze0 commented 4 years ago

So that is why I suggest changing the default to generate = true and documenting the problem if the end user switches to generate = false.

@zhangtbj yes, I think we should do this as you said. Can you open a new issue for that, and work on it if you have time or ask someone in CDL? :)

@SaschaSchwarze0 can we then close this one in favor of the new upcoming issue from @zhangtbj ?

Sorry, I was focused on other stuff the last few days.

I think we need to validate our ideas with our RH friends. My understanding currently is that we plan to run the build service differently in our offerings:

With such conflicting interests, I wonder whether this can only be resolved using a configuration option that controls the default behavior. It would be similar to what the e2e tests do: one passes an environment variable to the operator that holds the name of the service account; the special value generated means not to use a fixed service account but the generated one by default. This behavior would apply as long as no serviceAccount section is defined in the buildrun spec.
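
A minimal sketch of such a setting; the environment variable name BUILD_RUN_DEFAULT_SA is made up for illustration:

    package buildrun

    import "os"

    // defaultServiceAccount returns the default service account behavior for
    // build runs that do not define a serviceAccount section.
    func defaultServiceAccount() (name string, generate bool) {
        switch v := os.Getenv("BUILD_RUN_DEFAULT_SA"); v {
        case "":
            return "pipeline", false // today's fallback behavior
        case "generated":
            return "", true // use a generated service account
        default:
            return v, false // operator-chosen fixed service account
        }
    }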

qu1queee commented 4 years ago

@SaschaSchwarze0 I think there is no need for this:

It would be similar to what the e2e tests do: one passes an environment variable to the operator that holds the name of the service account; the special value generated means not to use a fixed service account but the generated one by default.

in my opinion this is adding complexity to the code, and I think the SA management should only belong to the Build or BuildRun definition.

But you have a good point: if we always auto-generate the SA by default, then we need to:

So that is why I suggest changing the default to generate = true and documenting the problem if the end user switches to generate = false.

We should close this issue and open a new one where we discuss the current topic. This issue's original intention is no longer valid.

SaschaSchwarze0 commented 4 years ago

We should close this issue and open a new one where we discuss the current topic. This issue's original intention is no longer valid.

I am not somebody who likes these kinds of discussions, as they consume unnecessary time. The issue I describe here is perfectly valid, and the intention is to prevent users from ending up with service accounts that contain deleted secrets. Changing the default to a generated service account is a perfect mitigation, as it puts the decision to use a dedicated service account into the hands of the user who specifies it in the buildrun. Combined with documentation on what it means to use a dedicated service account (secrets being added to it) and a hint that the user is then responsible for removing secrets from it if they decide to delete one, this imo is a perfect resolution of exactly what I want to address.

davidberglund commented 3 years ago

I have been trying to get started with Shipwright since yesterday, but I'm stuck with what I assume to be the issue described here.

    $ k logs myapp123-lrkrd-pod-cprfn
    error: a container name must be specified for pod myapp123-lrkrd-pod-cprfn, choose one of: [step-source-default step-build-and-push] or one of the init containers: [place-tools working-dir-initializer]

The source code and registry are on a private, self-hosted GitLab instance, using the kaniko strategy. I can confirm that using a public git repo works as expected (except for the incorrect timestamp on the image, it says created 40yrs ago :) Now, have you documented anywhere how to get past the above error? Thanks! Great project!

SaschaSchwarze0 commented 3 years ago

I have been trying to get started with Shipwright since yesterday, but I'm stuck with what I assume to be the issue described here.

    $ k logs myapp123-lrkrd-pod-cprfn
    error: a container name must be specified for pod myapp123-lrkrd-pod-cprfn, choose one of: [step-source-default step-build-and-push] or one of the init containers: [place-tools working-dir-initializer]

The source code and registry are on a private, self-hosted GitLab instance, using the kaniko strategy. I can confirm that using a public git repo works as expected (except for the incorrect timestamp on the image, it says created 40yrs ago :) Now, have you documented anywhere how to get past the above error? Thanks! Great project!

Hi @davidberglund, thank you for the feedback. Your question is not related to this issue. The best place for questions is our Slack channel, #shipwright, in the Kubernetes Slack, or you can open a new issue.

Regarding your question: BuildRun pods consist of multiple containers. As such, you need to run kubectl logs with either the -c argument and a container name, or with --all-containers. The BuildRun's status should also guide you specifically to the command for the failed step. Information about how to authenticate to a private Git repository can be found here: https://github.com/shipwright-io/build/blob/main/docs/development/authentication.md#authentication-for-git.