oras-project / oras-go

ORAS Go library
https://oras.land
Apache License 2.0

Prevent extra authentication round trips in non-compliant registries #650

Closed ktarplee closed 9 months ago

ktarplee commented 10 months ago

I have been testing with JFrog Artifactory and oras cp will receive a 401 for each and every access to the repo that it has to go resolve by following the auth challenge and then use the received token (actually it is the same as stored in the auth cache already but under a different scoped key) to make the request and succeed. While the current behavior works, it is obviously inefficient. crane and docker use a simplified host-based caching (not scope based caching like oras) so they do not have an issue in this specific case.

The discussion over on slack has lots of details.

shizhMSFT commented 10 months ago

There is a security concern about sending a token with a mismatched scope to the remote server. Consider a registry owned by an organization and backed by two teams:

flowchart LR
  Client --> Gateway
  subgraph Registry
    Gateway -- /v2/repo-a/ --> team_a[Team A]
    Gateway -- /v2/repo-b/ --> team_b[Team B]
  end

With host-based auth (HostCache or RobustCache), the client will send the token for repo-a to Team B if the client accesses repo-a first and then repo-b. If Team B is compromised, the attacker can obtain the token to access repo-a in addition to repo-b, which is pretty bad.

Readers may find that the above attack is not effective against many CLI tools such as crane. That is because the context is limited to a single repository per process execution: there is no scenario in which a CLI process accesses multiple repositories within its execution lifecycle using the same memory cache (the mount scenario is still considered one repo since it only sends requests to the target repo). However, developers may access multiple repositories concurrently using oras-go since it is a library.
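The difference between the two keying strategies can be sketched with a toy cache (plain maps, not oras-go's actual Cache interface; hostKey and scopeKey are made-up helpers for illustration):

```go
package main

import "fmt"

// tokenCache is a toy cache; the real oras-go Cache interface is richer.
type tokenCache map[string]string

// hostKey mimics host-based caching: one token per registry host.
func hostKey(host, repo string) string { return host }

// scopeKey mimics scope-based caching: one token per repository scope.
func scopeKey(host, repo string) string {
	return host + "|repository:" + repo + ":pull"
}

func main() {
	scoped := tokenCache{}
	// A token obtained for repo-a under scope-based keying...
	scoped[scopeKey("reg.example.com", "repo-a")] = "token-a"
	// ...is not found for repo-b, so it can never leak to Team B's backend.
	_, leaked := scoped[scopeKey("reg.example.com", "repo-b")]
	fmt.Println("scope-keyed leak:", leaked)

	hosted := tokenCache{}
	hosted[hostKey("reg.example.com", "repo-a")] = "token-a"
	// Under host-based keying, the same entry is reused for repo-b.
	_, leaked = hosted[hostKey("reg.example.com", "repo-b")]
	fmt.Println("host-keyed leak:", leaked)
}
```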

In addition, sending tokens with a mismatched scope can confuse users as well as the on-call DRIs of the registry service providers when there is a real auth failure. The server log will show a scope mismatch instead of the actual auth failure reason, making the issue harder to diagnose.

Therefore, I don't think it is a good idea to support non-compliant registries in oras-go at the cost of security. Nonetheless, we can continue discussing whether we want to optimize the auth round trips for non-compliant registries in the oras CLI.

ktarplee commented 10 months ago

@shizhMSFT We might need to discuss this in a meeting...

crane, a.k.a. go-containerregistry, is both a program and a library. For example, cosign uses go-containerregistry as a library.

"It is because the context is limited to a single repository per process execution" is not true. crane cp has a source and a destination repo passed as arguments, so crane can access two repositories on the same registry. But maybe I do not understand what you mean by "context" in your sentence. In the same vein, isn't crane cp a counterexample to your statement "There is no scenario for a CLI process to access multiple repositories within the process execution lifecycle with the same memory cache"?

Some clarifying questions:

  1. "the client will send the token for repo-a to the team B" makes me think you are modeling a registry as more than just a host (or set of HA hosts): each repository within the registry is run on different hosts by different teams, yet all are accessible via the same host name? I do not know of a registry implementation that works that way. There seems to be no benefit for a team to run their own repo within an organization, but I am probably missing something. There might be a benefit to them running their own registry.
  2. Are you claiming that crane has a security vulnerability? Or more specifically, that any tool that uses non-scoped auth (e.g., go-containerregistry) and accesses multiple repositories on the same registry in the same process has a security vulnerability?

The security model nearly always used to justify token-based authorization is that any token provided to a host must not give that host any extra permissions beyond what it already has. Therefore, passing a token to the wrong host is very bad. Neither HostCache nor RobustCache violates this property if you interpret the host as literally the host name: tokens are restricted to the host and never sent to a different host.

Now, if that token happens to be forwarded, unwrapped, to repo-a or repo-b, that is a security issue with the Gateway. Likewise, if the Gateway decides to send your traffic (along with your token) to a different organization, that is also an issue with the Gateway.

The security model you are using is tighter than the above in that you want to authenticate with a repository, not a registry. It seems to me that you are thinking of a use case where the gateway is just nginx/istio, and repo-a and repo-b are two separate and complete registries run by separate teams, masquerading (for some unknown reason) as a single unified registry to the client. So auth would be delegated to the repo-a or repo-b server. They might even use different authentication providers, I suppose.

I believe ~/.docker/config.json can have different entries for reg.example.com/repo-a and reg.example.com/repo-b, so it might be possible to send different credentials to each auth provider. Note that oras login currently does not support oras login reg.example.com/repo-a.

shizhMSFT commented 10 months ago

@ktarplee A meeting is more efficient I think. We can chat a bit via Slack when both of us are available.

crane is an interesting case: both a program and a library. When crane is used as a library, the context is limited to a single function call unless you build from low-level methods. For instance, the following code

package main

import (
    "crypto/sha256"
    "fmt"
    "net/http"
    "sync/atomic"

    "github.com/google/go-containerregistry/pkg/crane"
)

func main() {
    tr := &transport{}

    resolve("docker.io/library/alpine:latest", crane.WithTransport(tr))
    resolve("docker.io/library/alpine:latest", crane.WithTransport(tr))
    resolve("docker.io/library/nginx:latest", crane.WithTransport(tr))
    resolve("docker.io/library/alpine:latest", crane.WithTransport(tr))
}

func resolve(ref string, opts ...crane.Option) {
    desc, err := crane.Get(ref, opts...)
    if err != nil {
        panic(err)
    }
    fmt.Println(ref, "-->", desc.Digest)
}

type transport struct {
    count atomic.Int32
}

func (t *transport) RoundTrip(req *http.Request) (*http.Response, error) {
    count := t.count.Add(1) - 1
    var authSum []byte
    if auth := req.Header.Get("Authorization"); auth != "" {
        sum := sha256.Sum256([]byte(auth))
        authSum = sum[:]
    }
    fmt.Printf("[#%d] %s %s\n  SHA256(Authorization): %x\n", count, req.Method, req.URL, authSum)
    return http.DefaultTransport.RoundTrip(req)
}

produces the following output

[#0] GET https://index.docker.io/v2/
  SHA256(Authorization):
[#1] GET https://auth.docker.io/token?scope=repository%3Alibrary%2Falpine%3Apull&service=registry.docker.io
  SHA256(Authorization):
[#2] GET https://index.docker.io/v2/library/alpine/manifests/latest
  SHA256(Authorization): 910b8ea61aef04a22764a28afbf08b99c18e4420da93752a70382929e41813dd
docker.io/library/alpine:latest --> sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48
[#3] GET https://index.docker.io/v2/
  SHA256(Authorization):
[#4] GET https://auth.docker.io/token?scope=repository%3Alibrary%2Falpine%3Apull&service=registry.docker.io
  SHA256(Authorization):
[#5] GET https://index.docker.io/v2/library/alpine/manifests/latest
  SHA256(Authorization): 9468c0620abcce89e882f0eb5148d3b4dda273244f35e073cb72f2c16df8c7b8
docker.io/library/alpine:latest --> sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48
[#6] GET https://index.docker.io/v2/
  SHA256(Authorization):
[#7] GET https://auth.docker.io/token?scope=repository%3Alibrary%2Fnginx%3Apull&service=registry.docker.io
  SHA256(Authorization):
[#8] GET https://index.docker.io/v2/library/nginx/manifests/latest
  SHA256(Authorization): 4d7b9d16a466c86303743e705a1317134bbeb0d0539982334402f5e6a6a9fc00
docker.io/library/nginx:latest --> sha256:10d1f5b58f74683ad34eb29287e07dab1e90f10af243f151bb50aa5dbb4d62ee
[#9] GET https://index.docker.io/v2/
  SHA256(Authorization):
[#10] GET https://auth.docker.io/token?scope=repository%3Alibrary%2Falpine%3Apull&service=registry.docker.io
  SHA256(Authorization):
[#11] GET https://index.docker.io/v2/library/alpine/manifests/latest
  SHA256(Authorization): e9169cba3527bf74b51c9c2d0665833c93f81221afce24da05924daa5dfc840f
docker.io/library/alpine:latest --> sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48

The cache is not shared across function calls. Therefore, the library's overall behavior is the same as the CLI's.

The Copy functionality works with two repositories. However, crane internally creates a puller and a pusher for each Copy call, where each puller / pusher is scoped to a single repository (the puller to the source and the pusher to the destination). Therefore, the cache is never used across repositories. It is very secure 😄.
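That isolation can be sketched with toy types (repoClient, copyImage, and the token strings are made up for illustration, not crane's API):

```go
package main

import "fmt"

// repoClient is a toy stand-in for crane's puller/pusher: a client bound
// to one repository with its own private token cache.
type repoClient struct {
	repo  string
	cache map[string]string
}

func newRepoClient(repo string) *repoClient {
	return &repoClient{repo: repo, cache: map[string]string{}}
}

// copyImage mimics crane's Copy: it builds a fresh puller for the source
// and a fresh pusher for the destination, so a token cached while pulling
// can never be replayed against the destination repository.
func copyImage(src, dst string) (puller, pusher *repoClient) {
	puller, pusher = newRepoClient(src), newRepoClient(dst)
	puller.cache["repository:"+src+":pull"] = "src-token"
	pusher.cache["repository:"+dst+":push"] = "dst-token"
	return puller, pusher
}

func main() {
	puller, pusher := copyImage("repo-a", "repo-b")
	_, crossed := pusher.cache["repository:repo-a:pull"]
	fmt.Println("pusher sees puller's token:", crossed)
	fmt.Println("puller cache size:", len(puller.cache))
}
```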


My security model is indeed tighter, as I assume the scenario above, where each repository behind the gateway is run by a separate team with its own security level.

I know the above scenario is kind of artificial, but it is possible (maybe someone has already built it). The reason I made up this scenario is that I wanted to explore what happens if the client sends a valid token with a mismatched scope to a remote registry with levelled security.

Moreover, I agree that HostCache or RobustCache is secure if we use one cache per NewRepository(), just like crane, but I hesitate to add them in the interest of security by default.


BTW, I've tried

$ docker login shizhtest.azurecr.io/hello-world
Username: shizhtest
Password:
WARNING! Your password will be stored unencrypted in /home/shizh/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
$ cat /home/shizh/.docker/config.json
{
        "auths": {
                "shizhtest.azurecr.io": {
                        "auth": "<removed>"
                }
        }
}

docker login does not support log into a specific repository.

ktarplee commented 10 months ago

I am working on submitting an issue to JFrog to have them fix their auth scopes in Artifactory. I hope they are willing to conform to the spec.

ktarplee commented 10 months ago

@shizhMSFT Thanks for the response. I understand your point of view now. So given your tighter security model (BTW, is that model documented in any spec?) I am wondering if we can still fix this issue client side.

To review: when making a request to https://artifactory.example.com/v2/foo-docker/test/cp5/manifests/sha256:3fbc632167424a6d997e74f52b878d7cc478225cffac6bc977eedfe51c7f4e79, we receive a 401 with "Www-Authenticate": "Bearer realm=\"https://artifactory.example.com/v2/token\",service=\"artifactory.example.com\",scope=\"repository:test/cp5:pull\"", so the repository scope is "test/cp5" when it should be "foo-docker/test/cp5". Obviously, when the auth client requests the token, it needs to use what is provided in the "www-authenticate" header. However, that is not what needs to be stored. We just made a request to "foo-docker/test/cp5", so why not store the scope using that as the source of truth instead of what the registry provides in the www-authenticate header (which is incorrect in this case)? In other words, we have two sources of truth for the repository information: one that is always correct (it is in the request) and one that is often correct (in the www-authenticate header). This would shield us from Gateway issues (e.g., URL rewriting) or bugs in the registry. Would this be an acceptable solution?
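The proposal can be sketched as a hypothetical helper that derives the repository from the request path rather than the challenge header (repoFromPath is not an oras-go function, and the path parsing here is deliberately simplified):

```go
package main

import (
	"fmt"
	"strings"
)

// repoFromPath extracts the repository name from a distribution-spec
// request path of the form /v2/<name>/{manifests,blobs,tags,referrers}/...
// An empty string means the path is not repository-scoped.
// Simplified: assumes repository names do not contain these API segments.
func repoFromPath(p string) string {
	p = strings.TrimPrefix(p, "/v2/")
	for _, api := range []string{"/manifests/", "/blobs/", "/tags/", "/referrers/"} {
		if i := strings.Index(p, api); i >= 0 {
			return p[:i]
		}
	}
	return ""
}

func main() {
	// Artifactory's challenge claims scope "test/cp5", but the request
	// path is the source of truth: the repository is "foo-docker/test/cp5".
	fmt.Println(repoFromPath("/v2/foo-docker/test/cp5/manifests/sha256:3fbc"))
}
```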

Are there any cases where the registry would be allowed to return a repository scope that is something other than the repository in the URL?

I wonder if there are any security concerns (using your security model and current oras-go behavior) if Team B returns an auth challenge to /v2/repo-b that has a repository auth scope set to repo-a. Would Team A then be provided the token for Team B when making a request to /v2/repo-a/?

I considered this initially but it is a little harder to implement with the abstraction provided to the auth client.

shizhMSFT commented 10 months ago

is that model documented in any spec

Nope. It is not documented anywhere.


Would this be an acceptable solution?

Yes, it is one solution; I thought about it when I first designed the auth module for oras-go v2. However, I didn't have an elegant design that covers all cases (e.g., the mounting case, where there are two scopes in a token). Precisely, the auth.Client is not able to know the context of the caller, so the caller needs to pass the context (i.e., the scope used for caching) to the auth.Client. The reason I chose the current design is that it is most compatible with and optimized for compliant registries, while remaining workable (though not optimized) for non-compliant registries if someone wants to extend oras-go/v2 to support more registry features.

@ktarplee The point about the source of truth is interesting. Maybe we can leverage the scope hint and use it as the token cache key instead of parsing it from the URL path? We can keep the current behavior if there is no scope hint. /cc @Wwwsylvia


Are there any cases where the registry would be allowed to return a repository scope that is something other than the repository in the URL?

The blob mounting case is one scenario. For instance,

curl -XPOST "https://registry-1.docker.io/v2/library/foo/blobs/uploads/?from=library/bar&mount=sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a" -v

Outputs:

www-authenticate: Bearer realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/bar:pull repository:library/foo:pull,push"
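The scope parameter above carries two space-separated scopes in one header. A minimal sketch of splitting it (scopesFromChallenge is a made-up helper, not oras-go API):

```go
package main

import (
	"fmt"
	"strings"
)

// scopesFromChallenge splits the space-separated scope parameter of a
// Bearer challenge into individual scopes, e.g. for the blob mount case
// where a pull scope on the source repo accompanies the push scope.
func scopesFromChallenge(scope string) []string {
	return strings.Fields(scope)
}

func main() {
	scopes := scopesFromChallenge(
		"repository:library/bar:pull repository:library/foo:pull,push")
	for _, s := range scopes {
		fmt.Println(s)
	}
}
```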

I wonder if there are any security concerns (using your security model and current oras-go behavior) if Team B returns an auth challenge to /v2/repo-b that has a repository auth scope set to repo-a. Would Team A then be provided the token for Team B when making a request to /v2/repo-a/?

That's a good catch. I just realized that the above mounting case demonstrates that repo-b can let the client request tokens for repo-a, which is legit. I think my security model is too tight and indeed needs to be relaxed.

ktarplee commented 10 months ago

@shizhMSFT I just discovered that oras exhibits the same re-authentication behavior with NVIDIA's nvcr.io as it does with JFrog Artifactory. nvcr.io seems to have an nginx proxy in front and might be misconfigured. Specifically, if you run oras cp --debug nvcr.io/nvidia/k8s-device-plugin:v0.14.1 $TEST_REGISTRY, you will see 25 responses with status 401 carrying exactly the same challenge, namely "Www-Authenticate": "Bearer realm=\"https://nvcr.io/proxy_auth\",scope=\"repository:nvidia/k8s-device-plugin:pull,push\"". I think the issue here is slightly different from JFrog's: the repository is correct, but the action is not what oras expects. I think oras expects it to be pull (nvcr.io is the source repo in the copy), but instead the action is pull,push.

ktarplee commented 10 months ago

There is another strange behavior of nvcr.io that I observed. If you give it the wrong bearer token (one from a different repository), it responds with Www-Authenticate: Bearer realm="https://authn.nvidia.com/token",service="registry",scope="repository:nvidia/k8s-device-plugin:pull",error="insufficient_scope". Notice that the host name and URL of the realm differ from the post above. When you make the request to https://authn.nvidia.com/token to get this token, it always returns 401 and the operation fails, because that endpoint requires actual credentials (I suspect). So using the host-based auth cache in this case causes the copy to fail, because nvcr.io does not quietly ignore invalid credentials. Instead, it behaves differently when you send an invalid Authorization header (one for a different repository on the same registry) than when no Authorization header is set at all.

shizhMSFT commented 10 months ago

@ktarplee Those are interesting findings. It seems there are lots of registries in the world that do not implement auth correctly. Maybe that's why there is a wg-auth group for standardizing it.

The case of giving a wrong bearer token (for a different repository) to nvcr.io is again interesting, as the realm should be https://nvcr.io/proxy_auth. In that case, the cache cannot be used across repositories.

If we want to optimize the auth flow for non-compliant registries, a host-based cache is not sufficient. The callers have to constrain the cache so that it is only used within a single repository (I'm not sure whether this impacts the Mount case).

How about introducing MonoCache instead of RobustCache? Basically, MonoCache would have the same implementation as RobustCache, but we would clearly document that MonoCache can only be used in a single context (e.g., operations within a single repository, or just registry-level operations) for non-compliant registries (and document a list of non-compliant registries).

ktarplee commented 10 months ago

@shizhMSFT In the case of having a unique auth cache for each "context"... what benefit would the scoped cache provide over the hostCache (where we only use the hostname as the cache key)? I think the answer is that a registry might provide different tokens with different scopes (pull vs. push), and caching them all might save a few round trips. The problem is that the scope may be incorrect (e.g., Artifactory and nvcr.io have this issue), so a cache keyed on scope would be ineffective. There is no reliable way to know that a registry is non-conformant, which makes it hard for a consumer of oras-go to pick the correct auth cache implementation.

Coming at the same thing from a different angle... what would be the easiest way to have oras-go behave similarly to go-containerregistry (crane) for auth? Would that be a fresh hostCache for each repository context? By context here we mean that pulling from reg/repo1 would be a different context than pushing to reg/repo1.
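A fresh host-keyed cache per (repository, action) context could be sketched like this (toy types; contextKey and cacheFor are made-up names, not a proposed oras-go API):

```go
package main

import "fmt"

// contextKey identifies a single usage context: one registry host, one
// repository, one action (pull or push).
type contextKey struct {
	host, repo, action string
}

// caches holds one fresh host-keyed cache per context, so tokens obtained
// in one context are never replayed in another, even on the same host.
var caches = map[contextKey]map[string]string{}

// cacheFor returns the cache for the given context, creating it on demand.
func cacheFor(host, repo, action string) map[string]string {
	k := contextKey{host, repo, action}
	if caches[k] == nil {
		caches[k] = map[string]string{}
	}
	return caches[k]
}

func main() {
	pull := cacheFor("reg.example.com", "repo1", "pull")
	pull["reg.example.com"] = "pull-token"
	// The push context gets its own cache, so the pull token is not reused.
	push := cacheFor("reg.example.com", "repo1", "push")
	_, shared := push["reg.example.com"]
	fmt.Println("push context sees pull token:", shared)
}
```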

shizhMSFT commented 10 months ago

Would that be a fresh hostCache for each repository context? By context here we mean that pulling from reg/repo1 would be a different context than pushing to reg/repo1.

@ktarplee Yes, that's also what I am thinking about. We just need a proper name for this kind of cache so that people will not mis-use or abuse it.

I think ContextCache sounds like a good name.


// NewContextCache creates a host-based cache for optimizing the auth flow for
// non-compliant registries.
// This cache cannot be shared and is intended to be used in a single context,
// such as pulling from a single repository.
//
// Note: [NewCache] should be used for compliant registries as it can be shared
// across contexts.
func NewContextCache() Cache {
    panic("not implemented")
}
ktarplee commented 9 months ago

I reached out to JFrog and they opened a ticket to try and fix this issue.

Wwwsylvia commented 9 months ago

@ktarplee This is nice. Thank you for the efforts!