jku / pip

pip fork to experiment with PEP-458 implementation https://www.python.org/dev/peps/pep-0458/: See branch tuf-v2 (and tuf-mvp and tuf-mvp-vendored for earlier work)
https://pip.pypa.io/
MIT License

resolving tuf metadata url for Warehouse index url #5

Open jku opened 4 years ago

jku commented 4 years ago

To minimize client configuration, pip should be able to find the "TUF API endpoint" (the metadata directory) without any information other than the index url that is defined in pip.conf. This relation should be part of the Warehouse API promise.

Three choices I can think of:

  1. TUF metadata is at sibling directory of index url:
    urllib.parse.urljoin("https://pypi.org/simple/", "../tuf/") # 'https://pypi.org/tuf/'
    urllib.parse.urljoin("https://my-host.com/path/to/simple/", "../tuf/") # 'https://my-host.com/path/to/tuf/'
    urllib.parse.urljoin("https://no-path.com/", "../tuf/") # 'https://no-path.com/tuf/' <-- bug
  2. TUF metadata is at fixed path on same host
    urllib.parse.urljoin("https://pypi.org/simple/", "/tuf/") # 'https://pypi.org/tuf/'
    urllib.parse.urljoin("https://my-host.com/path/to/simple/", "/tuf/") # 'https://my-host.com/tuf/'
    urllib.parse.urljoin("https://no-path.com/", "/tuf/") # 'https://no-path.com/tuf/' <-- bug
  3. I guess there is a third option if there can be 'hidden' directories under the index url:
    urllib.parse.urljoin("https://pypi.org/simple/", ".tuf/") # 'https://pypi.org/simple/.tuf/'

    This way the index would be contained and easy to mirror/copy.

I am currently guessing the choice is option 1 and warehouse implementers are advised to not serve warehouse index from domain root to avoid the issue noted.
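For comparison, the three options can be sketched as one helper. This is only an illustration: the function name `candidate_tuf_urls` is hypothetical and not part of pip, and the trailing-slash normalization is an added assumption (without it, `urljoin` resolves relative to the parent of the last path segment).

```python
from urllib.parse import urljoin


def candidate_tuf_urls(index_url: str) -> dict:
    """Return the TUF metadata URL each option would produce (illustrative only)."""
    # urljoin treats the last path segment as a file unless the base ends in "/".
    base = index_url if index_url.endswith("/") else index_url + "/"
    return {
        "sibling": urljoin(base, "../tuf/"),  # option 1
        "fixed": urljoin(base, "/tuf/"),      # option 2
        "hidden": urljoin(base, ".tuf/"),     # option 3
    }


for url in (
    "https://pypi.org/simple/",
    "https://my-host.com/path/to/simple",
    "https://no-path.com/",
):
    print(url, candidate_tuf_urls(url))
```

Only the "hidden" variant stays under the index url for every input, including an index served from the domain root.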

woodruffw commented 4 years ago

I think this is ultimately a question for @ewdurbin and the other PyPI admins, but my personal vote is for option 1. Hosting it at a sibling path avoids assuming that every host always has a fixed path available.

ewdurbin commented 4 years ago

I agree that a sibling path is appropriate. @dstufft @di?

dstufft commented 4 years ago

I think we probably have to do something like 3 actually? Or we need some way for a repository to indicate where its TUF metadata is found. All of the above options would work for PyPI, but when downstream projects like DevPI get deployed, 2 is completely unworkable because they'll host multiple repositories under a single domain. Likewise 1 won't work for all cases either, because there's no requirement that the api live at /simple/, it could live at /, in which case there is no possibility for a sibling path.

The only thing we know for sure will work if we're doing something hardcoded, is something living under the root of the simple API, which pretty much means some sub directory that isn't a valid package name.

The only other option I can think of is some way to ask a URL where its TUF metadata lives... but that gets complicated with static mirrors like bandersnatch, because pretty much the only thing you can rely on is a statically defined header (so we could do a HEAD request to the root url?) or a well known static file (but that opens the question: if we have .tuf-location that points to where TUF lives, what are we really gaining over just mandating it's .tuf?).
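A rough sketch of the header idea, under loud assumptions: the header name `X-TUF-Metadata-Location` and the function `tuf_url_from_headers` are made up for illustration, and nothing in this thread settles on either. The function only interprets headers that the caller has already fetched.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin

# Hypothetical header name; not defined by pip, Warehouse, or any PEP.
TUF_LOCATION_HEADER = "X-TUF-Metadata-Location"


def tuf_url_from_headers(index_url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Resolve the TUF metadata URL advertised in response headers, if any."""
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get(TUF_LOCATION_HEADER.lower())
    if location is None:
        return None
    # A relative value is resolved against the index url; an absolute one is used as-is.
    return urljoin(index_url, location)


print(tuf_url_from_headers("https://pypi.org/simple/", {"x-tuf-metadata-location": ".tuf/"}))
```

The headers would come from a HEAD request to the index root, for example one built with `urllib.request.Request(url, method="HEAD")`. A static mirror can serve this with an ordinary "add header" web server setting.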

So tl;dr

  1. We can't assume we have access to anything outside the "root" of the /simple/ API.
  2. Whatever we do has to be implementable by a bunch of files sitting on disk served with a web server (we can assume the web server has typical configuration options like adding headers).

jku commented 4 years ago

Likewise 1 won't work for all cases either, because there's no requirement that the api live at /simple/, it could live at /, in which case there is no possibility for a sibling path.

Of course you could advise against hosting at '/' in your re-hosting/mirroring README. There may already be mirrors/instances hosting at '/', but even for those, current functionality would not be broken: they just might not be able to use TUF.

... but I do see your point and have to agree with the following:

The only thing we know for sure will work if we're doing something hardcoded, is something living under the root of the simple API, which pretty much means some sub directory that isn't a valid package name.

Having looked at some client code, that last bit sounds tricky in practice. E.g. .tuf is not a valid name according to PEP-0508, but clients have to deal with distributions made before PEP-0508 (and have historically been quite laissez-faire about this sort of thing)... In practice that might be fine if we make sure .tuf only contains directories (that then contain the actual metadata files)?

dstufft commented 4 years ago

Yea. If I remember correctly, as long as .tuf/ doesn't return a HTML mimetype, pip will just ignore it.

It would be useful to see what the behavior is for a character that is completely invalid in a package name. I don't remember what it is off the top of my head, but I could imagine doing something like ~tuf/, which is even less likely to collide.
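To make the naming question concrete, here is a small check against the name pattern given in PEP 508. The helper name `is_valid_project_name` is hypothetical, and the check says nothing about how pip or other clients treat legacy names in practice.

```python
import re

# Name pattern from PEP 508, matched case-insensitively.
PEP508_NAME = re.compile(r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE)


def is_valid_project_name(name: str) -> bool:
    """Return True if the name satisfies the PEP 508 name pattern."""
    return PEP508_NAME.match(name) is not None


for candidate in (".tuf", "~tuf", ".well-known", "tuf"):
    print(candidate, is_valid_project_name(candidate))
```

Both `.tuf` and `~tuf` fail the pattern because a name must start and end with a letter or digit, while plain `tuf` passes and so could collide with a real project.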

ewdurbin commented 4 years ago

Clients SHOULD already be parsing the simple api HTML... maybe a pointer to the TUF metadata location should be part of the HTML <head> somehow?

jku commented 4 years ago

If I remember correctly, as long as .tuf/ doesn't return a HTML mimetype, pip will just ignore it.

Oh this is very likely true. Good point, I was only thinking of the package name aspect.

dstufft commented 4 years ago

Clients SHOULD already be parsing the simple api HTML... maybe a pointer to the TUF metadata location should be part of the HTML <head> somehow?

Yea I mentioned something along those lines. It's workable, just kind of weird I think? The way TUF works is we're going to have TUF validate the fetch of the /simple/ page.. so we'd do this weird thing where we pull it down, ask it how to validate itself, then go fetch that to validate it. Not the end of the world (I think it's still secure), just kind of awkward.

The other awkward part of that is which response do we put it on? In theory it makes the most sense on /simple/ itself.. but that response is huge and modern clients don't actually fetch that page. So we'd probably want to put it on every page.. but I don't think that actually works? Well it does, but we basically lose TUF's protection on non existent packages (since they wouldn't have a response to have something in the <head>) unless we did something weird like do the resolving until we find a package that exists, then backtrack and validate all of our responses up until that point.. which probably makes that a non starter (and opens the question of what if 100% of the packages don't exist?).

So I think if we're using some pointer to where the TUF metadata lives, it would have to be in a singular location that a client could fetch before doing resolution. Given the problems with /simple/, that's probably a header on /simple/ so we can do a HEAD request, or some well known location (we could theoretically make it more generic and do something like .well-known/tuf-meta.json).

jku commented 4 years ago

Making sure we're on the same page: there are two different decisions here:

So a client not finding TUF metadata on the server does not mean TUF is disabled: just that the metadata may not get updated.

I don't quite understand what this means:

we basically lose TUF's protection on non existent packages

I plan to only do anything with TUF (even updating metadata) once there is a distribution URL that needs to be downloaded -- this is to avoid refreshing metadata when it's not needed.

dstufft commented 4 years ago

Doesn't accessing /simple/<foo>/ also require invoking TUF?

ewdurbin commented 4 years ago

I really like the idea of using .well-known 👍, especially given that it just so happens to not be a valid project name.

jku commented 4 years ago

Doesn't accessing /simple/<foo>/ also require invoking TUF?

The package index HTML will not be verified by TUF, only the actual distribution files -- this is my understanding, @woodruffw can verify.

ewdurbin commented 4 years ago

Upon closer inspection, it is not clear if .well-known is allowed anywhere but off of the root URI... so we are probably breaking spec if it lives at https://pypi.org/simple/.well-known

Edit: It is not. Section 3 states:

Well-known URIs are rooted in the top of the path's hierarchy; they are not well-known by definition in other parts of the path. For example, "/.well-known/example" is a well-known URI, whereas "/foo/.well-known/example" is not.

dstufft commented 4 years ago

The package index HTML will not be verified by TUF, only the actual distribution files

I'm pretty sure we lose a significant portion of the security promises of TUF if we do that, unless some other mechanism has been added, I think it's also a deviation from PEP 458 (well PEP 458 doesn't specify what installers must do, but it does indicate /simple/ pages should be TUF targets as well).

Upon closer inspection, it is not clear if .well-known is allowed anywhere but off of the root URI... so we are probably breaking spec if it lives at https://pypi.org/simple/.well-known

We could resolve that by doing /.well-known/tuf-meta.json, and have that contain a URI template that can be combined with the base url of the repository, to allow templated locations which would still support all of the use cases above... just adding the constraint that the repository must be able to put something at the root URL, and that the location for TUF must be expressible as a URI template.
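A minimal sketch of how that lookup could work. Everything specific here is an assumption for illustration: the JSON key `metadata_url_template`, the `{index_url}` placeholder, and both function names. Plain string substitution stands in for real URI template expansion.

```python
import json
from urllib.parse import urljoin


def well_known_url(index_url: str) -> str:
    """Location of the well-known document at the root of the index's host."""
    return urljoin(index_url, "/.well-known/tuf-meta.json")


def tuf_url_from_document(index_url: str, document: str) -> str:
    """Combine the template in the well-known document with the index url."""
    # Hypothetical key and placeholder; not a full URI template implementation.
    template = json.loads(document)["metadata_url_template"]
    base = index_url if index_url.endswith("/") else index_url + "/"
    return template.replace("{index_url}", base)


doc = '{"metadata_url_template": "{index_url}.tuf/"}'
print(well_known_url("https://my-host.com/path/to/simple/"))
print(tuf_url_from_document("https://my-host.com/path/to/simple/", doc))
```

Because the template is expanded per index url, one well-known file at the root could serve a host that carries several repositories, which is the DevPI case mentioned earlier.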

I dunno, I'm personally a fan of just saying $APIBASE/.tuf/ or $APIBASE/~tuf/, but if we want to do the well known route I still think it's workable.

ewdurbin commented 4 years ago

I think that going with well-known is ideal. I think it's a reasonable ask of maintainers of compliant mirrors. Perhaps we should do a very public ask?

Something like "Maintainers of PyPI mirrors! Do you host your mirror at a sub directory like /pypi/ or /simple/? Is serving a file from /.well-known/ not feasible for some reason? Let us know!" from @PyPI @ThePSF @ThePyPA

dstufft commented 4 years ago

To be clear, looking at https://theupdateframework.com/security/ I think if we're only validating the distribution files, we lose:

Unless we've started using the TUF metadata instead of the /simple/ metadata for dependency resolution.. but it doesn't sound like that's the case due to

once there is a distribution URL that needs to be downloaded

and it would also be in violation of PEP 458/503.

dstufft commented 4 years ago

I think that going with well-known is ideal. I think it's a reasonable ask of maintainers of compliant mirrors. Perhaps we should do a very public ask?

Something like "Maintainers of PyPI mirrors! Do you host your mirror at a sub directory like /pypi/ or /simple/? Is serving a file from /.well-known/ not feasible for some reason? Let us know!" from @pypi @ThePSF @ThePyPA

Should be fine to do that ask, might also be worthwhile asking cooper and uh.. whoever is maintaining DevPI these days how they feel about that solution.

jku commented 4 years ago

I'm pretty sure we lose a significant portion of the security promises of TUF if we do that, unless some other mechanism has been added, I think it's also a deviation from PEP 458

You seem to be correct, I've missed that! This is very good to hash out now... I've worked with William's Warehouse branch and I'm pretty sure that does not handle simple indexes at the moment.

I'll spend a bit of time thinking on this (and apparently re-reading the PEP) and get back to you on this.

dstufft commented 4 years ago

I'll make sure I'm on the call tomorrow in case it's easier to sort it out in a higher bandwidth medium.

jku commented 4 years ago

I'll make sure I'm on the call tomorrow in case it's easier to sort it out in a higher bandwidth medium.

I might not have been invited to that one: I am not aware of a call... Email is jkukkonen@vmware.com in case my presence would be helpful (and if timing works for UTC+3).

dstufft commented 4 years ago

Your email address is on the invite list already it appears, it would be in about 7.5 hours or so?

jku commented 4 years ago

Huh. I've found the original invite email, it's just not on my calendar... Thanks for mentioning it, I'll be there

woodruffw commented 4 years ago

The package index HTML will not be verified by TUF, only the actual distribution files -- this is my understanding, @woodruffw can verify.

This was my plan originally, but on closer reading of the PEP:

When updating bin-n metadata for a consistent snapshot, the snapshot process SHOULD also include any new or updated hashes of simple index pages in the relevant bin-n metadata. Note that, simple index pages may be generated dynamically on API calls, so it is important that their output remains stable throughout the validity of a consistent snapshot.

This is slightly annoying to handle, but shouldn't be impossible. It does, however, substantially increase the fragility of TUF target metadata w/r/t inconsequential changes to the simple index (e.g., in the unlikely event of a small typo or necessary HTML change, we'd need to backfill every single target).
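To illustrate the fragility, here is a small sketch of hashing a rendered simple index page. The sample HTML, the helper name `index_page_digest`, and the choice of SHA-256 are assumptions for illustration, not what Warehouse actually does.

```python
import hashlib


def index_page_digest(page: bytes) -> str:
    """Hash the exact bytes of a rendered simple index page."""
    return hashlib.sha256(page).hexdigest()


# Made-up page content; only the byte-level difference matters here.
original = b'<!DOCTYPE html><html><body><a href="foo-1.0.tar.gz">foo-1.0.tar.gz</a></body></html>'
reformatted = original.replace(b"<body>", b"<body>\n")

# Same links, different bytes, so the recorded target hash no longer matches.
print(index_page_digest(original) == index_page_digest(reformatted))  # False
```

Any template change that alters the output bytes, even whitespace, would invalidate the hash recorded for every affected page.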

woodruffw commented 4 years ago

It also means that the initial TUF repository setup includes another lengthy generation period, where we ask Warehouse to render the simple index for each project and hash it. I also don't think this is a dealbreaker, just something we'll need to account for.