pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Pre-generate and serve simple index metadata #8487

Open woodruffw opened 4 years ago

woodruffw commented 4 years ago

What's the problem this feature will solve?

As part of the TUF rollout (#7488), we will need to store hashes for the simple indices that pip and other resolvers use.

These indices are currently generated dynamically from a template when requested, making that difficult. Instead, they should be generated once per relevant event (file upload/release) and stored somewhere (probably GCS). Stale indices should not be deleted from the store, as the TUF metadata may still refer to them.

cc @ewdurbin @dstufft

di commented 4 years ago

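For those who aren't as familiar with TUF, a few additional questions about this:

  • what hashes should we store?
  • do we need to include other metadata with the file as well? timestamp?
  • do we need this for /simple or just /simple/projectname?
  • how would TUF metadata refer to an old index?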

woodruffw commented 4 years ago

Yep! Thanks for the clarifying questions.

  • what hashes should we store?

TUF is using BLAKE2 for the other target metadata (i.e., the actual distribution packages), so it probably makes sense to use it here as well.

  • do we need to include other metadata with the file as well? timestamp?

I don't believe so; I think just the file itself should be sufficient. @jku may be able to correct me here, if I'm missing something.

  • do we need this for /simple or just /simple/projectname?

Just /simple/projectname, for TUF purposes. My understanding (again, please correct me if I'm wrong!) is that no resolvers currently use /simple, and TUF won't be using it whatsoever.

  • how would TUF metadata refer to an old index?

This is probably the knottiest part. My first thought for this is that /simple/projectname should be mapped to /simple/{hash}-projectname, where {hash} is the BLAKE2 content hash. The TUF metadata would only ever refer to the {hash}-projectname variant, ensuring that we always fetch a version of the index that's consistent with the other target metadata.
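To make the proposed mapping concrete, here is a minimal sketch under stated assumptions. `SimpleIndexStore` is a hypothetical in-memory stand-in for the real object store, not Warehouse code, and the 32-byte BLAKE2b digest size is an assumption chosen because it yields a 64-character hex digest. The mutable `/simple/projectname` path resolves to the latest `/simple/{hash}-projectname` object, while older hashed objects stay fetchable.

```python
import hashlib


def content_hash(index_body):
    # Assumption: BLAKE2b with a 32-byte digest (64 hex characters).
    return hashlib.blake2b(index_body, digest_size=32).hexdigest()


class SimpleIndexStore:
    """Hypothetical in-memory stand-in for the real object store."""

    def __init__(self):
        self._objects = {}  # hashed path -> index body, never deleted
        self._current = {}  # mutable /simple/<project> path -> hashed path

    def publish(self, project, index_body):
        """Store a freshly generated index; called once per upload/release."""
        hashed_path = f"/simple/{content_hash(index_body)}-{project}"
        self._objects[hashed_path] = index_body
        self._current[f"/simple/{project}"] = hashed_path
        return hashed_path

    def fetch(self, path):
        """Serve either a hashed path or the mutable project path."""
        return self._objects[self._current.get(path, path)]


store = SimpleIndexStore()
old_path = store.publish("sampleproject", b"<html>release 1</html>")
new_path = store.publish("sampleproject", b"<html>release 1, release 2</html>")
```

After the second `publish`, `store.fetch(old_path)` still returns the first body, which is the property the TUF metadata relies on.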

jku commented 4 years ago

Yeah this all seems correct to me. The TUF metadata for a project index will look roughly like this:

   "sampleproject": {
    "custom": {},
    "hashes": {
     "blake2b": "7a8bc0d0a15f6289b184f86a56d398467fd465c6293535182ba0f6cc2f04e703"
    },
    "length": 3080
   }

A client that sees this metadata will download https://pypi.org/simple/7a8bc0d0a15f6289b184f86a56d398467fd465c6293535182ba0f6cc2f04e703.sampleproject to ensure it's getting exactly the version of the sampleproject index it wants. This should work for all hashes mentioned in the metadata (but I suppose warehouse will only use blake2b).
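As an illustration of what a client could do with such an entry once the index is downloaded, here is a rough sketch. `verify_index` is a hypothetical helper, not part of pip or the TUF client library, and inferring the BLAKE2b digest size from the length of the hex string is an assumption.

```python
import hashlib


def verify_index(body, target):
    """Check a downloaded index body against a TUF-style target entry."""
    if len(body) != target["length"]:
        return False
    expected = target["hashes"]["blake2b"]
    # Assumption: the digest size is half the hex string length (in bytes).
    actual = hashlib.blake2b(body, digest_size=len(expected) // 2).hexdigest()
    return actual == expected


# Build a matching entry for a made-up body, shaped like the metadata above.
body = b"<html>sampleproject index</html>"
target = {
    "custom": {},
    "hashes": {"blake2b": hashlib.blake2b(body, digest_size=32).hexdigest()},
    "length": len(body),
}
```

With these values `verify_index(body, target)` holds, and any change to the body (or its length) makes the check fail.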

I'll mention that we can of course agree on a different path for index files if, e.g., you don't want to pollute the "/simple/*" namespace with so many new items. If we do that, the path should be relative to the pypi index url though (so that the path can be found on all warehouse instances without configuration). Something like https://pypi.org/simple/.project-indexes/ would be fine with me.

dstufft commented 4 years ago

Another option would be to do something like: /simple/PROJECT/HASH/.

That would make it easy for mirrors to keep all of the related files colocated, to enable deletion and cleanup work without having to track where those files are for a specific project.

jku commented 4 years ago

> Another option would be to do something like: /simple/PROJECT/HASH/.

The reasoning is sound but the TUF client implementation currently expects the target name to include a filename that will then be prefixed with hash: this could of course be worked around but alternatively something like /simple/PROJECT/HASH.index.html or /simple/PROJECT/HASH.PROJECT would work out of the box.

dstufft commented 4 years ago

Yea those are fine with me.

di commented 3 years ago

Chatting with @ewdurbin today, we determined that /simple/PROJECT/HASH.PROJECT is required since we don't actually produce any index.html files.

A first pass at this is in #8586.

jku commented 3 years ago

Something I did not think of when we last discussed this: It might be a good idea to not use the project name in the file name itself because of filename length limits, so I would suggest something like /simple/<PROJECT>/<HASH>.index.html.

This is not a practical issue right now (the blake2b hex digest is 64 bytes and the longest project name on pypi seems to be 80 bytes: still far from the 255 byte limit) but avoiding the potential problem seems like a good idea if doing so is painless.
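The arithmetic behind that estimate can be written out as a small sketch. The constants are the figures quoted in this thread; the function name is hypothetical, and the extra byte accounts for the "." separator in a `<HASH>.<PROJECT>` filename.

```python
HASH_HEX_LEN = 64        # hex-encoded blake2b digest length quoted above
LONGEST_PROJECT = 80     # longest project name quoted above
FILENAME_LIMIT = 255     # per-component filename limit quoted above


def hashed_filename_length(project_len, hash_len=HASH_HEX_LEN):
    """Length in bytes of '<HASH>.<PROJECT>' for an ASCII project name."""
    return hash_len + 1 + project_len  # +1 for the '.' separator


worst_case = hashed_filename_length(LONGEST_PROJECT)
headroom = FILENAME_LIMIT - worst_case
```

Under these figures the worst case is 145 bytes, leaving 110 bytes of headroom before the limit.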

di commented 3 years ago

Since we can't use index.html, does /simple/<PROJECT>/<HASH> work?

The longest is 80 characters but I'm not sure where the practical limit for this comes from, if any: https://pypi.org/project/Aaaaaaaaaaaaaaaaaaa-aaaaaaaaa-aaaaaaasa-aaaaaaasa-aaaaasaa-aaaaaaasa-bbbbbbbbbbb/

jku commented 3 years ago

TUF client library by default assumes it's given a url that has a filename at the end: the client library then prefixes the filename with <HASH>. before doing the request. I think I can work around that assumption in pip if /simple/<PROJECT>/<HASH> is what works for you -- but I will have to check that, I'll get back to that on Monday or Tuesday.

jku commented 3 years ago

I think I can theoretically work around /simple/<PROJECT>/<HASH> in the client (pip) code, but only with an awful hack, so I won't do that.

I think the reasonable options are:

I couldn't quite follow why 'index.html' was problematic so do let me know if the first option is not on the table: I'll have to start a discussion in TUF community in that case.

woodruffw commented 3 years ago

Bumping the question about index.html -- I think I might have also missed the reason why it can't be used (either as /HASH/index.html or HASH.index.html).

Alternatively, would something like /simple/PROJECT/HASH.detail.html work?

di commented 3 years ago

> Bumping the question about index.html -- I think I might have also missed the reason why it can't be used (either as /HASH/index.html or HASH.index.html).

It could be used but it doesn't make much sense as an endpoint within our routes -- there is no index.html file so we would sort of be hacking it in as an endpoint.

> TUF client library by default assumes it's given a url that has a filename in the end: the client library then prefixes the filename with <HASH>. before doing the request.

As such, this is kind of a poor assumption, because virtually all of our routes don't have "filenames", including the ones in question here (unless you consider the last part of the path a filename).

If we say that project names are constrained to a maximum of 80 characters, is there any reason why /simple/<HASH>.<PROJECT_NAME> wouldn't work? That seems to be more aligned with TUF's desire to take the last piece of the URL and prepend <HASH>. to it, right?

jku commented 3 years ago

> TUF client library by default assumes it's given a url that has a filename in the end: the client library then prefixes the filename with <HASH>. before doing the request.

> As such, this is kind of a poor assumption, because virtually all of our routes don't have "filenames", including the ones in question here (unless you consider the last part of the path a filename).

I totally agree (I can also understand how they ended up with that design -- the focus was on passive systems where the targets and metadata are pre-generated and then served by a dumb fileserver). I'm just pointing out that the URL must end with /<HASH>.<SOMENAME> or we have to do some redesign work in the TUF client API: both options are valid.
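To illustrate the constraint, here is a minimal sketch of the prefixing behavior as described in this thread. `hash_prefixed_path` is a hypothetical helper written for illustration, not the TUF client library's actual code.

```python
import posixpath


def hash_prefixed_path(target_path, digest):
    """Prefix the final path segment with '<digest>.' as described above."""
    dirname, filename = posixpath.split(target_path)
    if not filename:
        # A path ending in '/' has no filename segment to prefix.
        raise ValueError("target path must end in a filename")
    return posixpath.join(dirname, f"{digest}.{filename}")


digest = "7a8bc0d0a15f6289b184f86a56d398467fd465c6293535182ba0f6cc2f04e703"
flat = hash_prefixed_path("/simple/sampleproject", digest)
nested = hash_prefixed_path("/simple/sampleproject/index.html", digest)
```

Here `flat` has the `/simple/<HASH>.<SOMENAME>` shape, `nested` has the `/simple/PROJECT/HASH.index.html` shape, and a target such as `/simple/sampleproject/` is rejected because there is no trailing filename to prefix.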

> If we say that project names are constrained to a maximum of 80 characters, is there any reason why /simple/<HASH>.<PROJECT_NAME> wouldn't work? That seems to be more aligned with TUF's desire to take the last piece of the URL and prepend <HASH>. to it, right?

Sure that works.

woodruffw commented 3 years ago

That works for me as well! Thanks for the explanation, @di!