Open woodruffw opened 4 years ago
For those that aren't as familiar with TUF, a few additional questions about this:
- do we need this for /simple or just /simple/projectname?
Yep! Thanks for the clarifying questions.
- what hashes should we store?
TUF is using BLAKE2 for the other target metadata (i.e., the actual distribution packages), so it probably makes sense to use it here as well.
- do we need to include other metadata with the file as well? timestamp?
I don't believe so; I think just the file itself should be sufficient. @jku may be able to correct me here, if I'm missing something.
- do we need this for /simple or just /simple/projectname?
Just /simple/projectname, for TUF purposes. My understanding (again, please correct me if I'm wrong!) is that no resolvers currently use /simple, and TUF won't be using it whatsoever.
- how would TUF metadata refer to an old index?
This is probably the knottiest part. My first thought for this is that /simple/projectname should be mapped to /simple/{hash}-projectname, where {hash} is the BLAKE2 content hash. The TUF metadata would only ever refer to the {hash}-projectname variant, ensuring that we always fetch a version of the index that's consistent with the other target metadata.
Yeah this all seems correct to me. The TUF metadata for a project index will look roughly like this:
    "sampleproject": {
      "custom": {},
      "hashes": {
        "blake2b": "7a8bc0d0a15f6289b184f86a56d398467fd465c6293535182ba0f6cc2f04e703"
      },
      "length": 3080
    }
A client that sees this metadata will download https://pypi.org/simple/7a8bc0d0a15f6289b184f86a56d398467fd465c6293535182ba0f6cc2f04e703.sampleproject to ensure it's getting exactly the version of the sampleproject index it wants. This should work for all hashes mentioned in the metadata (but I suppose warehouse will only use blake2b).
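To make the shape of that entry concrete, here is a minimal sketch of how such an entry could be produced from the rendered index bytes and how a client could check a download against it. The function names are hypothetical, and the 32-byte digest size is an assumption inferred from the 64-hex-character hash in the example above; this is not Warehouse's or the TUF client's actual code.

```python
import hashlib
import hmac


def make_target_entry(index_bytes: bytes) -> dict:
    """Build a metadata entry shaped like the example above (hypothetical helper)."""
    # Assumption: a 32-byte BLAKE2b digest, i.e. 64 hex characters.
    digest = hashlib.blake2b(index_bytes, digest_size=32).hexdigest()
    return {
        "custom": {},
        "hashes": {"blake2b": digest},
        "length": len(index_bytes),
    }


def matches_entry(index_bytes: bytes, entry: dict) -> bool:
    """Return True if the downloaded bytes match the entry's length and hash."""
    if len(index_bytes) != entry["length"]:
        return False
    digest = hashlib.blake2b(index_bytes, digest_size=32).hexdigest()
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(digest, entry["hashes"]["blake2b"])
```

Checking the length before hashing is only a cheap early exit; the hash comparison is what actually pins the index to one exact version.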
I'll mention that we can of course agree on a different path for index files if e.g. you don't want to pollute the "/simple/*" namespace with so many new items. If we do that, the path should be relative to the PyPI index URL though (so that the path can be found on all warehouse instances without configuration). Something like https://pypi.org/simple/.project-indexes/ would be fine with me.
Another option would be to do something like: /simple/PROJECT/HASH/.
That would make it easy for mirrors to keep all of the related files colocated, to enable deletion and cleanup work without having to track where those files are for a specific project.
Another option would be to do something like: /simple/PROJECT/HASH/.
The reasoning is sound, but the TUF client implementation currently expects the target name to include a filename that will then be prefixed with the hash. This could of course be worked around, but alternatively something like /simple/PROJECT/HASH.index.html or /simple/PROJECT/HASH.PROJECT would work out of the box.
Yea those are fine with me.
Chatting with @ewdurbin today, we determined that /simple/PROJECT/HASH.PROJECT is required since we don't actually produce any index.html files.
A first pass at this is in #8586.
Something I did not think of when we last discussed this: it might be a good idea not to use the project name in the file name itself because of filename length limits, so I would suggest something like /simple/<PROJECT>/<HASH>.index.html.
This is not a practical issue right now (blake2b hash is 64 bytes and longest project name on pypi seems to be 80 bytes: still far from the 255 byte limit) but avoiding the potential problem seems like a good idea if doing so is painless.
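The arithmetic behind that estimate can be sketched as follows. The constants come from the comment above (a 64-character hash, a 255-byte filename limit); the helper names are hypothetical, and the actual limit depends on the filesystem a mirror uses.

```python
NAME_MAX = 255      # per-filename byte limit quoted above (filesystem dependent)
HASH_HEX_LEN = 64   # length of the blake2b hash as it appears in the filename


def hashed_filename_length(project_name: str) -> int:
    """Byte length of a '<HASH>.<PROJECT>' filename (hypothetical helper)."""
    # hash + the "." separator + the project name encoded as UTF-8
    return HASH_HEX_LEN + 1 + len(project_name.encode("utf-8"))


def fits_in_filename(project_name: str) -> bool:
    """True if '<HASH>.<PROJECT>' stays within NAME_MAX bytes."""
    return hashed_filename_length(project_name) <= NAME_MAX
```

With an 80-character name this gives 64 + 1 + 80 = 145 bytes, so under these assumptions the limit would only be reached by project names longer than 190 bytes.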
Since we can't use index.html, does /simple/<PROJECT>/<HASH> work?
The longest is 80 characters but I'm not sure where the practical limit for this comes from, if any: https://pypi.org/project/Aaaaaaaaaaaaaaaaaaa-aaaaaaaaa-aaaaaaasa-aaaaaaasa-aaaaasaa-aaaaaaasa-bbbbbbbbbbb/
The TUF client library by default assumes it's given a URL that has a filename at the end: the client library then prefixes the filename with <HASH>. before doing the request. I think I can work around that assumption in pip if /simple/<PROJECT>/<HASH> is what works for you -- but I will have to check that; I'll get back to you on Monday or Tuesday.
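The prefixing behaviour described above can be illustrated with a small sketch. This is not the TUF client library's code, only an approximation of the rule as stated in this thread; the function name is hypothetical.

```python
import posixpath


def hash_prefixed_path(target_path: str, hex_digest: str) -> str:
    """Prefix the last path component with '<HASH>.' (illustrative only)."""
    dirname, filename = posixpath.split(target_path)
    if not filename:
        # A path such as "/simple/sampleproject/" has no trailing filename,
        # which is the case the thread says needs a workaround.
        raise ValueError("target path has no final filename component")
    return posixpath.join(dirname, f"{hex_digest}.{filename}")
```

Under this rule /simple/sampleproject becomes /simple/<HASH>.sampleproject, while a path ending in a slash has nothing to prefix, which is why a layout like /simple/<PROJECT>/<HASH> does not fit the default behaviour.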
I think I can theoretically work around /simple/<PROJECT>/<HASH> in the client (pip) code, but only with an awful hack, so I won't do that.
I think the reasonable options are:
- /<HASH>.<FILENAME> -- the filename does not have to be index.html, anything will work (even something dynamic like the project name, although my earlier comments about filename length stand).
I couldn't quite follow why 'index.html' was problematic, so do let me know if the first option is not on the table: I'll have to start a discussion in the TUF community in that case.
Bumping the question about index.html -- I think I might have also missed the reason why it can't be used (either as /HASH/index.html or HASH.index.html).
Alternatively, would something like /simple/PROJECT/HASH.detail.html work?
Bumping the question about index.html -- I think I might have also missed the reason why it can't be used (either as /HASH/index.html or HASH.index.html).
It could be used but it doesn't make much sense as an endpoint within our routes -- there is no index.html file so we would sort of be hacking it in as an endpoint.
The TUF client library by default assumes it's given a URL that has a filename at the end: the client library then prefixes the filename with <HASH>. before doing the request.
As such, this is kind of a poor assumption, because virtually all of our routes don't have "filenames", including the ones in question here (unless you consider the last part of the path a filename).
If we say that project names are constrained to a maximum of 80 characters, is there any reason why /simple/<HASH>.<PROJECT_NAME> wouldn't work? That seems to be more aligned with TUF's desire to take the last piece of the URL and prepend <HASH>. to it, right?
The TUF client library by default assumes it's given a URL that has a filename at the end: the client library then prefixes the filename with <HASH>. before doing the request.
As such, this is kind of a poor assumption, because virtually all of our routes don't have "filenames", including the ones in question here (unless you consider the last part of the path a filename).
I totally agree (I can also understand how they ended up with that design -- the focus was on passive systems where the targets and metadata are pre-generated and then served by a dumb fileserver). I'm just pointing out that the URL must end with /<HASH>.<SOMENAME> or we have to do some redesign work in the TUF client API: both options are valid.
If we say that project names are constrained to a maximum of 80 characters, is there any reason why /simple/<HASH>.<PROJECT_NAME> wouldn't work? That seems to be more aligned with TUF's desire to take the last piece of the URL and prepend <HASH>. to it, right?
Sure that works.
That works for me as well! Thanks for the explanation, @di!
What's the problem this feature will solve?
As part of the TUF rollout (#7488), we will need to store hashes for the simple indices that pip and other resolvers use.
These indices are currently generated dynamically from a template when requested, making that difficult. Instead, they should be generated once per relevant event (file upload/release) and stored somewhere (probably GCS). Stale indices should not be deleted from the store, as the TUF metadata may still refer to them.
cc @ewdurbin @dstufft
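A rough sketch of the storage behaviour described above: each rendered index is written once under a content-derived key, and older versions are never removed. The IndexStore class and its methods are hypothetical, an in-memory dict stands in for the real backend (the issue suggests GCS), and the key layout follows the /simple/<HASH>.<PROJECT_NAME> form agreed on in the discussion.

```python
import hashlib


class IndexStore:
    """Append-only store for rendered simple indices (hypothetical sketch)."""

    def __init__(self) -> None:
        # Stand-in for an object store; keys are never deleted.
        self._objects: dict[str, bytes] = {}

    def publish(self, project: str, rendered_index: bytes) -> str:
        """Store one rendered index and return its content-addressed key."""
        # Assumption: 32-byte BLAKE2b digest, matching the 64-hex-char example.
        digest = hashlib.blake2b(rendered_index, digest_size=32).hexdigest()
        key = f"simple/{digest}.{project}"
        # Identical content maps to the same key, so re-publishing is a no-op.
        self._objects.setdefault(key, rendered_index)
        return key

    def get(self, key: str) -> bytes:
        """Fetch a previously published index by key."""
        return self._objects[key]
```

Calling publish on each upload or release event would leave every earlier index retrievable under its own key, so TUF metadata that still refers to an older hash keeps resolving.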