snapshot of tuf integration branch (not a real PR)

jku commented 4 years ago

Foreword

This branch is very much a work in progress (a full 10% of the lines are "TODO"): please don't review details. I'm just hoping to validate (or even just communicate) the high-level ideas and maybe get some new insights at that level.

My current work is a little ahead of this branch but I think this is more useful for the purposes of discussion and this branch actually works (for pip install at least)...

I don't expect you to do this but if you do want to test:

get mock warehouse from git@gist.github.com:2cd1077aab8235a3a497c233090ea7e4.git
setup as described in mock warehouse README, start the server
run pip from source https://pip.pypa.io/en/stable/development/getting-started/#running-pip-from-source-tree
pip install --index-url http://localhost:8000/simple/ sampleproject

Normal flow of the tuf-related code in "pip install sampleproject"

A dictionary of updater objects is built during initialization (currently in SessionCommandMixin).
When the dependency calculation needs an index file, it ends up in LinkCollector._get_html_page(). This looks up an updater object based on the index url (currently quite unsafely), downloads the index file with tuf and returns the contents.
When a distribution needs to be downloaded, prepare.py:get_http_url() is called. This looks up an updater object based on the "comes_from" field (the url of the index file in which this distribution url was found) and downloads the target that the url refers to.
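
A rough sketch of the lookup step in the flow above, assuming the updaters live in a plain dictionary keyed by index url. The name find_updater and the prefix matching are illustrative assumptions, not code from the branch, and nothing here touches the real tuf library:

```python
from typing import Dict, Optional, TypeVar

U = TypeVar("U")


def _normalize(url: str) -> str:
    # Make sure every url ends in exactly one slash so that
    # ".../simple" and ".../simple/" compare equal.
    return url.rstrip("/") + "/"


def find_updater(updaters: Dict[str, U], url: str) -> Optional[U]:
    """Return the updater whose index url is a prefix of `url`, if any.

    `url` can be the index url itself or a "comes_from" url of an
    index page below it. This is a hypothetical helper.
    """
    candidate = _normalize(url)
    for index_url, updater in updaters.items():
        if candidate.startswith(_normalize(index_url)):
            return updater
    # No updater: the caller would use the existing non-TUF download path.
    return None
```

Plain prefix matching like this is probably part of what makes the lookup feel unsafe: it relies entirely on how the index urls are spelled in the pip configuration.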

Open questions on the flow

updaters are looked up with index_urls. If one is not found, that means TUF is not used for this download: instead the current download functionality (without TUF) is used. This feels fragile considering I don't have full knowledge of where the index urls come from... but I don't see other solutions.
Where to do the initialization is undecided: I think one of the CommandMixins is correct, possibly even a new one
There are loads of possibilities for when to "intercept" the index and distribution download code: the current places are the easiest but the decision should probably be based on what is least likely to break in future (so TUF support does not get accidentally turned off)
With the previous point in mind, I'm thinking I'll add a hard-coded warning/error for pypi.org: if we end up downloading things from pypi without TUF, that sounds like an error. I'm not sure the same can be done for files.pythonhosted.org
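
The hard-coded check could look roughly like the sketch below. TUF_REQUIRED_HOSTS, TufRequiredError and check_non_tuf_download are hypothetical names, and whether this should be an error or only a warning is still open:

```python
from urllib.parse import urlsplit

# Assumption: hosts for which a download without TUF is treated as an error.
TUF_REQUIRED_HOSTS = {"pypi.org"}


class TufRequiredError(Exception):
    """Raised when a download that should use TUF is about to bypass it."""


def check_non_tuf_download(url: str) -> None:
    """Call this just before falling back to the non-TUF download path."""
    host = urlsplit(url).hostname
    if host in TUF_REQUIRED_HOSTS:
        raise TufRequiredError(f"refusing to download {url} without TUF")
```

Calling this at the fallback point means an accidental loss of TUF support for pypi.org fails loudly instead of silently downgrading.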

Data storage

The cache is in ~/.cache/pip/. It's used as the tuf download location, so it contains everything ever downloaded with tuf.

TUF metadata is in ~/.local/share/pip/.

Open questions on data storage

metadata directory name (per repository): currently this is a hash of index url -- could it be human readable? is that even useful?
is cache cleaning required? No good ideas about this
what to do when '--no-cache' is used? I think pip has a temp directory that could be used...
the TUF api offers no way to check whether local metadata is available: I think I'll need to check manually
I think this may lead to unnecessary copying of target files -- but this may also already be the case in pip
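
To make the first and fourth questions concrete, here is a minimal sketch of a hashed directory name and a manual metadata check. SHA-256 and the idea that a root.json file sits directly inside the directory are assumptions for illustration; the branch may use a different hash and layout:

```python
import hashlib
from pathlib import Path


def metadata_dir_for(base_dir: Path, index_url: str) -> Path:
    """Map an index url to a per-repository metadata directory.

    Assumption: SHA-256 of the url is used as the directory name.
    """
    digest = hashlib.sha256(index_url.encode("utf-8")).hexdigest()
    return base_dir / digest


def has_local_root_metadata(metadata_dir: Path) -> bool:
    """Manual check: is there a root.json in this directory?

    Assumption: root.json lives directly in the metadata directory.
    """
    return (metadata_dir / "root.json").is_file()
```

A hashed name avoids escaping problems with arbitrary urls, at the cost of not being readable when someone inspects the directory by hand.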

TUF updater abstraction in pip code

This is code in src/pip/_internal/network/tuf.py. The code badly needs better naming ('Updater' and 'tuf' names are used very confusingly) -- ideas are welcome.

But the basic design is simple:

initialize_updaters() forms a dictionary of Updater objects: one object for each metadata dir found for index urls used in current pip configuration.
Updater object owns the tuf Updater object and the two mirror configurations needed (one for downloading index files, one for distribution files)
the distribution mirror config gets modified "at runtime" because we don't know what the distribution file server is before we actually get to downloading target files
Updater object offers download_distribution() and download_index() as high level functionality

So a user will first look up the correct updater using the index url of the repository, then call the download functions on that updater.
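
A stripped-down sketch of that design, with the tuf calls left out. MirrorConfig and RepositoryUpdater are hypothetical names, the argument lists are guesses, and the methods only return the url that would be fetched so the example stays runnable without a server:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MirrorConfig:
    """Hypothetical mirror description: just the url prefix to fetch from."""
    url_prefix: str


class RepositoryUpdater:
    """Hypothetical wrapper that would own the tuf Updater and both mirrors."""

    def __init__(self, index_url: str) -> None:
        self.index_mirror = MirrorConfig(index_url.rstrip("/") + "/")
        # Unknown until the first distribution download tells us the file server.
        self.distribution_mirror: Optional[MirrorConfig] = None

    def download_index(self, project_name: str) -> str:
        # A real implementation would have tuf verify and download this target.
        return f"{self.index_mirror.url_prefix}{project_name}/"

    def download_distribution(self, base_url: str, target_name: str) -> str:
        # The distribution mirror is (re)configured at runtime, per download.
        self.distribution_mirror = MirrorConfig(base_url.rstrip("/") + "/")
        return f"{self.distribution_mirror.url_prefix}{target_name}"
```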

Open questions:

Naming!
initialize_updaters() should probably return an actual object that could offer a better API than just dictionary lookup: The awful index url parsing should be in a single place at least
_split_distribution_url(): Warehouse devs were fine with my idea of finding the target name from the download url using the knowledge that the hash is X characters long, but I'm still wondering whether it really is a good idea...
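
For discussion, the splitting idea might be sketched as follows. This is not the branch's _split_distribution_url(): the hash length is passed in rather than hard-coded, and the assumptions that the hash occupies the path segments directly before the filename and that the target name keeps those segments are mine:

```python
from typing import Tuple
from urllib.parse import urlsplit, urlunsplit


def split_distribution_url(url: str, hash_len: int) -> Tuple[str, str]:
    """Split a download url into (mirror base url, target name).

    Walks backwards from the filename, collecting path segments until
    exactly `hash_len` hash characters have been consumed.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    filename = segments[-1]
    hash_segments = []
    remaining = hash_len
    index = len(segments) - 2  # segment just before the filename
    while remaining > 0 and index >= 0:
        segment = segments[index]
        if not segment or len(segment) > remaining:
            raise ValueError(f"unexpected url layout: {url}")
        hash_segments.insert(0, segment)
        remaining -= len(segment)
        index -= 1
    if remaining != 0:
        raise ValueError(f"url path too short for hash: {url}")
    base_path = "/".join(segments[: index + 1]) + "/"
    base_url = urlunsplit((parts.scheme, parts.netloc, base_path, "", ""))
    target_name = "/".join(hash_segments + [filename])
    return base_url, target_name
```

With a made-up url and hash_len=10, "https://files.example.org/packages/ab/cd/ef0123/sampleproject-1.0.tar.gz" splits into the base "https://files.example.org/packages/" and the target "ab/cd/ef0123/sampleproject-1.0.tar.gz". The fragility is visible here: any change in the url layout on the server side turns into a ValueError or, worse, a wrong split.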

MVrachev commented 4 years ago

TUF updater abstraction in pip code

This is code in src/pip/_internal/network/tuf.py. The code badly needs better naming ('Updater' and 'tuf' names are used very confusingly) -- ideas are welcome.

Not perfect, but how about UpdaterHandler? I use the same pattern in the TUF tests, where the variable that manages the server processes is named server_handler. Or maybe TufUpdater?

sechkova commented 4 years ago

For dummies (I tried the mock deployment in a new virtualenv):

You need to pip install securesystemslib[colors,crypto,pynacl] tuf at the very beginning (or at least tuf and the crypto extra; I didn't really try that)
The shebang that worked for me in the warehouse-mock scripts was #!/usr/bin/env python

but it worked like a charm (I think)

p.s. I know about it and I was still scared by the red text:

ERROR: Could not download URL: 'http://localhost:8000/tuf/3.root.json'
Traceback (most recent call last):
...
tuf.exceptions.NoWorkingMirrorError: No working mirror was found:
  'localhost:8000': HTTPError('404 Client Error: File not found for url: http://localhost:8000/tuf/3.root.json')
