thoth-station / solver

Dependency solver for the Thoth project
https://thoth-station.ninja/
GNU General Public License v3.0

Solver should report exact package hash that was used to install a package #5102

Open fridex opened 2 years ago

fridex commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently, Thoth provides in the lockfile all the artifact hashes that were found on the index and lets the pip installation procedure pick a suitable artifact. Instead, Thoth should point to the exact Python artifact that should be used during the installation process, to make sure proper auditing is done.

Describe the solution you'd like

Gregory-Pereira commented 2 years ago

/assign @Gregory-Pereira

goern commented 2 years ago

/priority important-soon

Gregory-Pereira commented 2 years ago

I am not too familiar with solver, so pardon my questions as I get up to speed. When you say Thoth should point to an exact Python artifact that should be used during the installation process, do you mean that a hash should be computed for the result of solver, linking the OS, Python version, and resulting dependency versions? Or would this Python artifact refer to every individual dependency?

fridex commented 2 years ago

Check TensorFlow wheels published on PyPI as an example - https://pypi.org/project/tensorflow/2.7.0/#files

There are macOS, Windows, and manylinux builds specific to certain Python versions (e.g. Python 3.7, 3.8, 3.9). As of now we point users to tensorflow==2.7.0 from PyPI and provide all the artifact hashes (so that pip picks the right build on the client side). In an ideal scenario, Thoth should give back just one hash pointing to the specific artifact that should be used to install tensorflow==2.7.0. That can be, for example, tensorflow-2.7.0-cp39-cp39-manylinux2010_x86_64.whl if users run Linux and use Python 3.9 (on x86_64).
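
As an illustration of the compatibility information encoded in such a filename, here is a minimal sketch using the packaging library (the filename is the manylinux build mentioned above; packaging is a standard third-party library, not part of thoth-solver):

import packaging.utils

# Parse the wheel filename mentioned above; the tags encode
# interpreter, ABI, and platform compatibility.
name, version, build, tags = packaging.utils.parse_wheel_filename(
    "tensorflow-2.7.0-cp39-cp39-manylinux2010_x86_64.whl"
)
for tag in tags:
    print(name, version, tag.interpreter, tag.abi, tag.platform)
    # -> tensorflow 2.7.0 cp39 cp39 manylinux2010_x86_64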

Gregory-Pereira commented 2 years ago

So we would save a bunch of these hashes that correspond to a specific version of a package, the OS, and the Python version on the Thoth server / API side? I guess what's confusing me about this is how it fits in with the rest of the resolver. For instance, my understanding is that solver allows you to pass in some package versions and constraints, e.g.:

cycler>=0.10.
kiwisolver==1.2.
matplotlib<=3.2.1
numpy==1.18.5

(note this example is completely made up; I don't know if there is a solution for these dependencies / versions).

It will then recursively resolve all dependencies and transitive dependencies that would work for these rules. I understand how that would work if we are looking at a specified package version, but what about cycler and matplotlib in this example? Would it then just recursively try all the versions of those two that meet the requirement, and then for each of those first look for this version-specific hash that we are discussing?

fridex commented 2 years ago

> So we would save a bunch of these hashes that correspond to a specific version of a package, the OS, and the Python version on the Thoth server / API side?

We already have them on the Thoth server side (Thoth is a cloud/server-side resolver). The thing is that we are missing the OS + Python version linkage.

> I guess what's confusing me about this is how it fits in with the rest of the resolver. For instance, my understanding is that solver allows you to pass in some package versions and constraints, e.g.:
>
> cycler>=0.10.
> kiwisolver==1.2.
> matplotlib<=3.2.1
> numpy==1.18.5
>
> (note this example is completely made up; I don't know if there is a solution for these dependencies / versions).
>
> It will then recursively resolve all dependencies and transitive dependencies that would work for these rules. I understand how that would work if we are looking at a specified package version, but what about cycler and matplotlib in this example? Would it then just recursively try all the versions of those two that meet the requirement, and then for each of those first look for this version-specific hash that we are discussing?

The resolver is using temporal difference learning (so no "recursive tries" per se). We use this "solver" component to aggregate information about packages for the resolver itself - so solver will just get the corresponding hashes more accurately, which are subsequently used by the server-side resolver.

Gregory-Pereira commented 2 years ago

So for each dependency, as it's getting installed, I am able to grab its SHA256. However, the way pip does its hashes is per file in said package, and not all files in a package may have a SHA. I stuck with selinon as one of my examples; while in a pipenv shell I ran: ./thoth-solver --verbose python -r 'selinon==1.0.0' -o solver-output-selinon-1.0.0-darwin.json. While still in the shell I navigated to where the site packages were installed: ~/.local/share/virtualenvs/solver-CEbDbFsW/lib/python3.8/site-packages/.

Located in this folder were two directories related to selinon: selinon/ and the distribution information selinon-1.0.0.dist-info/. The first of these of course held all the files that made up the package and the other held all the details related to its distribution. The RECORD file there showed the files installed in the package and their SHA values, if they had any, e.g.:

...
selinon/caches/__pycache__/lifo.cpython-38.pyc,,
selinon/caches/__pycache__/lru.cpython-38.pyc,,
selinon/caches/__pycache__/mru.cpython-38.pyc,,
selinon/caches/__pycache__/rr.cpython-38.pyc,,
selinon/caches/fifo.py,sha256=bZxu6sh_EelPUqSp6clYbTPjSrcc4Ok52AFHHc90aAA,2679
selinon/caches/lifo.py,sha256=-Db8LACEHNtO2-magPmdxWLDgNoy2_NXxAOWppRJecE,694
selinon/caches/lru.py,sha256=q6o2uyvMZoz8I0y1z0L3saHxzRKQDGbawnRPw8B3c5g,4338
selinon/caches/mru.py,sha256=6Y7MO1KX8rysPuFnsm8lbyxEvVQo3PlFxznHqrUVByQ,655
selinon/caches/rr.py,sha256=_trRSY5aHAE5AynRJERUAGTVBXoynzTAi8LuK8QyPRU,2458
selinon/celery.py,sha256=pkl7o7g-GyLMVvSN4OG7GDONM3RbpjT5-vaqf6GpnP0,1212
selinon/cli.py,sha256=8Ll6xGKaejY-YNe2Q2dt11vA6KDIxDt8xGwlpACLJBE,17332
selinon/codename.py,sha256=j6348ZPG6ml31-4v15Fj6jXneSjgneSSvkN3T_TQccs,38
selinon/config.py,sha256=EAEFVe0Is1iZZ7jFh2BEUhEttTF4E69WBYzTA73Plho,14363
selinon/data_storage.py,sha256=pFdIPwl5EJiVcFX1xo3XODmKNsSehMrz1d2mbHTYdzw,3147
selinon/dispatcher.py,sha256=StskSf2EbUq75Mtu-qJgvgOi6qZplrcuiXEnWwV7x2Q,9739
selinon/edge.py,sha256=bGZRrKdC1PKIsQ4LAYyuIuK6rlAeR0ticioZB-PBi_M,10162
selinon/errors.py,sha256=znvu-WKToPWa0UZbYHS_l3s3Xyxdd9qBbLS68iMm24c,5295
selinon/executor/__init__.py,sha256=4--nBjb69cDYdX9xvN9_maTXPq13Ki1zg1zOq-nmyS0,95
...

I thought there would be a way to grab a single hash for a package, but I'm not sure I am looking in the right place; maybe this would be located somewhere on PyPI, but I haven't found it yet. Maybe I will need to save all the individual hashes, or import some other library or package such as pip-compile, but I wanted to ask here first if there is a better strategy.
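
For what it is worth, the single per-artifact hash pip verifies against is the SHA-256 of the distribution file itself (the .whl or .tar.gz), not anything stored inside RECORD. A minimal sketch of computing it for a locally downloaded artifact (the path below is hypothetical; something like pip download pyroaring==0.3.3 --no-deps -d ./wheels could fetch the file first):

import hashlib

def artifact_sha256(path):
    # Hash the artifact file in chunks to avoid loading it fully into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical local path to a downloaded wheel.
print(artifact_sha256("./wheels/pyroaring-0.3.3-cp39-cp39-macosx_10_14_x86_64.whl"))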

I plan to build this out on the result object:

"thoth-wheels": [
      {
        "pyperclip-1.8.2-darwin-21.2.0-x86_64": "105254a8b04934f0bc84e9c24eb360a591aaf6535c9def5f29d92af107a9bf57"
      }
]

It will have this format: an object with a key of <package>-<version>-<os>-<os-release>-<arch> and a value of the package SHA.
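
A minimal sketch of how such a key could be assembled from the standard library's platform module (the helper name is made up for illustration; it is not existing thoth-solver code):

import platform

def thoth_wheel_key(package, version):
    # e.g. "pyperclip-1.8.2-darwin-21.2.0-x86_64" on an Intel Mac running Darwin 21.2.0
    return "-".join([
        package,
        version,
        platform.system().lower(),  # "darwin", "linux", ...
        platform.release(),         # kernel release, e.g. "21.2.0"
        platform.machine(),         # "x86_64", "arm64", ...
    ])

print(thoth_wheel_key("pyperclip", "1.8.2"))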

Let me know if I am missing or misunderstanding anything.

fridex commented 2 years ago

Nice research.

Sadly, these hashes will not be part of the artifacts, as the artifact hash is computed from the artifact content, which makes it a chicken-and-egg problem.

As of now, we obtain all the artifact hashes in this function:

https://github.com/thoth-station/solver/blob/28601d1dc22bcc9556ccce0a7fd6fddf6c5dcc75/thoth/solver/python/python.py#L212-L222

Ideally, thoth-solver could perform a pip install for each artifact with its hash:

python3 -m pip install --no-deps --no-cache-dir pyperclip==1.8.2 --hash=sha256:XYZ

here:

https://github.com/thoth-station/solver/blob/28601d1dc22bcc9556ccce0a7fd6fddf6c5dcc75/thoth/solver/python/python.py#L97

A brute-force approach would try all the artifacts; pip should report when an artifact is not suitable for the runtime environment. That fact can become part of the report. If the artifact is installable, thoth-solver can report its dependencies.

Gregory-Pereira commented 2 years ago

So, quick update. Firstly, the only way I could successfully use the hashes when installing a pip package was to put them into a requirements file (I am using temp-requirements.txt) and provide the hash there, and then call it from thoth-solver with the --require-hashes flag. It currently works, but this may create a little more overhead.
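
A rough sketch of that flow, assuming one artifact hash is tried at a time (the helper and file names are illustrative, not the actual implementation):

import subprocess
import tempfile

def try_install_with_hash(package, version, sha256):
    # pip only honours --hash inside a requirements file, hence the temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as req:
        req.write(f"{package}=={version} --hash=sha256:{sha256}\n")
        req_path = req.name
    result = subprocess.run(
        ["python3", "-m", "pip", "install", "--no-deps", "--no-cache-dir",
         "--require-hashes", "-r", req_path],
        capture_output=True,
    )
    # A zero exit code means pip found a matching, installable artifact for this
    # environment; a failure means the hash/artifact is not suitable.
    return result.returncode == 0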

Second, of the list of SHA package hashes, sometimes multiple can actually work, for example if the PyPI package provides both a .whl and a source .tar.gz distribution. Because both can be successfully installed with pip, both could be considered "the correct version", so I think we should pass both to the resolver. However, if desired, we could use some other criteria to pick the better artifact if we only want one solution stored on the Thoth server side (per package / package version / system / system distribution), such as package size, version number, etc. Currently, however, this is how the thoth-wheels are looking:

"thoth-wheels": {
      "pyroaring-0.3.3-darwin-21.2.0-x86_64": [
        {
          "pyroaring-0.3.3-cp39-cp39-macosx_10_14_x86_64.whl": "399730714584ec47b05978cc00b737478a10e2a6a8fed94d886fd0b25c522b05"
        },
        {
          "pyroaring-0.3.3.tar.gz": "232bf4cbdd7a1dad885171d9d7e59da5324b3d70c15a96a240f1319b870b46b7"
        }
      ]
    }

Is this acceptable? Or should I try to resolve it to only one artifact, and if so, what criteria should be used?

For context on the next two points, these are the packages and respective versions I have been using to test thoth-solver:

selinon == 1.0.0
pyperclip == 1.8.2
pyroaring == 0.3.3
pytorch == 1.0.2
tensorflow == 2.7.0

With the brute-force approach to testing which SHAs work with which environment, I am running into issues for bigger packages (selinon and tensorflow). I ran my local feature branch version of thoth-solver today in the background for selinon 1.0.0 (with transitive dependencies) and it didn't finish after about 40 minutes. I am going to see what I can do in the way of efficiency today.

Also, when testing I encountered a potential issue. This was specifically for the pytorch package, so I am not certain it will apply to other packages. Its install instructions listed on PyPI are pip install pytorch, or pip install pytorch==1.0.2 for the specific package version; however, it fails to install because, as it says on the PyPI package page, "You tried to install 'pytorch'. The package named for PyTorch is 'torch'." Is it normal for PyPI packages to be renamed in a manner such as this, and if so, is this something we should support in solver, or is it an edge case?

KPostOffice commented 2 years ago

Are the hashes available in the package index warehouse useful for this problem at all? See: https://pypi.org/pypi/tensorflow/json.
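
For context, that endpoint already lists the per-artifact digests for each release; a minimal sketch of reading them (using the tensorflow 2.7.0 release discussed earlier as an example):

import json
import urllib.request

# Fetch the project metadata from the PyPI JSON API linked above.
with urllib.request.urlopen("https://pypi.org/pypi/tensorflow/json") as resp:
    data = json.load(resp)

for artifact in data["releases"].get("2.7.0", []):
    # Each entry carries the artifact filename and its SHA-256 digest.
    print(artifact["filename"], artifact["digests"]["sha256"])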

goern commented 2 years ago

moin all, any progress on this? is https://github.com/thoth-station/solver/pull/5110#discussion_r798006711 the blocker? @fridex could you work on it?

Gregory-Pereira commented 2 years ago

So I am not sure if this is a valid solution to address Frido's comment, but I was thinking about adding the -vv flag to the pip install command and parsing the hash directly out of the resulting stdout (see code). I would love to get others' opinions on this, as it doesn't seem very robust, but it would be the solution with the lowest overhead, since the resulting wheel information comes directly from the one and only install command. Is this what Frido was talking about when he said it would live in the part of the code "that does the actual dependency extraction"?

I also looked a bit into what Kevin was saying about the PyPI package index warehouse. I am not certain this would be useful to us, because we already store the hash of every artifact for the release we are using; what we are attempting to ascertain is which artifact is the best for a given package, package version, and environment (OS, distro, etc.), and to persist that on the Thoth side. We could take a pretty decent guess from the warehouse JSON, for instance, that for release 2.0.0, artifact index 6 with the filename "tensorflow-2.0.0-cp36-cp…anylinux2010_x86_64.whl" would work for any Linux distro with x86_64 architecture and Python 3.6; however, we really should test that it installs properly and not just take an educated guess based on the filename. Since that is the case, this functionality really should come from the install command rather than an endpoint.
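
The "educated guess based on filename" described above could be made with the packaging library's tag machinery, though, as noted, only an actual install confirms installability; a rough sketch (the helper name is made up):

from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename

def wheel_looks_compatible(filename):
    # Best-effort guess: do the wheel's declared tags intersect
    # the tags supported by the current interpreter and platform?
    _, _, _, wheel_tags = parse_wheel_filename(filename)
    supported = set(sys_tags())
    return any(tag in supported for tag in wheel_tags)

# Still only a guess; a real pip install test is what confirms installability.
print(wheel_looks_compatible("pyroaring-0.3.3-cp39-cp39-macosx_10_14_x86_64.whl"))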

goern commented 2 years ago

@fridex is this something to move forward?

/sig stack-guidance

Gregory-Pereira commented 2 years ago

I was told that Thoth-Station is prioritizing stabilizing the system before introducing new changes, so this might hang for a bit. /lifecycle frozen

codificat commented 2 years ago

Based on the history so far, my understanding is that this is /triage accepted but /priority important-longterm /remove-priority important-soon