peterbe / hashin

Helping you write hashed entries for packages in your requirements.txt
https://www.peterbe.com/plog/hashin
MIT License
105 stars 27 forks source link

Order of hashes no longer canonical #105

Closed SomberNight closed 5 years ago

SomberNight commented 5 years ago

Since version 0.14, the order of hashes does not seem to be deterministic. Before 0.14, the hashes used to be lexicographically ordered.

I am guessing this was changed in https://github.com/peterbe/hashin/pull/95

What is the reason for this?

We use hashin in electrum for deterministic builds. The requirements files are generated by this script. The freeze_packages.sh script is manually rerun to update dependencies (see e.g. https://github.com/spesmilo/electrum/commit/ba4af29bf82f7addb68c3ed996002665313ab223). If the order of hashes for a given package is not canonical, then rerunning the script can result in changes for packages that have not actually been updated, making the diff considerably less readable.

peterbe commented 5 years ago

The original issue was: https://github.com/peterbe/hashin/issues/93

The reasoning was that when the code that determines whether to edit the list of hashes kick in, it should be agnostic of order. Because, the order of them does not matter to pip. Also, it doesn't really make sense to sort something like that.

Also, I don't understand why you point out this commit. There's no change of order of hashes there.

peterbe commented 5 years ago

By the way, I couldn't help but to notice...

for requirement in $requirements; do
    echo -e "\r  Hashing $requirement..."
    $other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirement}
done

Can, in theory, be replaced with:

echo -e "\r  Hashing $requirements..."
$other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirements}

Referring to this.

SomberNight commented 5 years ago

Also, I don't understand why you point out this commit. There's no change of order of hashes there.

Yes, there is no change there. I just linked to that as an example of rerunning the script. I have just locally rerun the script again, and noticed this ordering issue now.

The reasoning was that when the code that determines whether to edit the list of hashes kick in, it should be agnostic of order. Because, the order of them does not matter to pip. Also, it doesn't really make sense to sort something like that.

If pip is agnostic to it, it might as well be canonical. :) It makes sense to sort it for the reason I describe, to make diffs that update these lists less cluttered.

Thanks for the code cleanup suggestion.

peterbe commented 5 years ago

Actually, I think the order is deterministic and it's based on PyPI. What I mean is that it downloads, for example, https://pypi.org/pypi/hashin/json and based on the info it figures out the version and then whichever order they releases are in that version, in the JSON, is the order it writes it to the requirements file.

E.g. when you run

▶ curl https://pypi.org/pypi/hashin/json | jq '.releases["0.14.1"]'
[
  {
    "comment_text": "",
    "digests": {
      "md5": "29526146d5c8d4401bd74b5b329613ea",
      "sha256": "5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1"
    },
    "downloads": -1,
    "filename": "hashin-0.14.1-py3-none-any.whl",
    "has_sig": false,
    "md5_digest": "29526146d5c8d4401bd74b5b329613ea",
    "packagetype": "bdist_wheel",
    "python_version": "py3",
    "requires_python": ">=2.7,!=3.0,!=3.1,!=3.2,!=3.3",
    "size": 15387,
    "upload_time": "2018-12-13T01:29:32",
    "url": "https://files.pythonhosted.org/packages/da/f7/fe7093f1a4bed62f34000c710130ee15a99638e46717f18d0281a6cacff9/hashin-0.14.1-py3-none-any.whl"
  },
  {
    "comment_text": "",
    "digests": {
      "md5": "c7bc5c943853fc192bc01471bbcfa771",
      "sha256": "43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b"
    },
    "downloads": -1,
    "filename": "hashin-0.14.1.tar.gz",
    "has_sig": false,
    "md5_digest": "c7bc5c943853fc192bc01471bbcfa771",
    "packagetype": "sdist",
    "python_version": "source",
    "requires_python": ">=2.7,!=3.0,!=3.1,!=3.2,!=3.3",
    "size": 20941,
    "upload_time": "2018-12-13T01:29:34",
    "url": "https://files.pythonhosted.org/packages/72/9e/971798d201c1a67d66c343d36c335708c4c4585e5afda04e3c2bb046e91c/hashin-0.14.1.tar.gz"
  }
]

the order is

  1. 5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1
  2. 43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b

and that is always the case. So, what you get in your requirements file is something like this:

--- Old
+++ New
@@ -0,0 +1,3 @@
+hashin==0.14.1 \
+    --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1 \
+    --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b

What https://github.com/peterbe/hashin/issues/93 was about was being smarter about "merging" the hashes. For example, suppose I take that diff above, but manually, in my editor, rearrange the order so it looks like this:

▶ cat /tmp/reqs.txt
hashin==0.14.1 \
    --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1 \
    --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b

▶ code /tmp/reqs.txt

▶ cat /tmp/reqs.txt
hashin==0.14.1 \
    --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b \
    --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1

(note that the only change is the order of the hashes).

Now, if I were to run hashin again...:

▶ python hashin.py hashin -r /tmp/reqs.txt --dry-run

You get an empty output because the hashes have NOT changed according to hashin. It ignores the fact that the order is "wrong".

Basically, as long as you don't manually "mess" with the order of the hashes, it is deterministic. Always. And even if you did mess with the order of the hashes hashin won't try to fix that.

So, no the order is not canonical but it's certainly deterministic as long as PyPI.org is.

mythmon commented 5 years ago

I think that Hashin should take a stronger approach here than "whatever PyPI.org does". Consistency is useful, both for deterministic builds and for smaller, easier to read diffs. I think that sorting the hashes in the file anytime Hashin writes them goes a long way towards that end.

Deterministic build are important because of projects like Debian's ReproducibleBuilds, which strive to increase trust and verifiability of packages. These are some of the same goals that hashing packages in Python is intended to solve.

Another tool that is similar to Hashin also uses sorting to help consistency: Yarn. It sorts entries both in package.json and also yarn.lock. It doesn't ever have lists of hashes, but it sorts dependencies alphabetically when listing them.

peterbe commented 5 years ago

@di Do you have an opinion on this one? Should the requirements file that hashin builds sort the hashes lexicographically ordered, no matter what, or should it sort them by the same way that PyPI presents them?

I would argue that it's a no-problem because the only time it comes up is if you compare two different requirements files made in two different places when PyPI might have returned files in a different order. E.g.

# On machine X
$ hashin Django -r reqs-x.txt
# On machine Y
$ hashin Django -r reqs-y.txt
$ rsync reqs-y.txt machineX:
# On machine X
# This *will* diff if the order of files 
# on https://pypi.org/pypi/Django/json changed over time. 
$ diff reqs-x.txt reqs-y.txt

Contrast with doing it all and always on the same machine:

# Monday
$ hashin Django==2.1.5 -r /path/to/reqs.txt
# Tuesday
# This will ONLY diff if the SET of hashes has changed between Monday and Tuesday
$ hashin Django==2.1.5 -r /path/to/reqs.txt --dry-run

Essentially, what hashin does when it figures out to bother changing the requirements file is this:

if set(hashes_from_parsing_existing_file) != set(hashes_from_pypi_download):
    amend_requiments_file(hashes_from_pypi_download)
SomberNight commented 5 years ago

if you compare two different requirements files made in two different places when PyPI might have returned files in a different order

In Electrum's case, the file is not always generated by the same developer. I think it's unrealistic to assume in general that frozen dependency files are always updated from the same machine.

di commented 5 years ago

@peterbe Thanks for looping me in.

I would expect the hashes to be sorted lexicographically.

The ordering from PyPI doesn't have much meaning to users: it is ordered by the filename of the distribution artifact. Since hashin doesn't include filenames in it's output, this ordering wouldn't make much sense to a user trying to understand why hashin insists on it.

peterbe commented 5 years ago

I get the feeling that all y'all people are saying that it would be good to NOT rely on the sorting of PyPI and instead always write down the hashes lexicographically sorted. Yes, it has the advantage that the files created a more predictable when compared as a whole.

I don't think the existing logic that determines whether it should amend the hashes should change but when it does it should sort the hashes before writing them.

SomberNight commented 5 years ago

Yes I think that's a good solution. To restate, if the set of hashes already in the file and the new set of hashes are equal, it's okay to leave the file as is (ignore order when comparing, detect as no change); but if the sets are unequal, what gets written should be lexicographically sorted.

peterbe commented 5 years ago

Anybody willing to take a look at https://github.com/peterbe/hashin/pull/110 ?

The unit tests plus my extensive manual testing (in the PR description) makes me pretty confident already but, you know, more eyes etc.

peterbe commented 5 years ago

This is now out in version 0.14.5.