Closed SomberNight closed 5 years ago
The original issue was: https://github.com/peterbe/hashin/issues/93
The reasoning was that when the code that determines whether to edit the list of hashes kick in, it should be agnostic of order. Because, the order of them does not matter to pip
. Also, it doesn't really make sense to sort something like that.
Also, I don't understand why you point out this commit. There's no change of order of hashes there.
By the way, I couldn't help but to notice...
for requirement in $requirements; do
echo -e "\r Hashing $requirement..."
$other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirement}
done
Can, in theory, be replaced with:
echo -e "\r Hashing $requirements..."
$other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirements}
Referring to this.
Also, I don't understand why you point out this commit. There's no change of order of hashes there.
Yes, there is no change there. I just linked to that as an example of rerunning the script. I have just locally rerun the script again, and noticed this ordering issue now.
The reasoning was that when the code that determines whether to edit the list of hashes kick in, it should be agnostic of order. Because, the order of them does not matter to pip. Also, it doesn't really make sense to sort something like that.
If pip is agnostic to it, it might as well be canonical. :) It makes sense to sort it for the reason I describe, to make diffs that update these lists less cluttered.
Thanks for the code cleanup suggestion.
Actually, I think the order is deterministic and it's based on PyPI. What I mean is that it downloads, for example, https://pypi.org/pypi/hashin/json and based on the info
it figures out the version and then whichever order they releases are in that version, in the JSON, is the order it writes it to the requirements file.
E.g. when you run
▶ curl https://pypi.org/pypi/hashin/json | jq '.releases["0.14.1"]'
[
{
"comment_text": "",
"digests": {
"md5": "29526146d5c8d4401bd74b5b329613ea",
"sha256": "5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1"
},
"downloads": -1,
"filename": "hashin-0.14.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "29526146d5c8d4401bd74b5b329613ea",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=2.7,!=3.0,!=3.1,!=3.2,!=3.3",
"size": 15387,
"upload_time": "2018-12-13T01:29:32",
"url": "https://files.pythonhosted.org/packages/da/f7/fe7093f1a4bed62f34000c710130ee15a99638e46717f18d0281a6cacff9/hashin-0.14.1-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "c7bc5c943853fc192bc01471bbcfa771",
"sha256": "43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b"
},
"downloads": -1,
"filename": "hashin-0.14.1.tar.gz",
"has_sig": false,
"md5_digest": "c7bc5c943853fc192bc01471bbcfa771",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=2.7,!=3.0,!=3.1,!=3.2,!=3.3",
"size": 20941,
"upload_time": "2018-12-13T01:29:34",
"url": "https://files.pythonhosted.org/packages/72/9e/971798d201c1a67d66c343d36c335708c4c4585e5afda04e3c2bb046e91c/hashin-0.14.1.tar.gz"
}
]
the order is
5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1
43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b
and that is always the case. So, what you get in your requirements file is something like this:
--- Old
+++ New
@@ -0,0 +1,3 @@
+hashin==0.14.1 \
+ --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1 \
+ --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b
What https://github.com/peterbe/hashin/issues/93 was about was being smarter about "merging" the hashes. For example, suppose I take that diff above, but manually, in my editor, rearrange the order so it looks like this:
▶ cat /tmp/reqs.txt
hashin==0.14.1 \
--hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1 \
--hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b
▶ code /tmp/reqs.txt
▶ cat /tmp/reqs.txt
hashin==0.14.1 \
--hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b \
--hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1
(note that the only change is the order of the hashes).
Now, if I were to run hashin
again...:
▶ python hashin.py hashin -r /tmp/reqs.txt --dry-run
You get an empty output because the hashes have NOT changed according to hashin
. It ignores the fact that the order is "wrong".
Basically, as long as you don't manually "mess" with the order of the hashes, it is deterministic. Always. And even if you did mess with the order of the hashes hashin
won't try to fix that.
So, no the order is not canonical but it's certainly deterministic as long as PyPI.org is.
I think that Hashin should take a stronger approach here than "whatever PyPI.org does". Consistency is useful, both for deterministic builds and for smaller, easier to read diffs. I think that sorting the hashes in the file anytime Hashin writes them goes a long way towards that end.
Deterministic build are important because of projects like Debian's ReproducibleBuilds, which strive to increase trust and verifiability of packages. These are some of the same goals that hashing packages in Python is intended to solve.
Another tool that is similar to Hashin also uses sorting to help consistency: Yarn. It sorts entries both in package.json
and also yarn.lock
. It doesn't ever have lists of hashes, but it sorts dependencies alphabetically when listing them.
@di Do you have an opinion on this one? Should the requirements file that hashin
builds sort the hashes lexicographically ordered, no matter what, or should it sort them by the same way that PyPI presents them?
I would argue that it's a no-problem because the only time it comes up is if you compare two different requirements files made in two different places when PyPI might have returned files in a different order. E.g.
# On machine X
$ hashin Django -r reqs-x.txt
# On machine Y
$ hashin Django -r reqs-y.txt
$ rsync reqs-y.txt machineX:
# On machine X
# This *will* diff if the order of files
# on https://pypi.org/pypi/Django/json changed over time.
$ diff reqs-x.txt reqs-y.txt
Contrast with doing it all and always on the same machine:
# Monday
$ hashin Django==2.1.5 -r /path/to/reqs.txt
# Tuesday
# This will ONLY diff if the SET of hashes has changed between Monday and Tuesday
$ hashin Django==2.1.5 -r /path/to/reqs.txt --dry-run
Essentially, what hashin
does when it figures out to bother changing the requirements file is this:
if set(hashes_from_parsing_existing_file) != set(hashes_from_pypi_download):
amend_requiments_file(hashes_from_pypi_download)
if you compare two different requirements files made in two different places when PyPI might have returned files in a different order
In Electrum's case, the file is not always generated by the same developer. I think it's unrealistic to assume in general that frozen dependency files are always updated from the same machine.
@peterbe Thanks for looping me in.
I would expect the hashes to be sorted lexicographically.
The ordering from PyPI doesn't have much meaning to users: it is ordered by the filename of the distribution artifact. Since hashin
doesn't include filenames in it's output, this ordering wouldn't make much sense to a user trying to understand why hashin
insists on it.
I get the feeling that all y'all people are saying that it would be good to NOT rely on the sorting of PyPI and instead always write down the hashes lexicographically sorted. Yes, it has the advantage that the files created a more predictable when compared as a whole.
I don't think the existing logic that determines whether it should amend the hashes should change but when it does it should sort the hashes before writing them.
Yes I think that's a good solution. To restate, if the set of hashes already in the file and the new set of hashes are equal, it's okay to leave the file as is (ignore order when comparing, detect as no change); but if the sets are unequal, what gets written should be lexicographically sorted.
Anybody willing to take a look at https://github.com/peterbe/hashin/pull/110 ?
The unit tests plus my extensive manual testing (in the PR description) makes me pretty confident already but, you know, more eyes etc.
This is now out in version 0.14.5.
Since version 0.14, the order of hashes does not seem to be deterministic. Before 0.14, the hashes used to be lexicographically ordered.
I am guessing this was changed in https://github.com/peterbe/hashin/pull/95
What is the reason for this?
We use hashin in electrum for deterministic builds. The requirements files are generated by this script. The
freeze_packages.sh
script is manually rerun to update dependencies (see e.g. https://github.com/spesmilo/electrum/commit/ba4af29bf82f7addb68c3ed996002665313ab223). If the order of hashes for a given package is not canonical, then rerunning the script can result in changes for packages that have not actually been updated, making the diff considerably less readable.