pypa / auditwheel

Auditing and relabeling cross-distribution Linux wheels.
Other
443 stars 145 forks source link

Grafted libraries can end up with the same hash but different contents #389

Open lkollar opened 2 years ago

lkollar commented 2 years ago

The numpy==1.18.4 and scipy==1.4.1 libraries both contain the libopenblasp-r0-34a18dc3.3.7.so file, as they most likely used the same version when invoking auditwheel. The hashes match, but the files differ in contents:

$ cmp numpy.libs/libopenblasp-r0-34a18dc3.3.7.so scipy/.libs/libopenblasp-r0-34a18dc3.3.7.so
numpy.libs/libopenblasp-r0-34a18dc3.3.7.so scipy/.libs/libopenblasp-r0-34a18dc3.3.7.so differ: byte 57, line 1

Most likely this is happening because we compute the hash before performing the patching on the files and using a different version of patchelf can produce slightly differenet .so files:

https://github.com/pypa/auditwheel/blob/ef280f27759ceaf9acd84e606ef4bfc73d5b0a64/src/auditwheel/repair.py#L136-L150

Given that the resulting files are practically identical, they should not cause any issues, even if for some reason the loader decides to reuse the cached version, but I think we should generate the hash after modifying the file.

Another side effect of generating the hash before grafting is that it is impossible to validate the hash, if one wanted to make sure no modifications have been done to the grafted library.

lkollar commented 2 years ago

Another side effect of generating the hash before grafting is that it is impossible to validate the hash, if one wanted to make sure no modifications have been done to the grafted library.

This is actually causes an issue in the code snippet I pasted above: the code tries to avoid patching an already patched library, but it does so based on the generated hash. The else on line 143 will never be reached because the hash is computed before we make the modification. The next time auditwheel runs, the hash will be different, since we've modified the binary.

As we now use the hash as the SONAME, which is embedded in the file, there is no way to perform the hashing after the modification. If we want to avoid modifying a library we've already patched before, we could add some information in the NOTE section maybe. But given that it doesn't seem that this feature works properly, maybe it's not worth fixing it and we could just switch to different way to make the SONAME unique.