pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License
9.39k stars 2.99k forks source link

People are distributing invalid wheels because pip is sloppy about checking them; pip should be pickier #3513

Open njsmith opened 8 years ago

njsmith commented 8 years ago

See:

https://github.com/techtonik/leanbook/blob/master/manuscript/07wheel-anatomy.md https://github.com/techtonik/RHVoice/blob/d08c7244c4fd47daa6ff9576f8c9ca4491c87d6e/src/nvda-synthDriver/PACKING.md

which recommend uploading wheels to pypi in which the METADATA and RECORD files are 0 bytes long.

techtonik commented 8 years ago
  1. It is not recommendation, but the way things work
  2. Do you have real arguments, because right now there are none

2nd point is really important, because for many people including me the necessity of METADATA and RECORD stuff is not obvious. Wheels work without this complication, so why do you need to add it?

pfmoore commented 8 years ago

+1 on pip rejecting wheels which don't contain the minimal metadata.

@techtonik For "real reasons", the spec for a wheel at https://www.python.org/dev/peps/pep-0427/#the-dist-info-directory states clearly that METADATA must contain data in the same format as PKG-INFO at the root of sdists (which in turn is specified by PEP 345), and that RECORD must contain a list of (almost) all files in the archive.

So wheels that don't do that are invalid, and should be rejected. Up until now, pip has trusted that wheels will be valid, simply because in practice only bdist_wheel was used to produce them. But if wheels are appearing in the wild that don't conform to spec, we should stop that sooner rather than later.

pfmoore commented 8 years ago

And just to clarify, the reasons for putting valid data in those files:

  1. Metadata - because that file gets installed, and contains important information such as who wrote the package.
  2. RECORD - to validate the contents of the wheel have not been corrupted. The spec actually notes that installers should check this, and pip (apparently) currently doesn't.
techtonik commented 8 years ago

@pfmoore without solid arguments it will be another bad solution that people like me will hate.

METADATA must contain data in the same format as PKG-INFO at the root of sdists (which in turn is specified by PEP 345)

Too many redirections create the same complexity problem that the wheel format was promised to avoid. must doesn't explain why wheel can't come with its own metadata that makes sense for wheels and don't have the burden of backward compatibility of PKG-INFO, leaving this burden to a wheel-distutils mapper inside of pip, which needs this mapper anyway. Maybe obsession with PKG-INFO is the reason why conda and alternatives exist.

, and that RECORD must contain a list of (almost) all files in the archive.

From the security point of view there is no "incomplete checksum" and "no checksum" is the same thing. .zip files are already protected from corruption by default, so what are you protecting from? Do you have real tests for that? Adding additional step for maintaining the internal checksum that will never be used because corrupted .zip already fails to unpack? I don't get it.

If my build system produced corrupted binary, my packaging toolchain will make a proper checksum for that. Doing proper RECORD complicates wheel packaging and I don't see for which sake.

METADATA should not be necessarily in PKG-INFO format and wheels should not distribute the pieces of target catalog. It is the target system which should decide which fileds should be present to accept the wheel. PyPI needs author fields, but local install doesn't. Some wheels have dozens of authors, but METADATA has only one. The author is therefore useless for my scenario, especially when I don't use email fields, but GitHub (or Debian) IDs, but PKG-INFO tells that author is single and it is required.

There are more arguments here https://github.com/pypa/wheel-builders/issues/1#issuecomment-188669658

So, to summarize - I think that the fact that wheels chosen not to implement old cliches is the right thing. It should make at least RECORD optional until there is a system that shows a value added behavior for this file that really benefits everyone in practice. I won't argue that METADATA is useful, but I think that omitting it is a feature. It could be beneficial for local development and for simplicity. PyPI or conda package sites may choose to ignore wheels with insufficient metadata, but chances are that this info is already there in some more human readable format and it is easier for everybody to set the package information through PyPI API directly than to use this strange format of ZIP-RPC.

RonnyPfannschmidt commented 8 years ago

there are solid arguments

wheels are supposed to be install-able by running unzip and then a compile-all module afterwards it is expected that there is valid meta-data for the files that came with the wheel

so making them optional breaks a promise, meta-data consistency should be enforced

pfmoore commented 8 years ago

@techtonik Feel free to hate it, but the wheel spec says what it does, and calling something that doesn't follow that spec a wheel is wrong. If you don't like the spec, you can propose an updated version by going through the normal PEP process.

So I remain a strong +1 on rejecting malformed wheels. If no-one else comes up with a PR for this, I'll produce one at some point.

techtonik commented 8 years ago

wheels are supposed to be install-able by running unzip

What means install-able in your case? wheel is a .zip file, which already has a list of files and a checksum. How the presence of RECORD file makes it more install-able? How the presence of METADATA makes it more install-able? Why pip chooses to ignore these files?

and then a compile-all module afterwards it is expected that there is valid meta-data for the files that came with the wheel

I can only see the bytecode compilation use case here, so the arguments must be related to it. expected is not an argument. What exactly is this valid meta-data that is needed for your case and why? If I understand it right, compile-all compiles bytecode. It might need a list of file, but why it needs checksum? It is also not clear why it is compulsory to byte-compile all files that come with wheel.

techtonik commented 8 years ago

@pfmoore it is your choice. You know that I hate that PEP bureacracy and won't follow it. The fact that the points raised here can not be answered by giving direct links to PEP just enforces my feelings. But your action will make job of tools developers more complicated that necessary. Why will it happen? As I see it, to get rid of the nuisance. This could be an interesting pattern for manipulation in case I really needed to make things worse. It directly depends on how much free time people have to discuss such issues, and the only way to get protected from it is to iteratively improve the information passing - direct links to paragraphs, improved wording, real world use cases/practices and diagrams.

I don't know how deep @njsmith is aware why various WHEEL requirements are existing, but my bet that he is not aware, so the only way for him and everybody to support this discussion is to believe that some smart guys are already thought about it and choose the side. https://en.wikipedia.org/wiki/Cargo_cult is not the right analogy, so maybe you can point me to a better pattern that can pass information about such behaviour?

RonnyPfannschmidt commented 8 years ago

@techtonik after a unzip into a random folder in pythonpath, the package should work and have the valid metadata for package discovery - this is python package metadata and not related to zip file metadata at all, zip files are merely used to compact and bundle, their own metadata is irrelevant

BTW: consistent tolling is favorable over a convenient but inconsistent mess, this is about packaging, doing sloppy engineering is a disservice for all users of the tools

techtonik commented 8 years ago

@RonnyPfannschmidt without a real instruction "package discovery from unpacked wheels" for tool writers it starts to sound too abstract. You say that unpacking should work, but how do you know that it does really feasible? pip doesn't do this for a reason, and my bet that this reason is valid for others too.

Maintaining RECORDS file is an additional hassle, so there should be not only a good reason to do so, but also a real example that it works and provides planned benefits (which are still not referenced). So, here is the new information - the primary use case for wheels is unpacking them manually into PYTHONPATH. Why it should not work for wheels without METADATA or RECORDS? python package can be discovered and imported by Python without problem, and if you need pip package then it can get all the required info from the PyPI.

So, what are you going to fix? Do you want to "fix" pip so that it should not unpack files that are not present in RECORD files? But when you unpack .zip yourself, all files are unpacked, regardless of presence in RECORD file, so what is the role for it? You want to "fix" pip to require METADATA with AUTHOR fields in them, but what purpose are you going to archive besides some mental consistency?

It looks like I am having a truck with wheels, but I can't ride EU roads, because my wheels are not branded. I can understand that my wheels should be certified if they can damage the road (digital signatures if you need security), but I don't understand why you insist on naming them to your liking.

RonnyPfannschmidt commented 8 years ago

that comparisation is unjustified and misfit

in the end not checking for correct wheels is a mistake in pip that should get fixed

if you don't like how wheels should look, then don't generate them or change the standard, anything else is very likely a disservice to users and the ecosystem

pfmoore commented 8 years ago

I propose the following checks:

  1. That the wheel has a METADATA file conforming to the PEP 314 spec. RFC822 format, with a subset of the mandatory keys present - Metadata-Version, Name, Version, Summary, Author-Email, License.
  2. That the wheel has a RECORD file with lines for METADATA, WHEEL and RECORD. I'd like to think that checking hashes is overkill. However, the page linked in the issue report seems to be recommending deliberately omitting required aspects of the spec if you can get away with it, so I'm forced to consider that checking a couple of hashes (METADATA and WHEEL) might be sufficient to make omitting hashes too hard to be worth bothering about.
  3. That all files in the wheel are mentioned in RECORD.

These checks are not comprehensive - the idea is to ensure people aren't uploading corrupt wheels, not to act as a full validation suite.

I'm honestly saddened that we need to do this. Deliberate violation of the wheel spec seems like the sort of bevaviour that should be fixable simply by letting the author of the invalid wheel know of their mistake. Having to defend against deliberate attempts to undermine the standardisation work going on is depressing.

dholth commented 8 years ago

IIRC the wheel unpack command will check everything. Shame on pip for not bothering. The hashes exist so that you could a) verify that nothing has changed after an installation, potentially warning the user that they made local edits b) digitally sign RECORD, and for example have a digital signature that was still valid if you decided to change the compression algorithm used in the zip file.

It is so trivial to generate the correct hashes I'm surprised it causes so much consternation. Write a tool for heavens sake. For example in the wheel tracker there's an issue to create a 'wheel pack' command that would archive a directory updating RECORD hashes.

dholth commented 8 years ago

If you don't include METADATA then packaging tools may have trouble realizing that your package is installed and at what version.

techtonik commented 8 years ago

that comparisation is unjustified and misfit

Please use arguments. Closing eyes looks like you don't want to deal with a problem that you solved for yourself. But if you can address the points, it may happen that it is me who is blind and don't see something.

techtonik commented 8 years ago

The hashes exist so that you could a) verify that nothing has changed after an installation, potentially warning the user that they made local edits

So it is pip who should create those hashes.

b) digitally sign RECORD, and for example have a digital signature that was still valid if you decided to change the compression algorithm used in the zip file.

So it is completely optional. Why require it for non-signed wheels?

techtonik commented 8 years ago

If you don't include METADATA then packaging tools may have trouble realizing that your package is installed and at what version.

But name and version is already present in wheel naming standard - https://www.python.org/dev/peps/pep-0427/#file-name-convention And when packaging tool installs wheels, it is it's own way how to record information about installed package - it may use SQLite, index files or OS-native database.

In fact, I wish I could just unpack wheels in Linux without all that METADATA stuff. For development it is important to build wheels for CI without stumbling upon obligatory details like version keeping hassle, which makes sense only for releases.

techtonik commented 8 years ago

It is so trivial to generate the correct hashes I'm surprised it causes so much consternation.

There are many such things around (like papers that are easy to fill), so a problem could be that some of these rituals are just don't make sense.

techtonik commented 8 years ago

Or that there are some people that don't see any meaning in those rituals.

pradyunsg commented 6 years ago

Ping?

Is everyone fine with moving forward with @pfmoore's suggested checks? If not, what do you want more/less?

RonnyPfannschmidt commented 6 years ago

i think so, the prior discussion made it clear that the python ecosystem needs to be protected from people that think its ok to breaks standards if it somehow works

pfmoore commented 6 years ago

I'm OK with adding the checks, but I will point out that it seems to be only the OP who is proposing to ignore the standards.

I think the important point here is #4705, which points out that pip isn't following the wheel spec in verifying that files are recorded in the RECORD file and hashes match. But I'm inclined to otherwise ignore the posturing in this issue about how it's possible to sneak non-standard wheels past pip.

If someone wants to implement the suggested checks, though, I'm OK with that. But I did say above that "I'd like to think that checking hashes is overkill". In fact the spec (https://www.python.org/dev/peps/pep-0427/#the-dist-info-directory item 8) says that installers should verify hashes, so I'm wrong on that. (The wording doesn't use normal standards "MUST/SHOULD" terminology, so it's not 100% clear whether checking is required or merely good practice, but it's clearly the intent of the spec that hashes get checked).

njsmith commented 6 years ago

Checking hashes is also just good general practice to catch accidental data corruption caused by bugs or hardware issues.

mhsmith commented 6 years ago

The wording doesn't use normal standards "MUST/SHOULD" terminology, so it's not 100% clear whether checking is required or merely good practice

The wording could have been clearer, but a little later in the PEP it says this:

A wheel installer is not required to understand digital signatures but MUST verify the hashes in RECORD against the extracted file contents.

And it gives a rationale:

When the installer checks file hashes against RECORD, a separate signature checker only needs to establish that RECORD matches the signature.

RonnyPfannschmidt commented 5 years ago

@stonebig those things are entirely orthogonal as of now

techtonik commented 5 years ago

i think so, the prior discussion made it clear that the python ecosystem needs to be protected from people that think its ok to breaks standards if it somehow works

I don't like ecosystem that doesn't follow common sense, so my final comment here is to unsubscribe. There are many things besides Python wheels out there that don't need any standard vs practicality debate.

If you want something modern for consistency - take a look at https://github.com/centrifuge/precise-proofs, build Python cloud with content addressing, repeatable builds and packages as catalogs.