Any candidate "validation" should happen in the provider, before providing the candidate to the resolver, within `find_matches`.
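A minimal sketch of what that looks like in a provider, assuming resolvelib's three-argument `find_matches` signature and two hypothetical helpers, `_iter_candidates` (lists the candidates known for an identifier) and `_is_valid` (whatever expensive validation the provider needs):

```python
def find_matches(self, identifier, requirements, incompatibilities):
    # The mapping values are one-shot iterators, so materialize them once.
    reqs = list(requirements[identifier])
    bad_versions = {c.version for c in incompatibilities[identifier]}
    return [
        candidate
        for candidate in self._iter_candidates(identifier)
        if candidate.version not in bad_versions
        and all(self.is_satisfied_by(r, candidate) for r in reqs)
        and self._is_valid(candidate)  # expensive, runs for every match
    ]
```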
Right, I based my code on the example, which is not very efficient. That is not a problem in the example itself, but it becomes one here, where I introduce expensive operations.

The solution was pretty straightforward: make sure `find_matches` returns an iterator callback, so that the resolver can consume candidates as it goes, and make sure `get_preference` does not wastefully force the consumption of candidates.

The code is at https://github.com/FFY00/python-resolver FYI.
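Roughly, the shape of that fix might look like this (a sketch, not the actual python-resolver code; `_iter_candidates` is a made-up helper, the `get_preference` signature is the four-argument one, and the count-based ordering is just one possible heuristic):

```python
def find_matches(self, identifier, requirements, incompatibilities):
    reqs = list(requirements[identifier])

    def iterator():
        # Candidates are produced lazily; the expensive per-candidate
        # work only happens as the resolver pulls from this generator.
        for candidate in self._iter_candidates(identifier):
            if all(self.is_satisfied_by(r, candidate) for r in reqs):
                yield candidate

    # resolvelib also accepts a callable here, keeping the matches lazy.
    return iterator

def get_preference(self, identifier, resolutions, candidates, information):
    # Deliberately avoid touching candidates[identifier]: counting the
    # requirement information is enough for an ordering heuristic and
    # does not force candidate generation.
    return -sum(1 for _ in information[identifier])
```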
I did consider a validation step, but the issue is that most usages wouldn't need that step, and it would be "dead weight" for implementations that do things properly. But maybe we could have something like `Resolver(dev=True)` to enable a "dev mode" with additional checks on whether the provider is doing sane things. (It might not solve your problem though, since you were consuming the candidates in `get_preference`, which is out of the resolver's control.)
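Purely hypothetical, but such a dev-mode check could be as simple as the resolver re-validating whatever `find_matches` hands back:

```python
def _checked(candidates, requirements, provider):
    # Hypothetical "dev mode" helper: re-verify each candidate coming
    # out of find_matches and fail loudly if the provider misbehaves.
    for candidate in candidates:
        if not all(provider.is_satisfied_by(r, candidate) for r in requirements):
            raise RuntimeError(f"invalid candidate from provider: {candidate!r}")
        yield candidate
```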
Well, I don't think it should introduce that much overhead if the implementation is just a `return True`, which is what would match the current behavior.
The issue here is that consuming candidates in `get_preference` triggers expensive operations, even though I don't really need that information there. The current design makes returning a generator callback pretty much useless if I am actually doing something in `get_preference`, because to do anything meaningful there, I have to consume the candidates.
If we add candidate validation, I could consume everything in `get_preference`, have the resolver pick a candidate, and only then verify whether it is valid or the resolver should proceed to the next pick, instead of forcing this verification to happen for all possible candidates.

This would essentially mean that the result of `find_matches` becomes possible candidates, not necessarily valid candidates, so that we can postpone the validation step and only perform it when we actually need to.
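Concretely, the provider side of that split could look something like this (`is_valid_candidate` and the helpers are hypothetical names):

```python
def find_matches(self, identifier, requirements, incompatibilities):
    def iterator():
        # Cheap: yield *possible* candidates straight from the index
        # listing, with no downloads.
        yield from self._iter_possible_candidates(identifier)
    return iterator

def is_valid_candidate(self, candidate):
    # Expensive: download the archive and inspect its metadata. Under
    # this proposal it only runs for a candidate the resolver picks.
    return self._check_metadata(candidate)
```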
> The current design makes returning a generator callback pretty much useless if I am actually doing something in `get_preference`, because to do something meaningful there, I will have to consume the candidates.
Then don't return a generator function? You can return a sequence instead, and it will be used as-is without overhead. The generator-function return type is for implementations that can't do anything meaningful to the candidate list, which is (you guessed it) pip. If you don't have that problem, you don't need to constrain yourself.
Yes, but I want the candidates to be lazily generated. Determining whether a candidate is actually valid is expensive in this use case, so I don't want to do that unless I absolutely have to.

In the current API, I am required to return only valid candidates, which means the expensive operation has to happen for every candidate I return. Using a generator lets me delay that operation until the resolver decides to consume the candidate. This approach is fine if you don't use `get_preference`; if you do, you'll consume the candidates anyway.
To give a more concrete example, I have the requirement `build[virtualenv]`. For a wheel to be a valid candidate, it needs to provide the `virtualenv` extra, which is something I can only know by downloading the wheel and looking at `METADATA`.
Right now, for each candidate that comes out of `find_matches`, I need to download the wheel and make sure it provides the `virtualenv` extra, because every candidate I return needs to be valid. I would like to be able to return all `build` wheels as candidates, and only verify that a wheel provides the `virtualenv` extra if the resolver decides to pick it.
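For reference, once the wheel is on disk the check itself is simple; it is the download that hurts. A rough sketch using only the standard library (`wheel_provides_extra` is a made-up helper name):

```python
import zipfile
from email.parser import Parser

def wheel_provides_extra(wheel_path, extra):
    # METADATA is an RFC 822-style file; extras appear as repeated
    # Provides-Extra fields.
    with zipfile.ZipFile(wheel_path) as wheel:
        name = next(
            n for n in wheel.namelist() if n.endswith(".dist-info/METADATA")
        )
        metadata = Parser().parsestr(wheel.read(name).decode("utf-8"))
    return extra in (metadata.get_all("Provides-Extra") or [])
```

For example, `wheel_provides_extra("build-0.1.0-py3-none-any.whl", "virtualenv")` would answer the question above, but only after the wheel has already been fetched.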
Does that make sense?
Would #63 provide the same mechanism? Basically an additional hook, called as each candidate comes out of the iterable, to ask "does this one work" and discard it when it does not.
I don't know if `is_satisfied_by` is used anywhere else where it could cause issues, but if it isn't, then yes, #63 should resolve this.
`is_satisfied_by` is currently called after `get_dependencies`, so what's required is basically swapping the two. Again, it would make the resolver slightly slower, since the hook will be called more often than it currently is, but hopefully I can make it not too bad with slight tweaks.
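With that swap, the expensive check could move into `is_satisfied_by`, where it runs per examined candidate rather than for the whole list up front. A sketch; the requirement and candidate attributes here (shaped like `packaging`'s `Requirement`) and the `_provides_extra` helper are assumptions:

```python
def is_satisfied_by(self, requirement, candidate):
    # Cheap checks first.
    if candidate.name != requirement.name:
        return False
    if not requirement.specifier.contains(candidate.version):
        return False
    # Expensive check last: this is where the wheel download and the
    # METADATA inspection would happen (hypothetical helper).
    return all(self._provides_extra(candidate, e) for e in requirement.extras)
```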
It's your call :stuck_out_tongue:. You can do that, or introduce a new callback just to validate the candidate after picking. I am happy with either one.
I am implementing a resolver for Python packages, which needs to support extras. Unfortunately, verifying that a candidate archive provides a specified extra is expensive, as I need to download the archive and possibly build a wheel, and that verification is required to make sure the candidate is valid.

The thing is, the resolver will only choose one version and try to work with it, only moving on to the next one if there are incompatibilities.

Would it make sense to add an `is_valid_candidate` method, or something similar, that the resolver would call before trying to work with a candidate? For backwards compatibility, we could provide a default implementation that always validates.

This issue in particular should be somewhat mitigated by https://www.python.org/dev/peps/pep-0658/, but that is still a draft, and there would still be cases where the metadata is dynamic, in which case I would have to download the sdist and build the wheels. I expect some users with other use cases to be in a similar situation.
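In other words, the proposal amounts to adding something like this to the provider interface (hypothetical; `is_valid_candidate` does not exist in resolvelib):

```python
class AbstractProvider:
    def is_valid_candidate(self, candidate):
        # Backwards-compatible default: treat every candidate as valid,
        # matching the current behavior. Providers with expensive checks
        # override this, and the resolver calls it right after picking
        # a candidate, before working with it.
        return True
```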