`find_matches` is underspecified - duplicate candidates

pfmoore commented 4 years ago

The specification of the provider's find_matches method doesn't include any information about whether candidates need to be "unique". To give an example, consider two requirements, pip >= 19.0 and pip >= 20.0. The candidate pip-20.0-py3-none-any.whl satisfies both of these.

When a client implements find_matches on a provider, is it necessary that the same candidate is returned in both calls, or is it enough that "equivalent" candidates are returned? (To be honest, I'm not even clear what it means to be the "same" candidate here - is object identity enough?)
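
To make the question concrete, here is a minimal sketch of a naive provider-style function. Every name in it (`Requirement`, `Candidate`, `AVAILABLE`, and this `find_matches` signature) is a hypothetical stand-in, not resolvelib's or pip's actual API. It only shows that two calls can return candidates that are "equivalent" but are not the same object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    # Hypothetical requirement: "name >= min_version".
    name: str
    min_version: tuple


class Candidate:
    # Deliberately defines no __eq__, so comparison falls back to identity.
    def __init__(self, name, version):
        self.name = name
        self.version = version


# Hypothetical index of available files, as (name, version) pairs.
AVAILABLE = [("pip", (19, 0)), ("pip", (20, 0))]


def find_matches(requirement):
    # Naive version: builds a fresh Candidate object on every call.
    return [
        Candidate(name, version)
        for name, version in AVAILABLE
        if name == requirement.name and version >= requirement.min_version
    ]


matches_19 = find_matches(Requirement("pip", (19, 0)))
matches_20 = find_matches(Requirement("pip", (20, 0)))

pip20_a = next(c for c in matches_19 if c.version == (20, 0))
pip20_b = matches_20[0]

# Same name and version, but two distinct objects.
same_object = pip20_a is pip20_b
equivalent = (pip20_a.name, pip20_a.version) == (pip20_b.name, pip20_b.version)
```

Here `same_object` is false while `equivalent` is true, which is exactly the gap the specification does not address.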

Reasons this matters:

- If methods on the candidate object are expensive to calculate (for pip, identify could involve building the project to get the project name), we want to avoid doing this multiple times if it's not needed.
- If different candidate objects are returned, will resolvelib (potentially) consider both of them, duplicating work (again, particularly expensive if we need to build projects as part of calling methods on the candidate)? Or will "equivalent" candidates get merged?

I can look at the existing code to determine how things work, but this should be documented so that the implementation isn't constrained to keep internal details the same because clients rely on them.

uranusjr commented 4 years ago

> is it necessary that the same candidate is returned in both calls, or is it enough that "equivalent" candidates are returned?

The resolver does not compare candidates with each other, for exactly the reason you raised: it does not (cannot) assume this is a sensible thing to do. So yes, it will result in duplicated work if equivalent (whatever this means) candidates are returned by find_matches().

I would be inclined to treat this as an optimisation problem; we can be conservative right now and return some equivalent candidates if we’re not sure, and slowly figure out how to eliminate them. I also feel this would not be a very big problem in practice for pip, since PackageFinder already eliminates a lot of the duplicates. The only sources of duplication would be direct URLs and local source dirs, neither of which is used very much currently AFAICT, since the current legacy resolver does not handle them very well.

pradyunsg commented 4 years ago

One (nice?) thing about the separation of concerns in this API design is that the optimization can/should happen on the Provider side, which is best positioned to correctly identify and cache "equivalent" candidates.
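
As a rough illustration of provider-side caching, the sketch below keys candidates on `(name, version)` so that repeated lookups hand back the same object. The class `CachingProvider`, its `find_matches(name, min_version)` signature, and the cache key are all assumptions made for this example; a real provider would have to decide for itself what counts as "equivalent".

```python
class Candidate:
    def __init__(self, name, version):
        self.name = name
        self.version = version


class CachingProvider:
    """Hypothetical provider that reuses Candidate objects across calls."""

    def __init__(self, index):
        self._index = index          # list of (name, version) pairs
        self._cache = {}             # (name, version) -> Candidate
        self.candidates_created = 0  # stands in for "expensive work done"

    def _get_candidate(self, name, version):
        key = (name, version)
        if key not in self._cache:
            # Only reached the first time a given (name, version) is seen.
            self.candidates_created += 1
            self._cache[key] = Candidate(name, version)
        return self._cache[key]

    def find_matches(self, name, min_version):
        return [
            self._get_candidate(n, v)
            for n, v in self._index
            if n == name and v >= min_version
        ]


provider = CachingProvider([("pip", (19, 0)), ("pip", (20, 0))])
first = provider.find_matches("pip", (19, 0))
second = provider.find_matches("pip", (20, 0))

# The pip 20.0 candidate is the very same object in both results.
shared = first[-1] is second[0]
```

With this arrangement any expensive per-candidate work can be attached to the cached object and paid for once, regardless of how the resolver treats duplicates.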

pfmoore commented 4 years ago

Cool, I'm happy with that. But just to be clear, if I follow the logic in the code:

- The first requirement with a given identify() value (the requirement's "name") has find_matches() called for it.
- Subsequent requirements are merged - we never even call find_matches() (except maybe if we backtrack; I haven't checked that code yet).

So the question of "multiple copies of the same candidate" never even crops up in the resolution code.
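
The behaviour described above can be modelled roughly as follows. This is a toy reading of the two steps in this comment, not resolvelib's actual code: `collect_candidates`, `CountingProvider`, and the tuple-based candidates are made up for the sketch, and backtracking is ignored entirely.

```python
from collections import namedtuple

# Hypothetical requirement: "name >= min_version".
Requirement = namedtuple("Requirement", ["name", "min_version"])


class CountingProvider:
    """Toy provider that records how often find_matches is called."""

    def __init__(self, index):
        self.index = index  # list of (name, version) candidates
        self.find_matches_calls = 0

    def identify(self, requirement):
        return requirement.name

    def find_matches(self, requirement):
        self.find_matches_calls += 1
        return [
            (n, v) for n, v in self.index
            if n == requirement.name and v >= requirement.min_version
        ]

    def is_satisfied_by(self, requirement, candidate):
        name, version = candidate
        return name == requirement.name and version >= requirement.min_version


def collect_candidates(provider, requirements):
    criteria = {}
    for req in requirements:
        key = provider.identify(req)
        if key not in criteria:
            # First requirement for this identifier: ask the provider.
            criteria[key] = provider.find_matches(req)
        else:
            # Later requirements only filter the existing candidate list.
            criteria[key] = [
                c for c in criteria[key] if provider.is_satisfied_by(req, c)
            ]
    return criteria


provider = CountingProvider([("pip", (19, 0)), ("pip", (20, 0))])
criteria = collect_candidates(
    provider,
    [Requirement("pip", (19, 0)), Requirement("pip", (20, 0))],
)
```

In this model `find_matches` runs once for `pip`, and the second requirement simply narrows the list, so two copies of the same candidate never meet.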

IMO, at some point this should be added to the docs, as a clarification. But for now I'm happy to simply have this issue as a reference.

It's easy to lose track of this when writing Requirement and Candidate objects that have the provider methods delegated to them (like the pip prototype does at the moment). I'm wondering whether it was a mistake to do that. Cue rewrite number 20 of the pip integration code 😉
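
For readers unfamiliar with the pattern, "delegating" here means something like the sketch below, where the provider forwards each call to the object it was handed. The classes and method bodies are hypothetical and much simpler than the pip prototype; they are only meant to show why it becomes hard to see which provider method is actually being exercised.

```python
class Candidate:
    def __init__(self, name, version):
        self.name = name
        self.version = version


class Requirement:
    def __init__(self, name, min_version, index):
        self.name = name
        self.min_version = min_version
        self._index = index  # list of (name, version) pairs

    def is_satisfied_by(self, candidate):
        return (
            candidate.name == self.name
            and candidate.version >= self.min_version
        )

    def find_matches(self):
        # The matching logic lives on the requirement, not the provider.
        candidates = (Candidate(n, v) for n, v in self._index)
        return [c for c in candidates if self.is_satisfied_by(c)]


class DelegatingProvider:
    """Hypothetical provider whose methods just forward to the objects."""

    def identify(self, requirement_or_candidate):
        return requirement_or_candidate.name

    def find_matches(self, requirement):
        return requirement.find_matches()

    def is_satisfied_by(self, requirement, candidate):
        return requirement.is_satisfied_by(candidate)


index = [("pip", (19, 0)), ("pip", (20, 0))]
provider = DelegatingProvider()
req = Requirement("pip", (20, 0), index)
matches = provider.find_matches(req)
```

Because `Requirement.find_matches` builds new `Candidate` objects on each call, any caching would have to be threaded through the requirement objects rather than kept in one place on the provider.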

pradyunsg commented 4 years ago

I'm honestly a little concerned about the delegating that we're doing in our implementation, since it feels like it means more refactoring work later to clean up responsibilities. But, yeah, it's not a major concern, more of a back-of-the-head thought atm.

sarugaku / resolvelib
