rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics
https://rapidfuzz.github.io/RapidFuzz/
MIT License
2.61k stars 116 forks source link

Update type annotations for extractOne and extract functions in proce… #374

Closed Vioshim closed 5 months ago

Vioshim commented 5 months ago

I've modified the order of the overloads taking into account that mappings are iterable, the library tends to fall into the first typing overload, and secondly, the typing of mapping is inconsistent with the current behavior as the methods uses dict.items if possible.

These changes should make the overloads properly reflect the expected behavior from the extract functions. This fixes https://github.com/rapidfuzz/RapidFuzz/issues/368

maxbachmann commented 5 months ago

I was not aware that it would go through overloads in the definition order. In that case it certainly makes more sense to define the more specific one first :+1:

In regards to the key type var I would prefer a different name. Currently they mean something like:

After your change _S2 is sometimes the second string type and sometimes the key type. I would prefer to:

Currently all of the TypeVar's unspecific, since I had trouble typing their restrictions properly. Same for the function intputs of scorer and processor. I find that things with dependent, pretty generic types can get really complex :sweat_smile:

These changes should make the overloads properly reflect the expected behavior from the extract functions. This fixes https://github.com/rapidfuzz/RapidFuzz/issues/368

It should probably update the hints in the docstrings of process_py.py to match the type hints as well.

maxbachmann commented 5 months ago

I didn't quite understand how the actual type behind a TypeVar is detected. E.g. for:

    results1: tuple[str, float, int] = process.extractOne("test", ["test"])
    results2: tuple[str, int, int] = process.extractOne("test", ["test"], scorer=Levenshtein.distance)
    results3: tuple[str, int, int] = process.extractOne("test", ["test"])
    results4: tuple[str, float, int] = process.extractOne("test", ["test"], scorer=Levenshtein.distance)

I would like result3 to lead to an error when validation the typing, while the others are valid. I completely failed to achieve this though.

Vioshim commented 5 months ago

I'm currently working on some changes in the PR, however I'll make sure to take a look into it ^^

Vioshim commented 5 months ago

My apologies for the amount of changes, but this implementation should solve the commented issues alongside TypeVar being determined by the implemented scorer

maxbachmann commented 5 months ago

Could we back out the renaming from _S1/_S2 to _StringType1/_StringType2 to a separate pull request? This would probably reduce the noise by quite a bit here and this change is guaranteed not to break things.

Same for the reordering of the Mapping/Sequence type hints. I already confirmed that one to be a strict improvement.

I am currently in the process of writing unit tests for the type hints which: 1) allows me to validate your changes to improve the type deductions 2) makes sure we don't silently regress on this in the future

maxbachmann commented 5 months ago

The scorer improvement to _StringType1 = TypeVar("_StringType1", bound=Sequence[Hashable]) instead of either Sequence[Hashable] or _StringType1 = TypeVar("_StringType1") was only done in some places so far. As far as I can see, this is working as intended and should be done for all scorers (so e.g. Levenshtein.pyi or fuzz.pyi should have the same change).

Deducing the type in cases that currently directly use Sequence[Hashable] should allow us to update the processor callback typing to Callable[[_StringType1 | _StringType2], Sequence[Hashable]] | None

At this point this should probably be moved to a commit / PR I can merge separately as well, since it's going to be a relatively big change making the same adjustment everywhere.

Edit: thinking about this a bit more it's actually more complex. The problem is that the algorithm itself requires a Sequence[Hashable]. This is after the preprocessing function is called. So if the user didn't provide a preprocessing function, it's true that both query and choices would need to be a Sequence[Hashable]. On the other hand if a processor function was passed, all that really matters is that processor is callable using either query or choice and that it returns a Sequence[Hashable]. This allows things like this:

class MyClass:
    def __init__(self) -> None:
         self.a: str = ""

Levenshtein.distance(MyClass(), MyClass(), processor=lambda x: x.a)

I think this could be achieved using

_StringType1 = TypeVar("_StringType1", bound=Sequence[Hashable])
_StringType2 = TypeVar("_StringType2", bound=Sequence[Hashable])

_UnprocessedType1 = TypeVar("_UnprocessedType1")
_UnprocessedType2 = TypeVar("_UnprocessedType2")

@overload
def distance(
    s1: _StringType1,
    s2: _StringType2,
    *,
    weights: tuple[int, int, int] | None = (1, 1, 1),
    processor: None = None,
    score_cutoff: int | None = None,
    score_hint: int | None = None,
) -> int: ...

@overload
def distance(
    s1: _UnprocessedTypeType1,
    s2: _UnprocessedTypeType2,
    *,
    weights: tuple[int, int, int] | None = (1, 1, 1),
    processor: Callable[[_UnprocessedTypeType1 | _UnprocessedTypeType2], Sequence[Hashable]],
    score_cutoff: int | None = None,
    score_hint: int | None = None,
) -> int: ...

Not sure how to mention this in the docstring so we keep this somewhat user readable though :sweat_smile:

maxbachmann commented 5 months ago

For internal files I am actually thinking about removing the typing, since I find that it mostly makes my life harder. In addition I find that leaving them out of the documentation is less cluttered:

image

vs

image

Since we describe the types in the docstring this is just duplicated info anyways, while in the docstrings we can actually explain things better. The external APIs are unaffected by this, since those provide type hints for everything visible to the user anyways.

Vioshim commented 5 months ago

https://github.com/rapidfuzz/RapidFuzz/pull/374#issuecomment-2049204850 should be fixed with the last commit

maxbachmann commented 5 months ago

https://github.com/rapidfuzz/RapidFuzz/pull/374#issuecomment-2049204850 should be fixed with the last commit

As far as I understand you now allow calling the function with any object unrelated to the whether a preprocessing functions is called.

Whats the overall state of this? I would like to:

There are probably still some other issues and I had some more thoughts about how the scorer function call could be typed at least partially. I would prefer doing this once we have merged some of the simple stuff first though.

Vioshim commented 5 months ago

Been trying to revert, just has been a bit difficult, I'll try to do the revert manually

maxbachmann commented 5 months ago

@Vioshim do you plan to contribute more improvements to the type hints? Otherwise I can make them as well.

Vioshim commented 5 months ago

Ye, I'm intending to, I'll first work on the rename of variables