Closed Vioshim closed 5 months ago
I was not aware that it would go through overloads in the definition order. In that case it certainly makes more sense to define the more specific one first :+1:
In regards to the key type var I would prefer a different name. Currently they mean something like:
_S1
-> string type 1_S2
-> string type 2After your change _S2
is sometimes the second string type and sometimes the key type. I would prefer to:
_S2
everywhere as the second string type (would be fine to use something like _StringType1
and _StringType2
instead to make this more clear)_KeyType
for the keyCurrently all of the TypeVar
's unspecific, since I had trouble typing their restrictions properly. Same for the function intputs of scorer and processor. I find that things with dependent, pretty generic types can get really complex :sweat_smile:
These changes should make the overloads properly reflect the expected behavior from the extract functions. This fixes https://github.com/rapidfuzz/RapidFuzz/issues/368
It should probably update the hints in the docstrings of process_py.py
to match the type hints as well.
I didn't quite understand how the actual type behind a TypeVar is detected. E.g. for:
results1: tuple[str, float, int] = process.extractOne("test", ["test"])
results2: tuple[str, int, int] = process.extractOne("test", ["test"], scorer=Levenshtein.distance)
results3: tuple[str, int, int] = process.extractOne("test", ["test"])
results4: tuple[str, float, int] = process.extractOne("test", ["test"], scorer=Levenshtein.distance)
I would like result3 to lead to an error when validation the typing, while the others are valid. I completely failed to achieve this though.
I'm currently working on some changes in the PR, however I'll make sure to take a look into it ^^
My apologies for the amount of changes, but this implementation should solve the commented issues alongside TypeVar being determined by the implemented scorer
Could we back out the renaming from _S1
/_S2
to _StringType1
/_StringType2
to a separate pull request? This would probably reduce the noise by quite a bit here and this change is guaranteed not to break things.
Same for the reordering of the Mapping/Sequence type hints. I already confirmed that one to be a strict improvement.
I am currently in the process of writing unit tests for the type hints which: 1) allows me to validate your changes to improve the type deductions 2) makes sure we don't silently regress on this in the future
The scorer improvement to _StringType1 = TypeVar("_StringType1", bound=Sequence[Hashable])
instead of either Sequence[Hashable]
or _StringType1 = TypeVar("_StringType1")
was only done in some places so far. As far as I can see, this is working as intended and should be done for all scorers (so e.g. Levenshtein.pyi
or fuzz.pyi
should have the same change).
Deducing the type in cases that currently directly use Sequence[Hashable]
should allow us to update the processor callback typing to Callable[[_StringType1 | _StringType2], Sequence[Hashable]] | None
At this point this should probably be moved to a commit / PR I can merge separately as well, since it's going to be a relatively big change making the same adjustment everywhere.
Edit: thinking about this a bit more it's actually more complex. The problem is that the algorithm itself requires a Sequence[Hashable]
. This is after the preprocessing function is called. So if the user didn't provide a preprocessing function, it's true that both query and choices would need to be a Sequence[Hashable]
. On the other hand if a processor
function was passed, all that really matters is that processor is callable using either query or choice and that it returns a Sequence[Hashable]
. This allows things like this:
class MyClass:
def __init__(self) -> None:
self.a: str = ""
Levenshtein.distance(MyClass(), MyClass(), processor=lambda x: x.a)
I think this could be achieved using
_StringType1 = TypeVar("_StringType1", bound=Sequence[Hashable])
_StringType2 = TypeVar("_StringType2", bound=Sequence[Hashable])
_UnprocessedType1 = TypeVar("_UnprocessedType1")
_UnprocessedType2 = TypeVar("_UnprocessedType2")
@overload
def distance(
s1: _StringType1,
s2: _StringType2,
*,
weights: tuple[int, int, int] | None = (1, 1, 1),
processor: None = None,
score_cutoff: int | None = None,
score_hint: int | None = None,
) -> int: ...
@overload
def distance(
s1: _UnprocessedTypeType1,
s2: _UnprocessedTypeType2,
*,
weights: tuple[int, int, int] | None = (1, 1, 1),
processor: Callable[[_UnprocessedTypeType1 | _UnprocessedTypeType2], Sequence[Hashable]],
score_cutoff: int | None = None,
score_hint: int | None = None,
) -> int: ...
Not sure how to mention this in the docstring so we keep this somewhat user readable though :sweat_smile:
For internal files I am actually thinking about removing the typing, since I find that it mostly makes my life harder. In addition I find that leaving them out of the documentation is less cluttered:
vs
Since we describe the types in the docstring this is just duplicated info anyways, while in the docstrings we can actually explain things better. The external APIs are unaffected by this, since those provide type hints for everything visible to the user anyways.
https://github.com/rapidfuzz/RapidFuzz/pull/374#issuecomment-2049204850 should be fixed with the last commit
https://github.com/rapidfuzz/RapidFuzz/pull/374#issuecomment-2049204850 should be fixed with the last commit
As far as I understand you now allow calling the function with any object unrelated to the whether a preprocessing functions is called.
Whats the overall state of this? I would like to:
*_py.py
There are probably still some other issues and I had some more thoughts about how the scorer function call could be typed at least partially. I would prefer doing this once we have merged some of the simple stuff first though.
Been trying to revert, just has been a bit difficult, I'll try to do the revert manually
@Vioshim do you plan to contribute more improvements to the type hints? Otherwise I can make them as well.
Ye, I'm intending to, I'll first work on the rename of variables
I've modified the order of the overloads taking into account that mappings are iterable, the library tends to fall into the first typing overload, and secondly, the typing of mapping is inconsistent with the current behavior as the methods uses dict.items if possible.
These changes should make the overloads properly reflect the expected behavior from the extract functions. This fixes https://github.com/rapidfuzz/RapidFuzz/issues/368