Open tonyhammainen opened 1 month ago
Hi @tonyhammainen, we really do have a duplicate RecognizerResult
in both packages, as the two packages are currently independent. The solution would be to have a base package which both packages use, but it's not on our roadmap to add this, as there is no functional benefit. Having said that, I don't see why this would be a mypy issue, as those packages are independent and each accepts its own type of recognizer result.
on the ad_hoc_recognizers
object, I'm not sure, because as you said, PatternRecognizer
is a child of EntityRecognizer
Hi @omri374, thanks for the reply!
The intended pattern is to use both libraries sequentially, first getting recognized PII out of analyzer, and inputting that for the anonymizer to anonymize. Or have I misunderstood?
If so, this means that we are inputting RecognizerResult
of analyzer package to the anonymization function of anonymizer package, necessarily leading to mypy errors in the end-user app, if anonymizer package is only typed for its own version of RecognizerResults
, which de-facto is something that nobody ever uses?
Tbh, I don't see why you are not including analyzer as a dependency of anonymizer, given the main usage pattern for anonymizer relies on analyzer?
If I have misunderstood how anonymizer package should be used do enlighten me!
I can confirm the issue with pyright (pylance) by using the first example from the quickstart: https://microsoft.github.io/presidio/getting_started/
If the libraries MUST be independent, then the conversion between the two classes should be done by the user. For example this way:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, RecognizerResult
text = "My phone number is 212-555-5555"
# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()
# Call analyzer to get results
results = analyzer.analyze(text=text, entities=["PHONE_NUMBER"], language="en")
print(results)
# Convert recognizer results to RecognizerResult objects from presidio_anonymizer
results = [
RecognizerResult(
entity_type=result.entity_type,
start=result.start,
end=result.end,
score=result.score,
)
for result in results
]
# Analyzer results are passed to the AnonymizerEngine for anonymization
anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text)
Over time, the difference between RecognizerResult
and RecognizerResult
will grow, so the above solution provides a somewhat typed-checked way to keep our sanity. In fact, they are already different: the analyzer's RecognizerResult
has extra information like recognition_metadata
and analysis_explanation
.
Another approach might be to refactor the anonymizer so that it depends on the protocol of the PIIEntity
(right now it's an abstract class) instead of the concrete RecognizerResult
implementation. This will keep the analyzer and anonymizer independent, but provide a bridge between two RecognizerResult
classes. If it makes sense to you @omri374 and you see that it's doable - I could give it a try.
Additionally, I'd suggest deprecating one of the RecognizerResult
(probably the one on the anonymizer side) and creating a different class name, just to avoid user confusion.
Hi @eicca, thanks for your suggestions. We should also think of backward compatibilty here. In my view, the optimal solution is to have a shared library. One option is to have the anonymizer as a dependency of the analyzer. The second is to create a presidio-core library which is used by both.
Hi @eicca, thanks for your suggestions. We should also think of backward compatibilty here. In my view, the optimal solution is to have a shared library. The second is to create a presidio-core library which is used by both.
Thanks for your reply @omri374!
One option is to have the anonymizer as a dependency of the analyzer.
This totally makes sense to me, especially since it won't bring a lot of unnecessary dependencies for people who want to use only analyzer (in fact, just pycryptodome
, which is most probably already required by other dependencies). I can try to prepare a PR if you're open to it.
The second is to create a presidio-core library which is used by both.
This also makes sense. This can also be done at a later time after the previous step is done.
Thanks! A PR adding the anonymizer to the analyzer sounds great.
Describe the bug As of
2.2.33
I was not getting any type-related errors, but upgrading to the latest2.2.354
resulted in getting them.error: Call to untyped function "AnonymizerEngine" in typed context [no-untyped-call]
^ Initializing the AnonymizerEngine throws an error. I believe it is because the class has no typed init function
error: Argument "ad_hoc_recognizers" to "analyze" of "AnalyzerEngine" has incompatible type "list[PatternRecognizer]"; expected "list[EntityRecognizer] | None" [arg-type]
^ I do not understand why mypy is not picking the fact thatPatternRecognizer
is a child class ofEntityRecognizer
Argument "analyzer_results" to "anonymize" of "AnonymizerEngine" has incompatible type "list[presidio_analyzer.recognizer_result.RecognizerResult]"; expected "list[presidio_anonymizer.entities.engine.recognizer_result.RecognizerResult]" [arg-type]
^ I believe the above is a result of theRecognizerResult
class inpresidio-anonymizer
being a copy of the class inpresidio-analyzer
, and mypy doesn't understand their equivalencyTo Reproduce Steps to reproduce the behavior:
AnonymizerEngine()
Expected behavior No mypy errors when using the library as expected