microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License
3.36k stars 525 forks source link

Recent changes broke types #1396

Open tonyhammainen opened 1 month ago

tonyhammainen commented 1 month ago

Describe the bug As of 2.2.33 I was not getting any type-related errors, but upgrading to the latest 2.2.354 resulted in getting them.

error: Call to untyped function "AnonymizerEngine" in typed context [no-untyped-call]

^ Initializing the AnonymizerEngine throws an error. I believe it is because the class has no typed init function

error: Argument "ad_hoc_recognizers" to "analyze" of "AnalyzerEngine" has incompatible type "list[PatternRecognizer]"; expected "list[EntityRecognizer] | None" [arg-type] ^ I do not understand why mypy is not picking the fact that PatternRecognizer is a child class of EntityRecognizer

Argument "analyzer_results" to "anonymize" of "AnonymizerEngine" has incompatible type "list[presidio_analyzer.recognizer_result.RecognizerResult]"; expected "list[presidio_anonymizer.entities.engine.recognizer_result.RecognizerResult]" [arg-type] ^ I believe the above is a result of the RecognizerResult class in presidio-anonymizer being a copy of the class in presidio-analyzer, and mypy doesn't understand their equivalency

To Reproduce Steps to reproduce the behavior:

  1. pip install analyzer, anonymizer == 2.2.354
  2. In project, include:
    • instantiation of a AnonymizerEngine()
    • call AnonymizerEngine.anonymize(ad_hoc_recognizers = <list[PatternRecognizer]>)
    • call AnonymizerEngine.anonymize(analyzer_results= <list[presidio_analyzer.recognizer_result.RecognizerResult]>)
  3. Run mypy on the project
  4. See error

Expected behavior No mypy errors when using the library as expected

omri374 commented 4 weeks ago

Hi @tonyhammainen, we really do have a duplicate RecognizerResult in both packages, as the two packages are currently independent. The solution would be to have a base package which both packages use, but it's not on our roadmap to add this, as there is no functional benefit. Having said that, I don't see why this would be a mypy issue, as those packages are independent and each accepts its own type of recognizer result.

on the ad_hoc_recognizers object, I'm not sure, because as you said, PatternRecognizer is a child of EntityRecognizer

tonyhammainen commented 2 weeks ago

Hi @omri374, thanks for the reply!

The intended pattern is to use both libraries sequentially, first getting recognized PII out of analyzer, and inputting that for the anonymizer to anonymize. Or have I misunderstood?

If so, this means that we are inputting RecognizerResult of analyzer package to the anonymization function of anonymizer package, necessarily leading to mypy errors in the end-user app, if anonymizer package is only typed for its own version of RecognizerResults, which de-facto is something that nobody ever uses?

Tbh, I don't see why you are not including analyzer as a dependency of anonymizer, given the main usage pattern for anonymizer relies on analyzer?

If I have misunderstood how anonymizer package should be used do enlighten me!

eicca commented 4 days ago

I can confirm the issue with pyright (pylance) by using the first example from the quickstart: https://microsoft.github.io/presidio/getting_started/

If the libraries MUST be independent, then the conversion between the two classes should be done by the user. For example this way:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, RecognizerResult

text = "My phone number is 212-555-5555"

# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text=text, entities=["PHONE_NUMBER"], language="en")
print(results)

# Convert recognizer results to RecognizerResult objects from presidio_anonymizer
results = [
    RecognizerResult(
        entity_type=result.entity_type,
        start=result.start,
        end=result.end,
        score=result.score,
    )
    for result in results
]

# Analyzer results are passed to the AnonymizerEngine for anonymization

anonymizer = AnonymizerEngine()

anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized_text)

Over time, the difference between RecognizerResult and RecognizerResult will grow, so the above solution provides a somewhat typed-checked way to keep our sanity. In fact, they are already different: the analyzer's RecognizerResult has extra information like recognition_metadata and analysis_explanation.

Another approach might be to refactor the anonymizer so that it depends on the protocol of the PIIEntity (right now it's an abstract class) instead of the concrete RecognizerResult implementation. This will keep the analyzer and anonymizer independent, but provide a bridge between two RecognizerResult classes. If it makes sense to you @omri374 and you see that it's doable - I could give it a try.

Additionally, I'd suggest deprecating one of the RecognizerResult (probably the one on the anonymizer side) and creating a different class name, just to avoid user confusion.

omri374 commented 4 days ago

Hi @eicca, thanks for your suggestions. We should also think of backward compatibilty here. In my view, the optimal solution is to have a shared library. One option is to have the anonymizer as a dependency of the analyzer. The second is to create a presidio-core library which is used by both.

eicca commented 3 days ago

Hi @eicca, thanks for your suggestions. We should also think of backward compatibilty here. In my view, the optimal solution is to have a shared library. The second is to create a presidio-core library which is used by both.

Thanks for your reply @omri374!

One option is to have the anonymizer as a dependency of the analyzer.

This totally makes sense to me, especially since it won't bring a lot of unnecessary dependencies for people who want to use only analyzer (in fact, just pycryptodome, which is most probably already required by other dependencies). I can try to prepare a PR if you're open to it.

The second is to create a presidio-core library which is used by both.

This also makes sense. This can also be done at a later time after the previous step is done.

omri374 commented 2 days ago

Thanks! A PR adding the anonymizer to the analyzer sounds great.