microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

OverflowError in crypto_recognizer #1376

Closed udayan14 closed 6 months ago

udayan14 commented 6 months ago

Describe the bug

On certain text inputs, the analyze method raises an OverflowError.

To Reproduce

Here's the environment setup:

Sample script that raises the exception when run:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

default_lang = "en_simple"
supported_languages = ["en_simple"]
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "en_simple", "model_name": "en_core_web_sm"},
    ],
}
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine, supported_languages=supported_languages
)

analyzer.analyze(
    text='{"awsAccountId":"327878933619","digestStartTime":"2023-10-15T22:04:04Z","digestEndTime":"2023-10-15T23:04:04Z","digestS3Bucket":"paul-trail","digestS3Object":"AWSLogs\/327878933619\/CloudTrail-Digest\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail-Digest_ap-northeast-1_paul-trail_us-west-2_20231015T230404Z.json.gz","digestPublicKeyFingerprint":"be2f0b997552f44942837300ba1aba9d","digestSignatureAlgorithm":"SHA256withRSA","newestEventTime":"2023-10-15T22:58:17Z","oldestEventTime":"2023-10-15T22:04:51Z","previousDigestS3Bucket":"paul-trail","previousDigestS3Object":"AWSLogs\/327878933619\/CloudTrail-Digest\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail-Digest_ap-northeast-1_paul-trail_us-west-2_20231015T220404Z.json.gz","previousDigestHashValue":"8f953371d3e85eddb89b05ed6b9e680791055315c73e1025ab5dba7bb2aee189","previousDigestHashAlgorithm":"SHA-256","previousDigestSignature":"11c11e253f4929eaded49c9d826b257a5ab894ce002988bd07ed2bc6407f1b0ef74f48634c364c6884c6470c9416d73f0742f8758746fc8db4cf23b75c713304779bb6d181ccae4b6a78ae5106f1602ce49af3f9dea4e9ba92761fcaf3e02a5f3d64558d7f4b2eff85f0cc523a770a3b1092e0e37aa665f3c37b75ecc93c94a4640825e0ebe44b2b4fa48b7477040f08a83db2224b403c46476ca25a1b53b5b5db86be04e623fef2d9a2a8eba482239439d6d49cb5eb759a90184f72506a8788fb085f56830c46f51d6e216152bf9156b33cbbee3aeeb5b00540f333708f870d316291f37dd530491a7785ddafdb83543c327fa504e200efefbadd644fed9b9a","logFiles":[{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2205Z_iRIoDMA9l9Q4kmFy.json.gz","hashValue":"4309c6161e37538de72ec6f679e86b7e45aebed71fa7e76af70c3019fef44e19","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:04:51Z","oldestEventTime":"2023-10-15T22:04:51Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2300Z_aDYIgZODwysx0Irn.json.gz","hashValue
":"de90c3b55016bc5fad9c12378ccc6fc38180a15bd95879305415572a4472b1a9","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:58:17Z","oldestEventTime":"2023-10-15T22:58:17Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2300Z_9eJ8qdKnXIfFg2wM.json.gz","hashValue":"85e79f9b40d5a57be15fa6ac6f54d3ea1919611e37ca682c1e753287ac7b9bcb","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:58:17Z","oldestEventTime":"2023-10-15T22:58:17Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2225Z_OviGSSWadUI1W1r7.json.gz","hashValue":"58583ed7d52597e47e073db9b756f38815a8a5aff92911911710f18e65e1c44d","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:20:34Z","oldestEventTime":"2023-10-15T22:10:12Z"},{"s3Bucket":"paul-trail","s3Object":"AWSLogs\/327878933619\/CloudTrail\/ap-northeast-1\/2023\/10\/15\/327878933619_CloudTrail_ap-northeast-1_20231015T2225Z_j5hj9VuYmchJHAkK.json.gz","hashValue":"c18c49161f97def10a14cffa5b5ab441c8fe8194af1cb1d79d470b6173f901c4","hashAlgorithm":"SHA-256","newestEventTime":"2023-10-15T22:20:34Z","oldestEventTime":"2023-10-15T22:20:34Z"}]}',
    language=default_lang,
)

Here's the stack trace:

  File "/Users/udayan/Desktop/theom/presidio/presidio_venv/lib/python3.12/site-packages/presidio_analyzer/analyzer_engine.py", line 207, in analyze
    current_results = recognizer.analyze(
                      ^^^^^^^^^^^^^^^^^^^
  File "/Users/udayan/Desktop/theom/presidio/presidio_venv/lib/python3.12/site-packages/presidio_analyzer/pattern_recognizer.py", line 97, in analyze
    pattern_result = self.__analyze_patterns(text, regex_flags)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/udayan/Desktop/theom/presidio/presidio_venv/lib/python3.12/site-packages/presidio_analyzer/pattern_recognizer.py", line 210, in __analyze_patterns
    validation_result = self.validate_result(current_match)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/udayan/Desktop/theom/presidio/presidio_venv/lib/python3.12/site-packages/presidio_analyzer/predefined_recognizers/crypto_recognizer.py", line 62, in validate_result
    bcbytes = self.__decode_base58(pattern_text, 25)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/udayan/Desktop/theom/presidio/presidio_venv/lib/python3.12/site-packages/presidio_analyzer/predefined_recognizers/crypto_recognizer.py", line 79, in __decode_base58
    return n.to_bytes(length, "big")
           ^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: int too big to convert
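
The last frame of the trace is `n.to_bytes(length, "big")`, called with a length of 25. `int.to_bytes` raises OverflowError whenever the integer needs more bytes than requested. The sketch below reproduces that failure outside Presidio. It assumes the decoder accumulates a base-58 number in the usual way; the function names and the guarded variant are illustrative, not the library's actual code or its fix.

```python
from typing import Optional

# Standard Bitcoin Base58 alphabet (no 0, O, I, or l).
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def decode_base58(text: str, length: int) -> bytes:
    """Decode a Base58 string into exactly `length` big-endian bytes."""
    n = 0
    for char in text:
        # str.index raises ValueError for characters outside the alphabet.
        n = n * 58 + BASE58_ALPHABET.index(char)
    # Raises OverflowError when n does not fit in `length` bytes.
    return n.to_bytes(length, "big")


def safe_decode_base58(text: str, length: int) -> Optional[bytes]:
    """Hypothetical guarded variant: return None instead of raising."""
    try:
        return decode_base58(text, length)
    except (OverflowError, ValueError):
        return None
```

For example, `decode_base58("z" * 35, 25)` raises OverflowError because the decoded value exceeds 200 bits, while `safe_decode_base58("z" * 35, 25)` returns None, which a validator could treat as "not a valid address".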

Expected behavior

The code should run without errors, or at least the exception should be documented.

Additional context

There might be a smaller input that causes the same error, but this is what I have right now.

larissaleite commented 5 months ago

Hi! Thank you so much for the fix 💪 When will there be a release that includes this change?

omri374 commented 5 months ago

Hi @larissaleite, we're waiting for a few more additions before the next release. In the meantime, I would suggest copying the crypto recognizer's code from GitHub and adding it as a new recognizer in place of the existing one, or removing the crypto recognizer if you don't expect to have crypto entities in your datasets.
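
A minimal sketch of the second option (removing the recognizer). It assumes the built-in recognizer is registered under its class name, "CryptoRecognizer", and that the engine exposes its registry as `analyzer.registry`. The helper name `drop_crypto_recognizer` is made up for illustration.

```python
def drop_crypto_recognizer(analyzer) -> None:
    """Remove the built-in crypto recognizer from an AnalyzerEngine's registry.

    Assumption: the recognizer is registered under the name "CryptoRecognizer".
    """
    analyzer.registry.remove_recognizer("CryptoRecognizer")


# Usage with the analyzer built in the reproduction script:
#   drop_crypto_recognizer(analyzer)
#   analyzer.analyze(text=..., language=default_lang)
```

After this call the analyzer no longer reports CRYPTO entities, so this only suits datasets where they are not expected.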

idelarosa3232 commented 5 months ago

@omri374, do you have an estimated date for the next release with the fix? I am also experiencing this issue but can't add custom recognizers because I use Presidio with Docker and call it via the /analyze API.
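
For API-only setups, one possible stopgap is to list the wanted entities explicitly in the request and leave CRYPTO out. This is a sketch under assumptions: it presumes the analyzer only runs recognizers for the entities requested, and the URL, port, and entity list below are placeholders rather than values taken from this thread.

```python
import json
import urllib.request

# Assumed local address of a presidio-analyzer container; adjust to your deployment.
ANALYZER_URL = "http://localhost:5002/analyze"

# Example entity names to request; CRYPTO is deliberately left out.
ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD"]


def build_analyze_request(text: str, language: str = "en") -> urllib.request.Request:
    """Build a POST request for /analyze restricted to ENTITIES."""
    payload = {"text": text, "language": language, "entities": ENTITIES}
    return urllib.request.Request(
        ANALYZER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, language: str = "en") -> list:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_analyze_request(text, language)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `analyze("some text")` requires a running analyzer service. Whether this avoids the error depends on the assumption above holding for the deployed version.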

omri374 commented 5 months ago

@idelarosa3232 this would likely take 2-3 weeks. If you need help creating a Docker image with custom recognizers, please let us know here or via email (presidio@microsoft.com).

idelarosa3232 commented 4 months ago

@omri374 looks like this bug was not fixed yet, can the issue be reopened?