microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License
3.89k stars 579 forks source link

crypto_recognizer throws an exception #241

Closed balderman1 closed 4 years ago

balderman1 commented 5 years ago

When calling the engine analyze API like

 response = engine.analyze(correlation_id=0,
                                      text=text_to_analyze,
                                      language='en',
                                      entities=[],
                                      all_fields=True,
                                      score_threshold=0.5)

and the value of 'text_to_analyze' is

"/boardingPass/v1/devices/34e7b5e1a0aa1d6f3d862b52a289cdb7/registrations/pass.apoc.wallet/"

The exception below is thrown

File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/analyzer_engine.py", line 204, in analyze current_results = recognizer.analyze(text, entities, nlp_artifacts) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/pattern_recognizer.py", line 61, in analyze pattern_result = self.__analyze_patterns(text) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/pattern_recognizer.py", line 144, in __analyze_patterns validation_result = self.validate_result(current_match) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/predefined_recognizers/crypto_recognizer.py", line 23, in validate_result bcbytes = CryptoRecognizer.__decode_base58(pattern_text, 25) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/predefined_recognizers/crypto_recognizer.py", line 33, in __decode_base58 n = n * 58 + digits58.index(char)

ValueError: substring not found

omri374 commented 5 years ago

Hi @balderman, Which regex engine are you using? pyre2 or regex?

balderman1 commented 5 years ago

I am using the python builtin regex (not re2)

omri374 commented 5 years ago

We're looking into it.

balderman1 commented 4 years ago

I tried to get into the code and see what caused the bug. Here are my understandings: PatternRecognizer comes with an 'abstract' method def validate_result(self, pattern_text):. According to the method doc, this method is called after there was a match against the RegExp. The implementation of CryptoRecognizer comes with an assumption that the matched string contains only chars that are defined in digits58. This is a wrong assumption for some of the inputs that creates a match against the RegExp. So I think we have 2 options here:

  1. The RegExp creates a false match and this is the part that should be fixed OR
  2. The RegExp creates a true match but the assumption that the matched string must contain only chars that are defined in digits58 is wrong.
    def validate_result(self, pattern_text):
        # try:
        bcbytes = CryptoRecognizer.__decode_base58(pattern_text, 25)
        result = bcbytes[-4:] == sha256(sha256(bcbytes[:-4])
                                        .digest()).digest()[:4]
        return result

    @staticmethod
    def __decode_base58(bc, length):
        digits58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
        n = 0
        for char in bc:
            n = n * 58 + digits58.index(char) # Exception is thrown here since the char in not in digits58 
        return n.to_bytes(length, 'big')
balderman1 commented 4 years ago

Below is a code that reproduce the issue.

The bottom line is that 0 is not included in digits58 but it is a part of the matched string

import re

REGEX = r'\b[13][a-km-zA-HJ-NP-Z0-9]{26,33}\b'
input_str = '/boardingPass/v1/devices/34e7b5e1a0aa1d6f3d862b52a289cdb7/registrations/pass.apoc.wallet/'

def __decode_base58(bc):
    digits58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
    n = 0
    for char in bc:
        print('current char is: {}'.format(char))
        try:
            n = n * 58 + digits58.index(char)
        except ValueError as e:
            print('Exception was thrown for char {}. It means that the char is "{}" is not a char in "digits58"'.format(
                char, char))

matches = re.finditer(
    REGEX,
    input_str,
    flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)

for match in matches:
    start, end = match.span()
    current_match = input_str[start:end]
    print('\t Match:' + current_match)
    __decode_base58(current_match)
print('done')

output

     Match:34e7b5e1a0aa1d6f3d862b52a289cdb7
current char is: 3
current char is: 4
current char is: e
current char is: 7
current char is: b
current char is: 5
current char is: e
current char is: 1
current char is: a
current char is: 0
Exception was thrown for char 0. It means that the char is "0" is not a char in "digits58"
current char is: a
current char is: a
current char is: 1
current char is: d
current char is: 6
current char is: f
current char is: 3
current char is: d
current char is: 8
current char is: 6
current char is: 2
current char is: b
current char is: 5
current char is: 2
current char is: a
current char is: 2
current char is: 8
current char is: 9
current char is: c
current char is: d
current char is: b
current char is: 7
done
balderman1 commented 4 years ago

My current (brutal..) workaround is

engine = AnalyzerEngine()
del engine.registry.recognizers[1] # CryptoRecognizer is the second item in the list
omri374 commented 4 years ago

If any of the recognizers fail for any reason and you don't need the PII entity supported by that recognizer, the right solution would be to call the analyzer with the set of entities you're interested in. The analyzer would not run a recognizer if the entity type it detects is not requested.

For example, by adding a template with the set of requested fields:

echo -n '{"fields":[{"name":"PHONE_NUMBER"}, {"name":"LOCATION"}, {"name":"DATE_TIME"}]}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>

instead of


echo -n {"allFields":true}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>