Closed balderman1 closed 4 years ago
Hi @balderman, Which regex engine are you using? pyre2 or regex?
I am using the python builtin regex (not re2)
We're looking into it.
I tried to get into the code and see what caused the bug.
Here are my understandings:
PatternRecognizer comes with an 'abstract' method def validate_result(self, pattern_text):
.
According to the method doc, this method is called after there was a match against the RegExp.
The implementation of CryptoRecognizer comes with an assumption that the matched string contains only chars that are defined in digits58
. This is a wrong assumption for some of the inputs that creates a match against the RegExp.
So I think we have 2 options here:
digits58
is wrong. def validate_result(self, pattern_text):
# try:
bcbytes = CryptoRecognizer.__decode_base58(pattern_text, 25)
result = bcbytes[-4:] == sha256(sha256(bcbytes[:-4])
.digest()).digest()[:4]
return result
@staticmethod
def __decode_base58(bc, length):
digits58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
n = 0
for char in bc:
n = n * 58 + digits58.index(char) # Exception is thrown here since the char in not in digits58
return n.to_bytes(length, 'big')
Below is a code that reproduce the issue.
The bottom line is that 0 is not included in digits58 but it is a part of the matched string
import re
REGEX = r'\b[13][a-km-zA-HJ-NP-Z0-9]{26,33}\b'
input_str = '/boardingPass/v1/devices/34e7b5e1a0aa1d6f3d862b52a289cdb7/registrations/pass.apoc.wallet/'
def __decode_base58(bc):
digits58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
n = 0
for char in bc:
print('current char is: {}'.format(char))
try:
n = n * 58 + digits58.index(char)
except ValueError as e:
print('Exception was thrown for char {}. It means that the char is "{}" is not a char in "digits58"'.format(
char, char))
matches = re.finditer(
REGEX,
input_str,
flags=re.IGNORECASE | re.DOTALL | re.MULTILINE)
for match in matches:
start, end = match.span()
current_match = input_str[start:end]
print('\t Match:' + current_match)
__decode_base58(current_match)
print('done')
output
Match:34e7b5e1a0aa1d6f3d862b52a289cdb7
current char is: 3
current char is: 4
current char is: e
current char is: 7
current char is: b
current char is: 5
current char is: e
current char is: 1
current char is: a
current char is: 0
Exception was thrown for char 0. It means that the char is "0" is not a char in "digits58"
current char is: a
current char is: a
current char is: 1
current char is: d
current char is: 6
current char is: f
current char is: 3
current char is: d
current char is: 8
current char is: 6
current char is: 2
current char is: b
current char is: 5
current char is: 2
current char is: a
current char is: 2
current char is: 8
current char is: 9
current char is: c
current char is: d
current char is: b
current char is: 7
done
My current (brutal..) workaround is
engine = AnalyzerEngine()
del engine.registry.recognizers[1] # CryptoRecognizer is the second item in the list
If any of the recognizers fail for any reason and you don't need the PII entity supported by that recognizer, the right solution would be to call the analyzer with the set of entities you're interested in. The analyzer would not run a recognizer if the entity type it detects is not requested.
For example, by adding a template with the set of requested fields:
echo -n '{"fields":[{"name":"PHONE_NUMBER"}, {"name":"LOCATION"}, {"name":"DATE_TIME"}]}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>
instead of
echo -n {"allFields":true}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>
When calling the engine analyze API like
and the value of 'text_to_analyze' is
"/boardingPass/v1/devices/34e7b5e1a0aa1d6f3d862b52a289cdb7/registrations/pass.apoc.wallet/"
The exception below is thrown
File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/analyzer_engine.py", line 204, in analyze current_results = recognizer.analyze(text, entities, nlp_artifacts) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/pattern_recognizer.py", line 61, in analyze pattern_result = self.__analyze_patterns(text) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/pattern_recognizer.py", line 144, in __analyze_patterns validation_result = self.validate_result(current_match) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/predefined_recognizers/crypto_recognizer.py", line 23, in validate_result bcbytes = CryptoRecognizer.__decode_base58(pattern_text, 25) File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/predefined_recognizers/crypto_recognizer.py", line 33, in __decode_base58 n = n * 58 + digits58.index(char)
ValueError: substring not found