ff137 opened 1 year ago
Regex patterns can be computationally expensive and difficult to read / maintain, so my proposal would be to simplify as much as possible and deduplicate the implementation, so the pattern is defined once and imported where needed.
@swcurran @dbluhm I propose creating an implementation in the following vein (thanks GPT - to be confirmed):
```python
import re

# Define the smaller components of the regex
DID_SOV_PATTERN = r"did:sov:[123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz]{21,22}"
DID_METHOD = r"did:(\w+)"
DID_IDENTIFIER = r"([\w.-]+(:[\w.-]+)*)"
DID_PARAMETERS = r"(;[\w.:%-]+=[\w.:%-]*)*"
DID_PATH = r"(/[^#?]*)?"  # '/' needs no escaping in a Python regex
DID_QUERY = r"([?][^#]*)?"
DID_FRAGMENT = r"(#.*)?"

# Combine the components into the full regex
# (seven placeholders for the seven components)
DID_PATTERN = re.compile(
    "^({}|{}:{}{}{}{}{})$".format(
        DID_SOV_PATTERN,
        DID_METHOD,
        DID_IDENTIFIER,
        DID_PARAMETERS,
        DID_PATH,
        DID_QUERY,
        DID_FRAGMENT,
    )
)
```
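To sanity-check the assembled pattern, here is a self-contained sketch (the component definitions are repeated so the snippet runs standalone, and the sample DIDs are made up; note the format string needs exactly seven placeholders for the seven components):

```python
import re

# Same components as above, repeated so this snippet is self-contained.
DID_SOV_PATTERN = r"did:sov:[123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz]{21,22}"
DID_METHOD = r"did:(\w+)"
DID_IDENTIFIER = r"([\w.-]+(:[\w.-]+)*)"
DID_PARAMETERS = r"(;[\w.:%-]+=[\w.:%-]*)*"
DID_PATH = r"(/[^#?]*)?"
DID_QUERY = r"([?][^#]*)?"
DID_FRAGMENT = r"(#.*)?"

# Seven placeholders for seven components.
DID_PATTERN = re.compile(
    "^({}|{}:{}{}{}{}{})$".format(
        DID_SOV_PATTERN,
        DID_METHOD,
        DID_IDENTIFIER,
        DID_PARAMETERS,
        DID_PATH,
        DID_QUERY,
        DID_FRAGMENT,
    )
)

# Made-up sample DIDs for a quick check.
assert DID_PATTERN.match("did:sov:WRfXPg8dantKVubE3HX8pw")
assert DID_PATTERN.match("did:web:example.com:user:alice")
assert not DID_PATTERN.match("not-a-did")
```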
This makes it more readable / maintainable, and can deduplicate the implementation throughout the codebase.
Thoughts? Can I get cracking on this?
The `did:sov` identifier pattern (`[123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz]`) can theoretically be simplified to `[1-9A-HJ-NP-Za-km-z]`, because that uses character ranges and more clearly omits the characters that should be excluded: `0`, `O`, `I`, `l`.

Alternatively, a negative lookahead pattern can be used: `(?:(?![0OIl])[1-9A-Za-z]){21,22}` (the lookahead has to sit inside the repeated group, or it only guards the first character). That may or may not be more computationally efficient, so I can do some tests to compare the runtime of the two options.
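A rough way to run that comparison, sketched with `timeit` (the sample identifiers are made up, and the iteration count is arbitrary):

```python
import re
import timeit

# Candidate 1: explicit base58 character ranges.
RANGE_RE = re.compile(r"^did:sov:[1-9A-HJ-NP-Za-km-z]{21,22}$")
# Candidate 2: negative lookahead applied per character.
# The lookahead must be inside the repeated group, or it only
# guards the first character.
LOOKAHEAD_RE = re.compile(r"^did:sov:(?:(?![0OIl])[1-9A-Za-z]){21,22}$")

SAMPLES = [
    "did:sov:WRfXPg8dantKVubE3HX8pw",  # valid, made-up identifier
    "did:sov:O000000000000000000000",  # invalid: 'O' and '0' are not base58
    "did:sov:tooshort",                # invalid: wrong length
]

def bench(pattern):
    return timeit.timeit(
        lambda: [pattern.match(s) for s in SAMPLES], number=100_000
    )

# Both patterns should agree on validity before comparing speed.
for s in SAMPLES:
    assert bool(RANGE_RE.match(s)) == bool(LOOKAHEAD_RE.match(s))

print(f"ranges:    {bench(RANGE_RE):.3f}s")
print(f"lookahead: {bench(LOOKAHEAD_RE):.3f}s")
```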
Idk ... makes it slightly clearer which characters are omitted. Code is read much more than it is written, so I think it's worth it, but any feedback will be appreciated
@dbluhm Maybe I can implement this in `pydid`? i.e. define the regex patterns for sov and generic methods in `validation.py`, then import and use them in ACA-Py validators.
@ff137 You make some good points and raise some good questions. Whether ACA-Py used it or not, I would not mind having a clean, readable set of importable DID regex patterns in PyDID. I don't think I'd want to include anything specific to a DID method though, unless it was a really common pattern across many DID methods (like a network identifier on the front of the method specific id; I think that's in at least a few DID methods).
I'm open to any improvement that makes these patterns easier to read and maintain :slightly_smiling_face:
Re: the validator being implemented in some places and not in others, I would describe ACA-Py as being in a transition period right now. Things like connections still expect legacy DID and DID Doc representations. @Jsyro is working on getting this updated to using peer DIDs (see #2249). Previously, I would say we were a bit too cautious about making breaking changes; I don't think that was totally unwarranted. The DID Core spec was still very much shaking out around the time ACA-Py was getting support for the connections protocol. However, even while connections has been using legacy DIDs, other features were added that didn't require that same caution and so they implemented full support for "modern" DIDs and DID Docs.
Eventually, everything should move to "real" qualified DIDs. That process will look different for each of those models. For instance, the `IssuerRevRegRecord` is actually in the process of being more or less deprecated on the anoncreds-rs branch as we work on adding ledger-agnostic AnonCreds support to ACA-Py. The credential exchange related models should be updatable as we progress on the AnonCreds effort, too. The connection related models should see updates with the peer DID work.
Updating the validators would probably not break things but it would be an incomplete change since support for a broader set of DIDs and DID Methods isn't here quite yet on the backend in most cases.
Love the use of ChatGPT and the suggestions, and I really like the idea of assembling the expression from its components. Putting some of the text overview of the expressions in the code is a good idea - that could make clarifying the "missing characters" easier, for example. I assume the `did:sov` pattern is there for performance, even though it shouldn't be needed. It would be interesting to know what difference it makes in performance, or if we can ignore performance entirely.
I assume it should be used in all the places it belongs. I leave it to y’all to decide what’s next on this.
Nice work!
I'm thinking there are two parts to this problem:

If the backend that uses a `ConnRecord`, for example, can only function with the did:sov method, then that should be documented, and the validator should indeed accept only did:sov keys in the `ConnRecord` body. Expanding the API to support any DID method is beyond the scope of this issue and should be tracked elsewhere.

Here, I'm mainly concerned with how the validation should be done.
When looking at the new regex pattern, what first jumped out at me was the `$$`, so I thought something might be faulty and asked GPT. What also concerned me was all the uses of the greedy quantifier `*`, which tries to match between zero and unlimited times. Greedy matching can be dangerous and exploited in ReDoS (Regular expression Denial of Service) attacks, where malicious actors send requests with extremely long DID strings, exploiting the regex validator and hanging the system.

The way it's implemented is not necessarily a vulnerability, but it's also not clear how it will perform when abused. That's actually what motivated me to post the issue, before thinking about the other points. I think the computational efficiency of the validation should also be a prime concern.
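One cheap mitigation, whichever pattern wins: bound the input length before the regex ever runs. A sketch follows; the 256-character cap and the helper name are my own arbitrary assumptions, not a spec limit or existing API:

```python
import re

# Hypothetical guard; the 256-char cap is an arbitrary assumption,
# not taken from any DID spec.
MAX_DID_LENGTH = 256
DID_RE = re.compile(r"^did:(\w+):([\w.-]+(?::[\w.-]+)*)$")

def is_plausible_did(value: str) -> bool:
    # Reject oversized input up front, so a hostile, very long string
    # never reaches the (potentially backtracking) regex.
    if len(value) > MAX_DID_LENGTH:
        return False
    return DID_RE.match(value) is not None

assert is_plausible_did("did:example:123456")
assert not is_plausible_did("did:" + "a" * 10_000)
```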
With that said, I would propose moving the DID validator logic to `pydid`, by implementing the did:sov base58 pattern there. Regex matching for that pattern is of course fine, because it has a defined character set and the identifier must be 21 or 22 characters.

For the generic DID method validation, I would propose a different methodology: a more functional approach that cannot be exploited. In the next day or two I can open a PR in `pydid` to implement something along those lines. Just wanted to share my thoughts in the meantime.
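For illustration, the functional (regex-free) core of did:sov validation could look like the sketch below; this is a rough shape of the idea, not the eventual pydid API:

```python
# Hypothetical sketch of regex-free did:sov validation; names and
# structure here are illustrative, not the pydid implementation.
BASE58_ALPHABET = set(
    "123456789ABCDEFGHJKLMNPQRSTUVWXYZ"
    "abcdefghijkmnopqrstuvwxyz"
)

def validate_sov_did(did: str) -> bool:
    prefix = "did:sov:"
    if not did.startswith(prefix):
        return False
    ident = did[len(prefix):]
    # Length check plus a set-membership scan: no backtracking is
    # possible, and the cost is strictly linear in the input size.
    return len(ident) in (21, 22) and all(c in BASE58_ALPHABET for c in ident)

assert validate_sov_did("did:sov:WRfXPg8dantKVubE3HX8pw")  # made-up identifier
assert not validate_sov_did("did:sov:O0Il")
```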
The 0.9.0 release introduces a new regex pattern in the DID validator for some request/response models:
For anyone who wants help reading this, here's a ChatGPT summary:
A few observations:

1. The regex ends in `$$`. I presume this is an oversight, because a single `$` is used to indicate the end of the string.
2. The expression `[a-zA-Z0-9_]` (used 5 times) can be simplified to `\w`, as they are identical, making it more readable.
3. This may not be needed in Python, but regex101.com suggests "An unescaped delimiter must be escaped; in most languages with a backslash (`\`)" for `(\\/[^#?]*)`, which may need to be escaped: `(\\\/[^#?]*)`.
4. The `?` after `(did:sov:)` means that the prefix isn't required, so the pattern will also match a bare 21/22-character identifier. Is this desired behavior?

Lastly, this regex / DID validator update was implemented in some places and not others. It was not added in the validators for the following models:
- `ConnRecord`
- `ConnectionStaticRequest` / `ConnectionStaticResult`
- `CredentialProposal`
- `DIDEndpoint` / `DIDEndpointWithType`
- `DIDXRequest`
- `IndyCredRequest`
- `IssuerRevRegRecord`
- `V10CredentialCreate` / `V10CredentialProposalRequestMand` / `V10CredentialProposalRequestOpt`
- `V20CredFilterIndy`
... in other words, those models still only have DID validators for did:sov, and not for other methods. Is that intended?

I'm happy to contribute the changes to improve on this, but I'll need feedback on whether those models were intentionally kept with only did:sov validation, or whether the expanded DID validation can be applied there too.
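To make the observation about the optional prefix concrete: with a `(did:sov:)?` group, a bare identifier validates too. A minimal reproduction (this is a simplified stand-in, not the actual 0.9.0 regex):

```python
import re

# Simplified stand-in for the 0.9.0 validator, NOT the actual pattern:
# the trailing '?' makes the did:sov: prefix optional.
PATTERN = re.compile(r"^(did:sov:)?[1-9A-HJ-NP-Za-km-z]{21,22}$")

assert PATTERN.match("did:sov:WRfXPg8dantKVubE3HX8pw")  # made-up identifier
# The bare 22-character identifier matches as well -- is that intended?
assert PATTERN.match("WRfXPg8dantKVubE3HX8pw")
```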