Regex pattern used in DID validation

The 0.9.0 release introduces a new regex pattern in the DID validator for some request/response models:

pattern = r"^(did:sov:)?[123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz]{21,22}$|^did:([a-zA-Z0-9_]+):([a-zA-Z0-9_.%-]+(:[a-zA-Z0-9_.%-]+)*)((;[a-zA-Z0-9_.:%-]+=[a-zA-Z0-9_.:%-]*)*)(\\/[^#?]*)?([?][^#]*)?(\#.*)?$$"

For anyone who wants help reading this, here's a ChatGPT summary:

This regular expression pattern is designed to match the string format of a Decentralized Identifier (DID), as defined in the DID specification. Let's break it down to understand it:

^(did:sov:)?[123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz]{21,22}$: This part matches DIDs that are specifically using the Sovrin method (did:sov:). It allows for this prefix to be optional, as indicated by the ? after the group. The {21,22} specifies that the DID identifier itself should be 21 or 22 characters long, and only certain alphanumeric characters are allowed (note that certain characters like '0', 'I', 'O' and 'l' are omitted, probably to avoid confusion).

^did:([a-zA-Z0-9_]+):([a-zA-Z0-9_.%-]+(:[a-zA-Z0-9_.%-]+)*)((;[a-zA-Z0-9_.:%-]+=[a-zA-Z0-9_.:%-]*)*)(\\/[^#?]*)?([?][^#]*)?(\#.*)?$$: This part matches DIDs that use other methods. It is broken down as follows:

did:: The DID must start with 'did:'

([a-zA-Z0-9_]+):: The method used in the DID, it could be alphanumeric or underscore.

([a-zA-Z0-9_.%-]+(:[a-zA-Z0-9_.%-]+)*): The identifier part of the DID, followed by optional additional parts separated by colons.

(;[a-zA-Z0-9_.:%-]+=[a-zA-Z0-9_.:%-]*)*: Optional DID parameters, each beginning with a semicolon, following a key=value format.

(\\/[^#?]*)?: Optional path, beginning with a slash and not containing any # or ?.

([?][^#]*)?: Optional query string, beginning with a question mark and not containing any #.

(\#.*)?: Optional fragment, beginning with a #.

$$: An apparent mistake, which should probably be just a single $ to denote the end of the string.

As you can see, the pattern indeed appears overcomplicated and there are parts that might be improved. For example, the use of $$ at the end seems like a mistake; it should be just a $ to denote the end of the string.

The regular expression could possibly be simplified further, but it depends on the exact rules of the DID structure that you're looking to enforce. If you need to enforce all the rules as they are currently defined, the regex might already be as simple as it can be. But if you're willing to relax some of those rules (e.g. the specific character set restrictions or the optional parts), it could be made simpler.

In conclusion, this regular expression is already quite optimized given the complexity of the DID specification, with the possible exception of the $$ at the end, which should probably be a single $. If the DIDs you're working with don't require this level of complexity, you might be able to simplify the regular expression further.

A few observations:

the regex ends in $$. I presume this is an oversight, because a single $ is used to indicate the end of the string
the expressions [a-zA-Z0-9_] (used 5 times) can be simplified with \w, as they are identical, making it more readable.
This may not be needed in python, but regex101.com suggests:
- / An unescaped delimiter must be escaped; in most languages with a backslash (\)
- This refers to the / character in the block (\\/[^#?]*), which may need to be escaped: (\\\/[^#?]*)
the ? after (did:sov:) means that the prefix isn't required and will also match on just the 21/22 character identifier. Is this desired behavior?

Lastly, this regex / did validator update was implemented in some places and not others. It was not added in the validators for the following models:

ConnRecord
ConnectionStaticRequest / ConnectionStaticResult
CredentialProposal
DIDEndpoint / DIDEndpointWithType
DIDXRequest
IndyCredRequest
IssuerRevRegRecord
V10CredentialCreate / V10CredentialProposalRequestMand / V10CredentialProposalRequestOpt
V20CredFilterIndy

... in other words, those models still only have DID validators for did:sov, and not other methods. Is that intended?

I'm happy to contribute the changes to improve on this, but I'll please need feedback on whether those models were intentionally kept with only did:sov validation, or if the expanded did validation can be applied there too.

Regex patterns can be computationally expensive ... difficult to read / maintain ... so my proposal would be to simplify as much as possible, and deduplicate the implementation, so the pattern is defined once and imported where needed.

@swcurran @dbluhm I propose creating an implementation in the following vein (thanks GPT - to be confirmed):

# Define the smaller components of the regex
DID_SOV_PATTERN = r"did:sov:[123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz]{21,22}"
DID_METHOD = r"did:(\w+)"
DID_IDENTIFIER = r"([\w.-]+(:[\w.-]+)*)"
DID_PARAMETERS = r"(;[\w.:%-]+=[\w.:%-]*)*"
DID_PATH = r"(\\/[^#?]*)?"
DID_QUERY = r"([?][^#]*)?"
DID_FRAGMENT = r"(\#.*)?"

# Combine the components into the full regex
DID_PATTERN = re.compile("^({}|{}:{}{}{}{}{}{})$".format(
    DID_SOV_PATTERN,
    DID_METHOD,
    DID_IDENTIFIER,
    DID_PARAMETERS,
    DID_PATH,
    DID_QUERY,
    DID_FRAGMENT
))

This makes it more readable / maintainable, and can deduplicate the implementation throughout the codebase.

Thoughts? Can I get cracking on this?

The did:sov identifier pattern ([123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz]) can theoretically be simplified: [1-9A-HJ-NP-Za-km-z] - because that uses character ranges, and more clearly omits the characters that should be ignored: 0, O, l, I.

Alternatively, a negative lookahead pattern can be used: (?![0OIl])[1-9A-Za-z]{21,22}. Which may or not be more computationally efficient. So I can do some tests to compare runtime of the two options.

Idk ... makes it slightly clearer which characters are omitted. Code is read much more than it is written, so I think it's worth it, but any feedback will be appreciated

@dbluhm Maybe I can implement this in pydid? i.e. define the regex patterns for sov and generic methods in validation.py, then import and use in ACA validators

@ff137 You make some good points and raise some good questions. Whether ACA-Py used it or not, I would not mind having a clean, readable set of importable DID regex patterns in PyDID. I don't think I'd want to include anything specific to a DID method though, unless it was a really common pattern across many DID methods (like a network identifier on the front of the method specific id; I think that's in at least a few DID methods).

I'm open to any improvement that makes these patterns easier to read and maintain :slightly_smiling_face:

Re: the validator being implemented in some places and not in others, I would describe ACA-Py as being in a transition period right now. Things like connections still expect legacy DID and DID Doc representations. @Jsyro is working on getting this updated to using peer DIDs (see #2249). Previously, I would say we were a bit too cautious about making breaking changes; I don't think that was totally unwarranted. The DID Core spec was still very much shaking out around the time ACA-Py was getting support for the connections protocol. However, even while connections has been using legacy DIDs, other features were added that didn't require that same caution and so they implemented full support for "modern" DIDs and DID Docs.

Eventually, everything should move to "real" qualified DIDs. That process will look different for each of those models. For instance, the IssuerRevRegRecord is actually in the process of being more or less deprecated on the anoncreds-rs branch as we work on adding ledger agnostic AnonCreds support to ACA-Py. The Credential exchange related models should be updatable as we progress on the AnonCreds effort, too. The connection related models should see updates with peer DID work.

Updating the validators would probably not break things but it would be an incomplete change since support for a broader set of DIDs and DID Methods isn't here quite yet on the backend in most cases.

Love the use of ChatGPT and the suggestions, and I really like the idea of the assembling of the expression from its components. Putting in some of the text overview of the expressions in the code is a good idea - that could make trying to clarify the “missing characters” easier, for example. I assume the did:sov pattern is there for performance, even though it shouldn’t be needed. It would be interesting to know what difference it makes in performance — or if we can ignore performance entirely.

I assume it should be used in all the places it belongs. I leave it to y’all to decide what’s next on this.

Nice work!

I'm thinking there's two parts to this problem:

which did methods are supported in the backend,
and how did method validation is enforced in the request/response schemas.

If the backend that uses a ConnRecord, for example, can only function with the did:sov method, then that should be documented and indeed validate for only did:sov keys in the ConnRecord body. Expanding the API to support any did method is beyond the scope of this issue, and should be tracked elsewhere.

Here, I'm mainly concerned with how the validation should be done.

When looking at the new regex pattern, what first jumped out to me was the $$, so I thought something may be faulty and asked GPT. What also concerned me was all the uses of the greedy quantifiers *, which tries to match between 0 and unlimited times ... greedy matching can be dangerous and exploited in ReDoS attacks (Regular expression Denial of Service), where malicious actors send requests with incredibly long DID strings, exploiting the regex validator and hanging the system. The way it's implemented is not necessarily a vulnerability, but it's also not clear how it will perform when it's abused. So that's actually what motivated me to post the issue, before thinking about the other points. I think the computational efficiency of the validation should also be a prime concern.

With that said, I would propose moving the DID validator logic to pydid, by implementing the did:sov base 58 pattern there. Regex matching for that pattern is of course fine because it has a defined char set and the identifier must be 21/22 characters.

For the generic DID method validation, I would propose using a different methodology, using a more functional approach that cannot be exploited. In the next day or two I can open a PR in pydid to implement something along those lines. Just wanted to share my thoughts in the meantime

hyperledger / aries-cloudagent-python

Regex pattern used in DID validation #2347