Could IPLD Schemas include complex qualifiers and constraints, like regexps?

Maybe!

(Meta: This may spawn an exploration report or other docs, but I'm starting with an issue for now, as it's an early thought.)

Previously, I've been opposed to this. A primary goal of IPLD Schemas is that it must be fast (with predictable time costs) to compute whether or not they "match" some data. More powerful forms of pattern matching make this harder and harder, as does matching that gets more and more granular. It is for this reason that Schemas are built on a known set of basic structural patterns, chosen to be simple and predictable to match against, favoring patterns that can also be matched streamingly (i.e. without backtracking and the time/memory costs that non-streaming operation would imply). All matching uses purely structural elements that are easy to examine in terms of IPLD Data Model Kinds alone; in the rare cases where values are considered at all (such as keyed unions, or structure field names, etc), this operates by direct equality check (never any pattern matching inside the value data itself). Introducing complex qualifiers like regexps would seem to run against all of this: they inspect values deeply; their time costs are harder to predict; they generally make things much more complex; and so on.

But...

What if we introduced complex qualifiers (such as regexps for example) only as validators, and not as constraints used for schema matching?

The distinction may seem subtle, but the user story is this: you couldn't use regexps to describe protocol migration/evolution conditions (because libraries won't help you: any "TrySchemaStack(data, [schemaList]) (typedData | error)" helpers will error out and return completely on a validator fail, rather than proceed to probe for matching on additional schemas)... But, you could freely use regexps to describe and document rules about the data that are applied when the schema does match.
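
To make the matching-versus-validating split concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration: the `Schema` struct, its `Match` and `Validate` fields, the `ReleaseV1`/`ReleaseV2` schemas, and this particular `TrySchemaStack` signature are hypothetical and are not an existing library API. The structural check is also heavily simplified (it only looks for string fields).

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// Schema is a hypothetical stand-in for a compiled schema.
// Match is the cheap, purely structural check used for probing.
// Validate holds complex qualifiers and runs only after Match succeeds.
type Schema struct {
	Name     string
	Match    func(data map[string]any) bool
	Validate func(data map[string]any) error
}

// ErrNoMatch is returned when no schema in the stack matches structurally.
var ErrNoMatch = errors.New("no schema matched")

// TrySchemaStack probes schemas in order using structural matching only.
// If the first matching schema fails validation, the error is returned
// immediately; probing does NOT continue to later schemas.
func TrySchemaStack(data map[string]any, stack []Schema) (string, error) {
	for _, s := range stack {
		if !s.Match(data) {
			continue // structural mismatch: keep probing
		}
		if s.Validate != nil {
			if err := s.Validate(data); err != nil {
				return "", fmt.Errorf("schema %q matched but validation failed: %w", s.Name, err)
			}
		}
		return s.Name, nil
	}
	return "", ErrNoMatch
}

var semver = regexp.MustCompile(`^[0-9]+\.[0-9]+\.[0-9]+$`)

// hasString reports whether data[key] exists and is a string.
func hasString(data map[string]any, key string) bool {
	_, ok := data[key].(string)
	return ok
}

// exampleStack lists a newer schema first and an older one second.
func exampleStack() []Schema {
	return []Schema{
		{
			Name:  "ReleaseV2",
			Match: func(d map[string]any) bool { return hasString(d, "version") && hasString(d, "channel") },
			Validate: func(d map[string]any) error {
				// Safe type assertion: Match already checked that "version" is a string.
				if !semver.MatchString(d["version"].(string)) {
					return errors.New("version must look like 1.2.3")
				}
				return nil
			},
		},
		{
			Name:  "ReleaseV1",
			Match: func(d map[string]any) bool { return hasString(d, "version") },
		},
	}
}

func main() {
	stack := exampleStack()

	// Matches ReleaseV2 structurally but fails its validator:
	// an error comes back, with no fallback to ReleaseV1.
	_, err := TrySchemaStack(map[string]any{"version": "banana", "channel": "stable"}, stack)
	fmt.Println(err)

	// Does not match ReleaseV2 structurally (no "channel"),
	// so probing proceeds to ReleaseV1, which has no validator.
	name, _ := TrySchemaStack(map[string]any{"version": "banana"}, stack)
	fmt.Println(name)
}
```

The point of the sketch is that the regexp never influences which schema is selected; it can only turn a successful match into a halt.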

This could give us more power to fulfill our goals of providing a consistent and language-agnostic place to author data structure design documents, while not compromising our goals regarding protocol evolution and the critical role of flunk-fast unification to enable that.

Additionally, this delineation keeps good incrementality for schema library and tooling authors. Because it's clear that complex qualifiers like regexps aren't needed for critical core functionality like determining if a schema matches some data, it's then pretty easy to say that such complex qualifiers can be implemented "later" or "not in the mvp/v1".

I've used regexps as the example through this discussion so far, but we could equally well be talking about integer range constraints. Both are things we'd occasionally like to have.
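
As a rough illustration that regexps and integer ranges can share one shape, here is a small Go sketch. The `Validator` type and the `MatchesRegexp` and `IntRange` constructors are hypothetical names invented for this example; nothing here is an existing IPLD API.

```go
package main

import (
	"fmt"
	"regexp"
)

// Validator is a hypothetical shape for any complex qualifier:
// it inspects one value and either accepts it (nil) or explains why not.
type Validator func(value any) error

// MatchesRegexp builds a Validator that accepts strings matching pattern.
func MatchesRegexp(pattern string) Validator {
	re := regexp.MustCompile(pattern)
	return func(value any) error {
		s, ok := value.(string)
		if !ok {
			return fmt.Errorf("expected a string, got %T", value)
		}
		if !re.MatchString(s) {
			return fmt.Errorf("%q does not match %s", s, pattern)
		}
		return nil
	}
}

// IntRange builds a Validator that accepts int64 values in [lo, hi].
func IntRange(lo, hi int64) Validator {
	return func(value any) error {
		n, ok := value.(int64)
		if !ok {
			return fmt.Errorf("expected an int, got %T", value)
		}
		if n < lo || n > hi {
			return fmt.Errorf("%d is outside [%d, %d]", n, lo, hi)
		}
		return nil
	}
}

func main() {
	port := IntRange(1, 65535)
	fmt.Println(port(int64(8080)))  // accepted: prints a nil error
	fmt.Println(port(int64(70000))) // rejected: out of range

	hexColor := MatchesRegexp(`^#[0-9a-f]{6}$`)
	fmt.Println(hexColor("#00ff7f")) // accepted
	fmt.Println(hexColor("green"))   // rejected: does not match
}
```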

Notably not considering:

- any form of "unification" of complex qualifiers. This is a fascinating topic but both tricky to generalize and simply hard work, and I don't see it as sufficiently applicable to us to be worth pursuing. (See "CUE" for a system that does pursue this though; it's neat and interesting and may be worth it for them in their context.)
- any detection of logically conflicting complex qualifiers, e.g. two string types with not-very-distinguishable regexp validator rules attached to them which are composed into a struct with stringjoin representation with an even-less-distinguishable join character... yeah, no. If someone wants to make tools that try to do that, more power to them, but I don't see this having a very good cost/benefit tradeoff.

Caveats remain:

- regexp implementations are not perfectly consistent between all languages, and this could be barrels of fun. However, I think they're often "close enough" that we can actually roll with this. Additionally, mitigation paths can be designed: adjunct config is a reasonable place to do this. And finally, the separation of validator versus involved-in-matching-determinations greatly minimizes the potential for problematic outcomes (something involved-in-matching can change systemic behavior significantly by changing what a TrySchemaStack operation returns, which can generally have semantic consequences, and thus be pretty concerning; something whose only possible effect is goto-halt is much less capable of generating problems).
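
One concrete instance of the cross-language inconsistency: Go's standard `regexp` package uses RE2 syntax, which omits backreferences and lookaround, both of which PCRE-style engines accept. The sketch below only checks which patterns compile in Go; the `compiles` helper and the sample patterns are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's RE2-based regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`^[a-z][a-z0-9-]*$`, // anchors and character classes: widely portable
		`^(\w+) \1$`,        // backreference: not supported by RE2
		`^(?=.*\d).+$`,      // lookahead: not supported by RE2
	}
	for _, p := range patterns {
		fmt.Printf("%q compiles in Go: %v\n", p, compiles(p))
	}
}
```

A schema that restricted itself to the portable subset in the first pattern would sidestep most of this; whether to mandate such a subset is a separate design decision.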

Questions remain:

- do this on types... Or even fields?
  - no, scratch that. Named types only. Done.
- what on earth should the schema DSL syntax, as well as the reified DM of it, look like?
  - I have no idea. But it does seem like most of this would go in the schema document rather than in adjunct config (since the whole point is getting away from per-consuming-project needs to manually reimplement the qualifiers; and it should certainly work in runtime modes as well as codegenned)... So, we need SOME syntax, for sure. Maybe it's time to start rolling with something a bit like java annotations syntax?
- deriving errors: for regexps in particular(!), schema authors are very likely to want to define human-readable error messages to associate with failure to match a particular regexp. (Or just an i18n keystring if feeling frisky; whatever, not the scope and focus of this document.) There might even be more than one of them (with increasingly specific regexps?), or they might want to embed values from matches (oof!). How do we support this? (Do we?)
  - A simple linear list of verifiers, each with an error message to associate with its failure, would probably cover a ton of ground. I haven't thought this through exhaustively yet, though.
- how would these features interact with the idea of hashes of schemas being useful for any application level logic around schemas?
  - There might not be all that much to worry about here: the "convergence" tendencies in schemas are already extremely minimal, since one of their most central features -- type names -- is a local-only semantic that's not load-bearing other than that it must suit local referential rules. So, there's already very little (if any) load that bears on the idea of hashes of schemas being used for much other than just plain document ID.
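
The "simple linear list of verifiers" idea from the error-message question above could look roughly like the following Go sketch. The `Verifier` type, the `FirstFailure` function, and the username rules are all hypothetical, chosen only to show increasingly specific regexps paired with messages; embedding matched values into messages is deliberately left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// Verifier pairs one rule with the human-readable message
// (or i18n key) to report when that rule fails.
type Verifier struct {
	Pattern *regexp.Regexp
	Message string
}

// FirstFailure runs the verifiers in order and returns the message of the
// first one that rejects the value, or "" if every verifier accepts it.
func FirstFailure(value string, vs []Verifier) string {
	for _, v := range vs {
		if !v.Pattern.MatchString(value) {
			return v.Message
		}
	}
	return ""
}

// usernameRules goes from coarse to specific, so the reported
// message is the broadest one that applies.
var usernameRules = []Verifier{
	{regexp.MustCompile(`^.{3,32}$`), "username must be 3 to 32 characters long"},
	{regexp.MustCompile(`^[a-z0-9_]+$`), "username may contain only lowercase letters, digits, and underscores"},
	{regexp.MustCompile(`^[a-z]`), "username must start with a letter"},
}

func main() {
	for _, name := range []string{"ab", "Alice", "9lives", "alice_01"} {
		fmt.Printf("%q: %q\n", name, FirstFailure(name, usernameRules))
	}
}
```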

This is an early thought, but I wanted to get it out there. The idea of creating separate phases for matching versus (additional) validating seems to open up a potential avenue for solutions I previously would've rejected, and that's probably worth some further thought.
