Afform Auditor: Defining schemas

The auditor is a component that identifies issues in stored forms. This, in turn, involves two general areas:

Schema/ruleset design - How do you define "valid" and "invalid"? (Ex: Create a whitelist of supported tags.)
User/developer experience - When should the forms be checked for validity, and how should this be communicated? (Ex: Check the rulesets after any extensions are upgraded.)

This issue is specifically focusing on the schema/ruleset side.

Example Rules and Mental Model

Before we dig into specific tools or algorithms, let's consider some rules we'd like to articulate in English:

Any tag, attribute, or CSS-class that isn't explicitly whitelisted should generate a warning.
Any afform created by an extension or via API becomes a component/directive/tag that is available for use in others.
For forms customized in the GUI, the <af-model-list> must be present in the top-level form.
The following standard HTML tags are allowed for general organization:
- div span p h1 h2 h3 h4 h5 h6 fieldset label pre blockquote
The following standard HTML tags are allowed for styling of text:
- a strong em tt code sub sup
The following standard HTML tags should raise a warning:
- b i center
The following standard Angular directives are allowed anywhere:
- ng-if ng-show ng-hide ng-class
The following standard Angular directives are allowed on links and buttons:
- ng-click
The following Afform directives are allowed for working with data:
- af-model-list af-model-prop af-model af-field aff-api4-action
The following BootstrapCSS classes are allowed, provided they are put on <a> or <button>:
- btn btn-default btn-primary ...

In considering these rules, it stands out to me that we have distinct sets of rules which can be combined. Thus, the "set of rules for user-editable forms" is the result of combining the "set of rules for basic HTML" plus "set of rules for basic AngularJS" plus "set of rules for BootstrapCSS" plus "set of rules for Afform data-handling".
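To make the "combine distinct sets" idea concrete, here is a minimal sketch. It assumes each ruleset is just a named list of allowed tags/directives; the function name `combineRulesets()` and the ruleset names are hypothetical, not part of Afform.

```php
<?php
/**
 * Hypothetical helper (not an Afform API): build one ruleset by
 * concatenating several named rulesets from a library.
 */
function combineRulesets(array $library, array $names): array {
  $combined = [];
  foreach ($names as $name) {
    if (!isset($library[$name])) {
      throw new InvalidArgumentException("Unknown ruleset: $name");
    }
    // Rules from later rulesets are appended after earlier ones.
    $combined = array_merge($combined, $library[$name]);
  }
  return $combined;
}

// Illustrative rulesets only; the real lists would be longer.
$library = [
  'html-basic' => ['div', 'span', 'p'],
  'angular-basic' => ['ng-if', 'ng-show', 'ng-hide'],
  'afform-data' => ['af-model-list', 'af-model', 'af-field'],
];

$guiEditable = combineRulesets($library, ['html-basic', 'afform-data']);
```

The point is only that "rules for user-editable forms" need not be written out by hand; it can be derived from smaller sets.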

Existing Standards

Tools for validating SGML/HTML/XML have been around as long as SGML/HTML/XML have been around. There are three widely used standards for defining XML rules: Document Type Definition (DTD), XML Schema Definition (XSD), and RelaxNG (RNG).

These three systems have an obvious strength: there are many tools, libraries, tutorials, books, StackExchange questions, etc. that deal with them. You will find many examples of how to create the schema for a document like this:

<library>
   <book>
     <author><name>Lewis Carroll</name><dob>1832-01-27</dob></author>
     <title>Alice in Wonderland</title>
   </book>
</library>

I started refreshing myself a bit on these - and, in particular, I liked this RelaxNG book. Two interesting things:

RelaxNG supports a "compact" notation - which I find more readable than DTD or XSD.
RelaxNG is positioned as the more extensible of the three. For example, consider this snippet (chap 10): an upstream provider has defined an original schema in library.rnc, and a downstream consumer creates a custom variant of the schema in which element-name (i.e. any <name> tag) must have text content of at most 80 characters.
```
include "library.rnc" {
   element-name = element name { xsd:token{maxLength = "80"} }
}
```

So... can we address the bulk of Afform validation by delegating out to a standard library and making a few XML config files? I initially hoped so, but I'm starting to think that something more is needed:

The developer-stories for DTD, XSD, and RNG all begin with creating a new dialect top-down. This feels right if you're designing the contract for a web-service. But in this case, we're actually appropriating an existing dialect (HTML) and mixing-in a set of changes based on policy/configuration/data. RNG is more plausible here than DTD, but even there it feels like there's something missing.
IMHO, there's a functional need around HTML class validation. I haven't heard of or imagined a way to use any common XML schema standard to effectively model HTML class constraints. (CSS notation does this better...)
The af-field, af-model, and af-model-prop elements should express field-names/entity-names/entity-types which match each other. I can see how to enforce this with, e.g., PHP logic - but I'm struggling to see how to express those constraints in a schema.
To my understanding, the validators are rather binary - either the document passes or fails. In reality, I think it's valuable to support shades-of-grey like "X is deprecated" or "Y is experimental".
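As a rough illustration of the "shades-of-grey" point, a validator could return a list of findings with severities instead of a single pass/fail. This is only a sketch; the constant names and the finding structure are assumptions of mine, not an existing API.

```php
<?php
// Hypothetical severity levels, ordered from least to most serious.
const AUDIT_OK = 0;
const AUDIT_NOTICE = 1;   // e.g. "Y is experimental"
const AUDIT_WARNING = 2;  // e.g. "X is deprecated"
const AUDIT_ERROR = 3;    // e.g. the form cannot work at all

/**
 * Summarize a list of findings by its most serious severity.
 * Each finding is assumed to be ['severity' => int, 'message' => string].
 */
function worstSeverity(array $findings): int {
  $worst = AUDIT_OK;
  foreach ($findings as $finding) {
    $worst = max($worst, $finding['severity']);
  }
  return $worst;
}

$findings = [
  ['severity' => AUDIT_WARNING, 'message' => 'X is deprecated'],
  ['severity' => AUDIT_NOTICE, 'message' => 'Y is experimental'],
];
$summary = worstSeverity($findings);
```

A binary validator corresponds to collapsing everything into `AUDIT_OK` vs `AUDIT_ERROR`; keeping the intermediate levels lets the UI decide how loudly to complain.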

I think it's worth verbalizing a bit about other ways to organize rules.

Concept: CSS-like Validation DSL

In this pseudocode sketch, the basic concept is to list selectors and apply some policy to the matching elements. It resembles CSS. To wit: Given a selector (h1,h2,h3 or a.btn), mark the identified elements as valid/OK. Or, given a selector, mark the matching elements with a warning. Or... call a PHP function to evaluate each matching element.

@define html-content {
  div, span, p, h1, h2, h3, h4, h5, h6, blockquote, pre {ok}
}
@define html-style {
  strong, em, tt, code, del, sub, sup, cite {ok}
  b, i, strike, center {
    /* "ok" above was a special short-hand for "$this->ok();", but generally... these are PHP blocks */ 
    $this->warn('The old school layout tags are deprecated. Use a semantic tag like <strong> or <em>.');
  }
}
@define afform-data {
  af-model-list {ok}
  af-model-prop, af-model-prop[name], af-model-prop[type] {ok}
  af-model, af-model[name] {ok}
  af-field, af-field[field-name] {ok}
  af-model-prop { 
    /* Run the PHP code on each matching element */
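    /* Static initializers must be constant expressions before PHP 8.3, so initialize lazily */
    static $entityTypes = NULL;
    $entityTypes = $entityTypes ?? Civi\Api4\Entity::get()->addSelect('name')->execute()->indexBy('name');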
    if (!isset($entityTypes[$this['type']])) {
      $this->warn(ts('Unknown entity type!'));
    }
  }
  af-model { checkAfformModelName($this); }
}
@define bootstrap-style {
  .btn {
    if ($this->getName() == 'a' || $this->getName()  == 'button') { $this->ok(); }
    else { $this->warn(ts('Bootstrap buttons must use <A> or <BUTTON>.')); }
  }
  .btn-default, .btn-primary, ... {
    if ($this->hasClass('btn')) { $this->ok(); }
    else { $this->warn(ts('Bootstrap button decorators must be used with "btn".')); }
  }
}

/** A user-editable form may contain a mix of some HTML, some Angular, and some BootstrapCSS */
@define afform-gui-editable {
  @include html-content, html-style, afform-data, bootstrap-style
}

What I like about this: The DSL is concise. If you know CSS and a little PHP, then it should be fairly easy to read. (The GitHub CSS syntax highlighter even does a nice job despite some non-CSS bits.) CSS provides notations for tag-elements, attributes, and HTML classes.

What I don't like about this: Using a DSL requires more parsing work. If you need to generate rules programmatically (e.g. via hook), then you have to understand and emit another notation.

Concept: Array of Selector/Action Rules

In this sketch, we avoid the need to implement a DSL. Just expose an array data-structure.

/**
 * @var array $rulesets
 *  Ex: $rulesets['myrule'][] = ['match' => string $cssSelector, 'call' => mixed $callable];
 */
$rulesets = [
  'html-content' => [
    ['match' => 'div, span, p, h1, h2, h3, h4, h5, h6', 'call' => Auditor::OK],
  ],
  'html-style' => [
    ['match' => 'strong, em, tt, code, del, sub, sup, cite', 'call' => Auditor::OK],
    ['match' => 'b, i, strike, center', 'call' => function($ctx) {
      $ctx->warn(ts('The old school layout tags are deprecated. Use a semantic tag like <strong> or <em>.'));
    }],
  ],
  // et al...
];

What I like: It's amenable to hooking and merging; it can be amenable to serialization (depending on what callback notations are allowed). It's easy to imagine adding more metadata to each rule (like a weight or a symbolic name).

What I don't like: The array-structure gets fairly deep and doesn't document itself.
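For a feel of how an auditor might consume such an array, here is a deliberately simplified sketch. Assumptions: `auditMarkup()` is a hypothetical name; 'match' is restricted to a comma-separated list of tag names (real CSS selectors would need a selector engine, e.g. a library such as symfony/css-selector that converts selectors to XPath); a 'call' of NULL stands in for "OK"; and callbacks return a warning string rather than calling `$ctx->warn()`.

```php
<?php
/**
 * Hypothetical auditor: walk every element and apply the matching rules.
 * Elements that match no rule at all produce a default warning.
 */
function auditMarkup(string $markup, array $rules): array {
  $doc = new DOMDocument();
  // Wrap in a synthetic root so fragments with several top-level tags parse.
  if (!@$doc->loadXML('<root>' . $markup . '</root>')) {
    throw new InvalidArgumentException('Markup is not well-formed.');
  }
  $warnings = [];
  foreach ($doc->documentElement->getElementsByTagName('*') as $element) {
    $matched = FALSE;
    foreach ($rules as $rule) {
      $tags = array_map('trim', explode(',', $rule['match']));
      if (!in_array($element->tagName, $tags, TRUE)) {
        continue;
      }
      $matched = TRUE;
      if (is_callable($rule['call'])) {
        $message = $rule['call']($element);
        if ($message !== NULL) {
          $warnings[] = $message;
        }
      }
    }
    if (!$matched) {
      $warnings[] = "<{$element->tagName}> is not explicitly allowed.";
    }
  }
  return $warnings;
}

$rules = [
  ['match' => 'div, p, strong', 'call' => NULL],
  ['match' => 'b, center', 'call' => function (DOMElement $el) {
    return "<{$el->tagName}> is deprecated.";
  }],
];
$warnings = auditMarkup('<div><b>Hi</b><marquee>x</marquee></div>', $rules);
```

Here `<div>` passes silently, `<b>` triggers the callback's warning, and `<marquee>` falls through to the "not explicitly allowed" default, which mirrors the first English rule above.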

Concept: Fluent Rule Builder

This sketch uses the same mental model as the other two (match a CSS selector; specify a callback function). However, it uses a fluent OOP style to build the rules. Some of the fluent functions (ok($cssSelector) or warn($cssSelector, $message)) are short-cuts for registering callback functions.

$rulesets = Civi::service('afform_rule_sets');

$rulesets->define('html-content')
  ->ok('div, span, p, h1, h2, h3, h4, h5, h6');

$rulesets->define('html-style')
  ->ok('strong, em, tt, code, del, sub, sup, cite')
  ->warn('b, i, strike', ts('The old school layout tags are deprecated. Use a semantic tag like <strong> or <em>.'))
  // Or... an equivalent but more general-purpose notation...
  ->call('b, i, strike', function($ctx){
      $ctx->warn(ts('The old school layout tags are deprecated. Use a semantic tag like <strong> or <em>.'));
  });

What I like: You get better IDE support (autocomplete/drilldown).

What I don't like: The canonical form is PHP code that's hard to serialize/transmit. Adding in weights and symbolic-names may not be as pretty.
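One possible way to soften the serialization concern is to have the fluent shortcuts record plain data rather than closures, so the builder can be exported to the array form. The sketch below is an assumption-heavy illustration: the `Ruleset` and `RulesetRegistry` classes and the 'action' key are hypothetical, and the general-purpose `call()` with a closure is intentionally left out because closures are the part that doesn't serialize.

```php
<?php
/** Hypothetical fluent ruleset; each shortcut appends a plain-data rule. */
final class Ruleset {
  private array $rules = [];

  public function ok(string $selector): self {
    $this->rules[] = ['match' => $selector, 'action' => 'ok'];
    return $this;
  }

  public function warn(string $selector, string $message): self {
    $this->rules[] = ['match' => $selector, 'action' => 'warn', 'message' => $message];
    return $this;
  }

  public function toArray(): array {
    return $this->rules;
  }
}

/** Hypothetical registry playing the role of the "rule sets" service. */
final class RulesetRegistry {
  private array $rulesets = [];

  public function define(string $name): Ruleset {
    return $this->rulesets[$name] = new Ruleset();
  }

  public function toArray(): array {
    return array_map(fn(Ruleset $r) => $r->toArray(), $this->rulesets);
  }
}

$registry = new RulesetRegistry();
$registry->define('html-style')
  ->ok('strong, em')
  ->warn('b, i', 'Deprecated tag.');
$exported = $registry->toArray();
```

With this shape, the fluent API is mostly sugar over the array-of-rules concept, and `$exported` can be passed through `json_encode()` or a hook.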
