jamesaoverton / cmi-pb-terminology

CMI-PB Controlled Terminology

Consider using datatype hierarchy only when specific datatype is invalid #47

Closed jamesaoverton closed 2 years ago

jamesaoverton commented 2 years ago

In VALVE and our current prototype the datatypes form a tree, so a 'CURIE' datatype might have ancestors: 'word', 'nonspace', 'trimmed_line', 'trimmed_text', 'text'. The goals of this design include:

  1. use the hierarchy to make child regular expressions simpler
  2. use the hierarchy to identify the exact problem, e.g. " foo:bar" is not a 'CURIE' because it fails 'trimmed_text', so we can tell the user to remove the leading whitespace
  3. use the hierarchy to identify appropriate suggestions for fixes, e.g. automatically suggest trimming leading whitespace to get "foo:bar"

The current VALVE and prototype implementations check, for each cell, all the ancestor datatypes from most general to most specific, stopping at the first failure.
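A minimal sketch of that general-to-specific check, for illustration only. The datatype names come from the list above (the chain is shortened by leaving out 'word'), but the regular expressions, the `DATATYPES` table, and the helper names are assumptions, not VALVE's actual definitions.

```python
import re

# Hypothetical datatype table: name -> (parent, kind, pattern).
# "match" means the whole value must match; "search" means the pattern
# may occur anywhere. The patterns are illustrative guesses.
DATATYPES = {
    "text": (None, "match", r".*"),
    "trimmed_text": ("text", "match", r"\S(.*\S)?"),
    "trimmed_line": ("trimmed_text", "match", r"[^\n]*"),
    "nonspace": ("trimmed_line", "match", r"\S*"),
    "CURIE": ("nonspace", "search", r":"),
}


def passes(name, value):
    """Check a value against one datatype's own condition only."""
    _, kind, pattern = DATATYPES[name]
    regex = re.compile(pattern, re.DOTALL)
    if kind == "match":
        return regex.fullmatch(value) is not None
    return regex.search(value) is not None


def lineage(name):
    """Return the datatype and its ancestors, most general first."""
    chain = []
    while name is not None:
        chain.append(name)
        name = DATATYPES[name][0]
    return list(reversed(chain))


def first_failure(name, value):
    """Walk from the root down, stopping at the first datatype that fails."""
    for datatype in lineage(name):
        if not passes(datatype, value):
            return datatype
    return None
```

With these assumed patterns, `first_failure("CURIE", " foo:bar")` returns `"trimmed_text"`, which is the information needed to tell the user to remove the leading whitespace. Note that `search(/:/)` for 'CURIE' is only meaningful because every ancestor has already passed.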

I still like design goals (2) and (3), but the problem that (1) addresses, overly complex regular expressions, has not turned out to be a big one. With the new match() function I'm finding that we can simply express exactly what should match, without relying on the ancestors. Goal (1) also forces us to check ancestor datatypes first, which can be inefficient. A new consideration is that (1) prevents us from translating our datatype checks into other validation systems, such as JSON Schema.

So I've been considering dropping design goal (1), and requiring that each datatype specifies exactly what should match. An example: for 'CURIE', instead of search(/:/), which relies on ancestors such as 'word' and 'trimmed_line', we would specify match(/\w+:\w+/) (or something similar), which is a complete specification. If this one regex matches, then we're done. If validation fails, then we make use of the hierarchy for useful error messages and suggestions, satisfying design goals (2) and (3). If desired, we can translate match(/\w+:\w+/) into JSON Schema "pattern": "^\\w+:\\w+$" (backslashes doubled for JSON string escaping). In that translation we lose the benefits of (2) and (3), but we can actually match the pattern, whereas search(/:/) did not carry enough information by itself.
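A rough sketch of how the proposed flow might look, under stated assumptions: the `ANCESTORS` list, its regular expressions, the fix functions, and the names `validate_curie` and `to_json_schema` are all hypothetical. Only the `\w+:\w+` pattern comes from the example above.

```python
import re

# Complete specification for the datatype itself, as in the example above.
CURIE = re.compile(r"\w+:\w+")

# Hypothetical ancestors, consulted only after the CURIE check fails.
# Each entry is (name, pattern, suggested fix); listed most general first.
ANCESTORS = [
    ("trimmed_text", re.compile(r"\S(.*\S)?", re.DOTALL), str.strip),
    ("nonspace", re.compile(r"\S*"), lambda value: "".join(value.split())),
]


def validate_curie(value):
    """Return (ok, failed_datatype, suggestion)."""
    # Fast path: one regex decides validity, with no ancestor checks.
    if CURIE.fullmatch(value):
        return (True, None, None)
    # Slow path: use the hierarchy only to explain the failure.
    for name, regex, fix in ANCESTORS:
        if not regex.fullmatch(value):
            suggestion = fix(value)
            # Only offer a suggestion that would itself be valid.
            if not CURIE.fullmatch(suggestion):
                suggestion = None
            return (False, name, suggestion)
    return (False, "CURIE", None)


def to_json_schema(regex):
    """Translate a match()-style regex into an anchored JSON Schema pattern."""
    return {"type": "string", "pattern": f"^{regex.pattern}$"}
```

Here `validate_curie(" foo:bar")` returns `(False, "trimmed_text", "foo:bar")`, while valid values never touch the ancestors. One caveat for the translation: JSON Schema patterns follow ECMA-262 regular expression syntax, where `\w` may not cover the same characters as in Python, so the translated pattern may not be exactly equivalent.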

Thoughts?