the-human-colossus-foundation / oca-spec

Overlay Capture Architecture Specification
European Union Public License 1.2

Lack of support for ranges #44

Open blelump opened 9 months ago

blelump commented 9 months ago

Problem overview

A range, unlike a set (which is an exhaustive list of values), usually has a start and an end but does not enumerate every value in between. Examples are numbers in a range or dates in a range. A first guess might be to mimic a range with an entry overlay; however, that becomes inefficient when the list of values is large. Furthermore, for a range of dates it is nearly impossible to create entries for every date once time is included.

OCA currently doesn't address ranges at all. How does the DSWG propose to approach ranges?

carlyh-micb commented 9 months ago

At ADC we've addressed ranges using format overlays and RegEx. [image]

blelump commented 9 months ago

While I appreciate this type of creativity, using the format overlay to solve ranges was outside the author's intent. We should indeed have included ranges properly in v1, and now is the time to improve. The format overlay originates in formatting attributes rather than narrowing their values. With regexes in a format overlay, we cannot capture the proper context of the value we are after, namely that it lies within a given range. A regex may enforce it, but semantically we want a concept like the entry overlay, where we say: this attribute takes its value from this range or this set. A value from a range has the same meaning as a value from a set (entry overlay); both are constrained entries, just represented differently.

pknowl commented 9 months ago

A Range overlay should be a new overlay type. The DSWG will write it up as a new RFC.

carlyh-micb commented 4 months ago

0006 - OCA Range overlay

Summary

An overlay for documenting the expected range of values for an attribute.

Motivation

When collecting data from users, implementing a range overlay ensures that the input falls within a specified boundary or interval. This verification helps users understand the acceptable limits of the data they need to provide, reducing the likelihood of outliers or invalid entries. By clearly defining these ranges, data integrity is maintained, and overall data quality is improved.

Tutorial

Intervals are specified with two endpoints. A closed interval includes its endpoints and is denoted with square brackets. An open interval excludes both of its endpoints and is denoted with curved brackets. Open and closed notation can be mixed in a single interval.

[0-1] - a closed interval for numbers between zero and one, where zero and one are included in the interval.
[0-100) - a mixed interval for numbers between zero and one hundred, where zero is included but one hundred is not.
(10-90) - an open interval for numbers between ten and ninety, where both values are excluded from the interval.

To extend a range to positive or negative infinity, leave that endpoint unspecified. An infinite endpoint always uses the curved bracket.

(-9) - a range from negative infinity up to but not including nine.
[0-) - a range from zero inclusive to infinity.

Intervals only apply to Numeric and DateTime datatypes.
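The bracket notation above could be read by a small parser such as the following sketch. It is not part of the RFC: the `Interval` class, the `parse_interval` function and the regular expression are hypothetical names introduced for illustration. It assumes non-negative numeric endpoints only, because the draft does not say how a minus sign is distinguished from the `-` separator, and it does not cover DateTime endpoints.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical grammar: bracket, optional non-negative number, "-",
# optional non-negative number, bracket. Negative endpoints are not
# handled (assumption, see the paragraph above).
_INTERVAL = re.compile(r"^([\[(])(\d+(?:\.\d+)?)?-(\d+(?:\.\d+)?)?([\])])$")


@dataclass(frozen=True)
class Interval:
    low: float
    high: float
    low_closed: bool
    high_closed: bool

    def contains(self, value: float) -> bool:
        """Return True if value lies inside the interval."""
        above = value >= self.low if self.low_closed else value > self.low
        below = value <= self.high if self.high_closed else value < self.high
        return above and below


def parse_interval(text: str) -> Interval:
    """Parse notation such as "[0-100)" or "(-9)" into an Interval."""
    match = _INTERVAL.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid interval: {text!r}")
    opener, low, high, closer = match.groups()
    # An unspecified endpoint means infinity and must use a curved bracket.
    if low is None and opener == "[":
        raise ValueError("an infinite lower bound must use '('")
    if high is None and closer == "]":
        raise ValueError("an infinite upper bound must use ')'")
    return Interval(
        low=float(low) if low is not None else -math.inf,
        high=float(high) if high is not None else math.inf,
        low_closed=opener == "[",
        high_closed=closer == "]",
    )


print(parse_interval("[0-100)").contains(100))  # False
print(parse_interval("(-9)").contains(-1000))   # True
```

Keeping the open/closed flags separate from the numeric bounds lets one `contains` method cover closed, open, mixed and half-infinite intervals.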

Schema example

The following code describes an example schema that the example overlays reference.

Capture base:

{
  "type": "spec/capture_base/1.0",
  "digest": "Etszl9LgLUjllI950rd2lO6rF5-BP_jGzXGBPkFZCZFA",
  "classification": "RDF106",
  "attributes": {
    "Albumin_concentration": "Numeric",
    "Glucose_concentration": "Numeric",
    "Sample_name": "Text",
    "Sample_type": "Text"
  },
  "flagged_attributes": []
}

Example of a range overlay.

There can be only one range overlay in a schema. Each attribute can have one and only one range specified. Only Numeric, Array(Numeric), DateTime, and Array(DateTime) datatypes can have ranges.

{
  "capture_base": "Etszl9LgLUjllI950rd2lO6rF5-BP_jGzXGBPkFZCZFA",
  "digest": "XXXX",
  "type": "spec/overlays/range/1.0",
  "range": {
    "Albumin_concentration": "(0-)",
    "Glucose_concentration": "(0-1000)"
  }
}
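As an illustration of how a consumer might apply such an overlay, the sketch below checks one record against it. It assumes each range is serialized as a JSON string in the Tutorial's bracket notation with non-negative numeric endpoints. The helper names `in_range` and `out_of_range` and the sample record are hypothetical and not part of the RFC.

```python
import json
import math
import re

# Assumption: ranges are JSON strings in bracket notation with
# non-negative numeric endpoints; an omitted endpoint means infinity.
_INTERVAL = re.compile(r"^([\[(])(\d+(?:\.\d+)?)?-(\d+(?:\.\d+)?)?([\])])$")

OVERLAY_JSON = """
{
  "capture_base": "Etszl9LgLUjllI950rd2lO6rF5-BP_jGzXGBPkFZCZFA",
  "digest": "XXXX",
  "type": "spec/overlays/range/1.0",
  "range": {
    "Albumin_concentration": "(0-)",
    "Glucose_concentration": "(0-1000)"
  }
}
"""


def in_range(notation: str, value: float) -> bool:
    """Return True if value lies inside the interval written as notation."""
    match = _INTERVAL.match(notation)
    if match is None:
        raise ValueError(f"not a valid interval: {notation!r}")
    opener, low, high, closer = match.groups()
    low_bound = float(low) if low is not None else -math.inf
    high_bound = float(high) if high is not None else math.inf
    above = value >= low_bound if opener == "[" else value > low_bound
    below = value <= high_bound if closer == "]" else value < high_bound
    return above and below


def out_of_range(overlay: dict, record: dict) -> list[str]:
    """List the attributes in record whose values violate the overlay."""
    return [
        name
        for name, notation in overlay["range"].items()
        if name in record and not in_range(notation, record[name])
    ]


overlay = json.loads(OVERLAY_JSON)
record = {
    "Albumin_concentration": 4.2,
    "Glucose_concentration": 1000,
    "Sample_type": "serum",
}
print(out_of_range(overlay, record))  # ['Glucose_concentration']
```

The glucose value is rejected because `(0-1000)` is open at both ends, so 1000 itself falls outside the range.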

Reference

Bracket notation adapted from Wikipedia: https://en.wikipedia.org/wiki/Interval_(mathematics)#Notations_for_intervals.

Drawbacks

Creators will need to check that their range endpoints pass data verification according to the schema rules (e.g. if a date such as 2024-04-09 is used as a range endpoint, it must pass any data verification rules, such as those in the format overlay).
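A check of this kind might look like the following sketch. It uses a Python `strptime` pattern as a stand-in for whatever date format the schema's format overlay declares; translating an OCA format string into such a pattern is not shown, and the function name `endpoint_matches_format` is hypothetical.

```python
from datetime import datetime


def endpoint_matches_format(endpoint: str, strptime_format: str) -> bool:
    """Return True if the endpoint string parses under the given pattern."""
    try:
        datetime.strptime(endpoint, strptime_format)
    except ValueError:
        return False
    return True


# "%Y-%m-%d" is an assumed stand-in for the schema's declared date format.
print(endpoint_matches_format("2024-04-09", "%Y-%m-%d"))  # True
print(endpoint_matches_format("09/04/2024", "%Y-%m-%d"))  # False
```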

Rationale and Alternatives

There is currently no way to provide ranges that can be interpreted by a machine. As an alternative, a range could be described in the information overlay.

Prior Art

Unresolved questions

Implementations

blelump commented 3 months ago

A recent business requirement uncovered absolute and relative ranges:

How does the DSWG perceive this type of range?

carlyh-micb commented 3 months ago

We'll have to discuss. My first impression is that these are two overlays, absolute ranges and relative ranges, and that you can have both at the same time. The relative range will require significantly more work.