overture-stack / lectern

Data Schema / Dictionary management system
GNU Affero General Public License v3.0
0 stars 1 forks source link

Feature Request: Unique restriction on fields within schema #175

Closed joneubank closed 1 year ago

joneubank commented 1 year ago

Add a field restriction that will declare each value of the field must be unique.

Detailed Description

Add a field level restriction that specifies all values of this field must be unique. This specifies that when a data-set is validated against this schema that each record will be checked to ensure the same value does not occur for this field more than once within the data set.

Potential Risks

Lectern client implementations will need to be provided the entire data-set to enforce uniqueness. In applications where data is validated in chunks and added to a larger datbase, it might require custom implementations to read the entire data-set from the database to ensure the restriction is met. Still, marking this requirement on the schema so that clients can enforce this rule is a useful.

Possible Implementation

Add to the meta schema a new restriction on a schema field called unique that will have a boolean value true or false, similar to required.

A valid schema matching this meta-schema would look like:

{
  "name": "schema_name",
  "description": "Example Schema with a unique ID field",
  "fields": [
    {
      "name": "example_id",
      "valueType": "string",
      "description": "Unique identifier for schema",
      "restrictions": {
        "required": true,
        "regex": "^[A-Z]{4,10}$",
        "unique": true // <====== new restriction declaring unique values are required
      }
    }
  ]
}

A note on missing values

It is possible for a field with unique restriction to not have the required restriction. If this occurs, then a missing value is treated as one unique value for that field: only one record can have a missing value.

joneubank commented 1 year ago

I am considering changing this requirement to allow different behaviour for missing values. It might be desirable to only enforce uniqueness on records that have a value in that field, allowing any number of missing values.

In this case we should change the restriction from true/false to having modes:

Alternate 1

"unique": "all" - no value more than once, only one missing value allowed "unique": "values" - all provided values must be unique, any number of missing values are allowed if unique restriction is not provided then no uniqueness enforced

Alternate 2

"unique": true - all values unique, only one missing value allowed "uniqueOrMissing": true - all provided values must be unique, any number of missing values are allowed

If both restrictions are set the more strict unique restriction will be enforced.

joneubank commented 1 year ago

Last edit about missing values:

Alternate 3

It seems most reasonable to not enforce uniqueness on missing values. If the field is not required, and a value is not provided, then the uniqueness restriction is not considered. Same way code_list and regex would be handled.

This gives the original proposal (always strict), and three alternatives to consider. Will confirm implementation in a comment after team review.

joneubank commented 1 year ago

After discussion it is agreed that the standard approach to monitoring uniqueness is that only fields with values will be considered when applying the uniqueness restriction. This means that any number of records with undefined values are allowed, but if the field has a value then it must be unique. To ensure every record is unique, the field should also have the required restriction set to true.

This matches Alternative 3.