overture-stack / lectern

Data Schema / Dictionary management system
GNU Affero General Public License v3.0

Feature Request: Use of references inside codeList #173

Mandloi2309 closed 1 year ago

Mandloi2309 commented 1 year ago

Detailed Description

In the current implementation, references to regexes and codelists can be made within schemas. However, a reference to a variable or element cannot be made inside a codelist. For example, suppose we have two codelists with the values:

codelist_1: ['A', 'B', 'C', 'D'] and codelist_2: ['E', 'F', 'C', 'D'].

Here, a reference variable for the shared values C and D could be used so that they are defined once and reused in both codelists.

This feature provides reusability and makes a controlled vocabulary easier to maintain.

Possible Implementation

Look for the special character # inside the codelist elements and treat matching elements as references. This might lead to recursive references.

Suggestion: allow regexes inside the codelist, so that users can validate data against multiple regular expressions.

joneubank commented 1 year ago

As a proposed solution, we can support references being nested in other references (code lists, functions, regex etc.) by doing the reference replacement process iteratively. After replacing all references in a dictionary, we can then repeat the process of replacing references again so that any nested references are replaced, continuing to repeat this replacement process until there are no more replacements to be made.
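A minimal Python sketch of this repeat-until-stable idea follows. It is not Lectern's actual code: the `REFERENCE` pattern, the helper names, and the `max_passes` guard (added so that a cycle ends in an error instead of looping forever) are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a value is a reference when the whole string looks like "#/a/b".
REFERENCE = re.compile(r"^#(?:/[\w-]+)+$")


def lookup(references, ref):
    """Follow a path such as "#/regex/ID" through the references section."""
    node = references
    for key in ref[2:].split("/"):
        node = node[key]
    return node


def replace_once(value, references):
    """Run one replacement pass; return the new value and whether anything changed."""
    if isinstance(value, str) and REFERENCE.match(value):
        return lookup(references, value), True
    if isinstance(value, list):
        results = [replace_once(item, references) for item in value]
        return [new for new, _ in results], any(changed for _, changed in results)
    if isinstance(value, dict):
        results = {key: replace_once(item, references) for key, item in value.items()}
        return (
            {key: new for key, (new, _) in results.items()},
            any(changed for _, changed in results.values()),
        )
    return value, False


def replace_until_stable(value, references, max_passes=10):
    """Repeat the pass until no replacement is made, giving up after max_passes."""
    for _ in range(max_passes):
        value, changed = replace_once(value, references)
        if not changed:
            return value
    raise ValueError("references did not settle; the dictionary may contain a cycle")


# Made-up example: one reference points at another reference.
references = {"regex": {"ID": "#/regex/BASE", "BASE": "^[A-Z]+$"}}
field = {"name": "sample_id", "restrictions": {"regex": "#/regex/ID"}}
resolved = replace_until_stable(field, references)
```

Here the first pass turns `#/regex/ID` into `#/regex/BASE`, the second pass turns that into the literal pattern, and the third pass finds nothing left to replace.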

Happy to accept other solutions, just a proposal.

mauroz77 commented 1 year ago

I have a similar but slightly different proposal.

The current logic evaluates if the values of properties match the regexp that identifies references, basically targeting only properties whose value is a string (and restricting the '#' symbol to appear at the beginning of the string). I propose to instead assume that the reference (or references) can appear anywhere in the value (converted to string), and probably in multiple places, so the process should identify all the occurrences of the regexp in the text and replace them with the corresponding literal value from the references section.

This would allow us to keep handling simpler cases like "regex": "#/charset/regex/FREE_TEXT" but also embedded cases like the following, all in one go:

"restrictions": {
  "required": true,
  "codeList": [
    "female",
    "male",
    "other",
    "#/charset/restrictions/NOT_PROVIDED",
    "#/charset/restrictions/NOT_COLLECTED"
  ]
}
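As a rough illustration of "find every occurrence and replace it", here is a Python sketch. The pattern `REF_ANYWHERE`, the helper names, and the literal values stored under `NOT_PROVIDED`, `NOT_COLLECTED` and `FREE_TEXT` are invented for the example; the thread does not say what those references contain.

```python
import re

# Hypothetical pattern: matches a reference anywhere inside a string.
REF_ANYWHERE = re.compile(r"#(?:/[\w-]+)+")


def lookup(references, ref):
    """Follow a path such as "#/charset/regex/FREE_TEXT" through the references."""
    node = references
    for key in ref[2:].split("/"):
        node = node[key]
    return node


def substitute(text, references):
    """Replace every reference found in text with the string form of its value."""
    return REF_ANYWHERE.sub(lambda m: str(lookup(references, m.group(0))), text)


def replace_in(value, references):
    """Apply substitute() to every string inside nested lists and dicts."""
    if isinstance(value, str):
        return substitute(value, references)
    if isinstance(value, list):
        return [replace_in(item, references) for item in value]
    if isinstance(value, dict):
        return {key: replace_in(item, references) for key, item in value.items()}
    return value


# Assumed reference values, for illustration only.
references = {
    "charset": {
        "regex": {"FREE_TEXT": "^[a-z ]+$"},
        "restrictions": {
            "NOT_PROVIDED": "Not provided",
            "NOT_COLLECTED": "Not collected",
        },
    }
}
restrictions = {
    "required": True,
    "codeList": [
        "female",
        "male",
        "other",
        "#/charset/restrictions/NOT_PROVIDED",
        "#/charset/restrictions/NOT_COLLECTED",
    ],
}
resolved = replace_in(restrictions, references)
```

Note that this sketch turns every referenced value into a string, so it only behaves sensibly when the referenced value is itself a string.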

Now, if we want to have references inside the references section, we could use a similar approach, keeping in mind that the references in the references section must be replaced before the references in the schemas. In this case, the process involves two steps:

  1. Replace references inside the references section.
  2. Replace references in all meta and restriction sections on the schemas' fields.

What do you think?

joneubank commented 1 year ago

Thanks for the suggestion. I like the example you have given as it illustrates a weakness in my original proposal, and I want to spell it out clearly here so that we implement it correctly:

When a reference appears as an element of an array, as in a code list, we cannot simply perform string substitution in the schema. Instead, if the referenced value is itself an array, we need to concatenate its values into the enclosing code list.

Demonstrating with some example reference values:

{
  "references": {
    "enums": {
      "yesOrNo": ["Yes", "No"],
      "commonResponses": ["#/enums/yesOrNo", "Maybe"],
      "unknownResponse": "I Don't Know"
    }
  }
}

Then a schema could have the following fields using these references:

{
  "fields": [
    {
      "name": "question_1",
      "valueType": "string",
      "description": "Simple set of responses available",
      "restrictions": {
        "required": true,
        "codeList": "#/enums/yesOrNo"
      }
    },
    {
      "name": "question_2",
      "valueType": "string",
      "description": "More response options available",
      "restrictions": {
        "required": true,
        "codeList": [
          "Unique Response",
          "#/enums/commonResponses",
          "#/enums/unknownResponse"
        ]
      }
    }
  ]
}

Note that in question_1 the entire code list is a reference, while question_2 has references inside the code list. Both versions should work.

Applying your proposed logic, we would first replace the references inside the references section, resolving to:

{
  "references": {
    "enums": {
      "yesOrNo": ["Yes", "No"],
      "commonResponses": ["Yes", "No", "Maybe"],
      "unknownResponse": "I Don't Know"
    }
  }
}

Then these are used in the schema:

{
  "fields": [
    {
      "name": "question_1",
      "valueType": "string",
      "description": "Simple set of responses available",
      "restrictions": {
        "required": true,
        "codeList": ["Yes", "No"]
      }
    },
    {
      "name": "question_2",
      "valueType": "string",
      "description": "More response options available",
      "restrictions": {
        "required": true,
        "codeList": ["Unique Response", "Yes", "No", "Maybe", "I Don't Know"]
      }
    }
  ]
}
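The two-step replacement above, with array-valued references spliced into the enclosing code list, could be sketched in Python as follows. The function names are hypothetical, and a single pass over the references section is assumed to be enough, which holds for this example but not for deeper nesting.

```python
def is_ref(value):
    """Assumption: any string starting with "#/" is a reference."""
    return isinstance(value, str) and value.startswith("#/")


def lookup(references, ref):
    """Follow a path such as "#/enums/yesOrNo" through the references section."""
    node = references
    for key in ref[2:].split("/"):
        node = node[key]
    return node


def expand(value, references):
    """Replace references, splicing list-valued references into the enclosing list."""
    if is_ref(value):
        return lookup(references, value)
    if isinstance(value, list):
        out = []
        for item in value:
            resolved = expand(item, references)
            if is_ref(item) and isinstance(resolved, list):
                out.extend(resolved)  # concatenate instead of nesting a list
            else:
                out.append(resolved)
        return out
    if isinstance(value, dict):
        return {key: expand(item, references) for key, item in value.items()}
    return value


references = {
    "enums": {
        "yesOrNo": ["Yes", "No"],
        "commonResponses": ["#/enums/yesOrNo", "Maybe"],
        "unknownResponse": "I Don't Know",
    }
}
schema = {
    "fields": [
        {"name": "question_1", "restrictions": {"codeList": "#/enums/yesOrNo"}},
        {
            "name": "question_2",
            "restrictions": {
                "codeList": [
                    "Unique Response",
                    "#/enums/commonResponses",
                    "#/enums/unknownResponse",
                ]
            },
        },
    ]
}

# Step 1: replace references inside the references section.
resolved_references = expand(references, references)
# Step 2: replace references in the schema using the resolved references.
resolved_schema = expand(schema, resolved_references)
```

With the example values from this comment, `resolved_references` and `resolved_schema` match the resolved JSON shown above (the schema is trimmed to the fields that matter here).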

joneubank commented 1 year ago

The only open question after this is how to handle nested references within the references.

Here is an example with nested references. It should make clear that some additional work, beyond what is described above, is needed to ensure that all replacements get done.

{
  "references": {
    "enums": {
      "allResponses": ["#/enums/commonResponses", "#/enums/uncommonResponses"],
      "commonResponses": ["#/enums/yesOrNo", "Maybe"],
      "uncommonResponses": ["Sure", "Whatever"],
      "yesOrNo": ["Yes", "No"]
    }
  }
}

To resolve these references, we need to replace yesOrNo within commonResponses before replacing the commonResponses reference in allResponses. The result should be:

{
  "references": {
    "enums": {
      "allResponses": ["Yes", "No", "Maybe", "Sure", "Whatever"],
      "commonResponses": ["Yes", "No", "Maybe"],
      "uncommonResponses": ["Sure", "Whatever"],
      "yesOrNo": ["Yes", "No"]
    }
  }
}

There is a possible error state here: cyclical references. I do not expect us to resolve cyclical references; the dictionary parser should respond with an error in this case:

{
  "references": {
    "enums": {
      "listA": ["A", "#/enums/listB"],
      "listB": ["B", "#/enums/listA"]
    }
  }
}

Proposed solution

I would suggest a depth-first algorithm to ensure all replacements are performed in a logical order:

  1. For each reference, check for embedded references
  2. If there is an embedded reference, start a list of references we are replacing - this is the list of visited nodes that we will use to check for cycles. Add our original reference path as the first node name in the list, then attempt to resolve the embedded reference.
  3. If there are nested references inside this reference value, then repeat this loop, adding the reference to the list of references visited.
  4. For each nested reference, check if it is in our list of references we have previously visited. If so, this is a loop: return an error message indicating this dictionary cannot be parsed.
  5. When we find a reference with no embedded references to replace, then we can return the found value and perform the needed replacement. This is recursive, so continue this process until all nested references are replaced and the original value is parsed completely.
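The steps above can be sketched in Python as a recursive, depth-first resolver that carries the list of visited reference paths. This is only an illustration under assumptions: `CyclicReferenceError`, `resolve` and `resolve_references` are hypothetical names, and any string starting with `#/` is treated as a reference.

```python
class CyclicReferenceError(ValueError):
    """Raised when a reference leads back to one that is already being resolved."""


def is_ref(value):
    return isinstance(value, str) and value.startswith("#/")


def lookup(references, ref):
    """Follow a path such as "#/enums/yesOrNo" through the references section."""
    node = references
    for key in ref[2:].split("/"):
        node = node[key]
    return node


def resolve(value, references, visited=()):
    """Depth-first replacement; visited holds the reference paths on the current branch."""
    if is_ref(value):
        if value in visited:
            raise CyclicReferenceError(" -> ".join(visited + (value,)))
        return resolve(lookup(references, value), references, visited + (value,))
    if isinstance(value, list):
        out = []
        for item in value:
            resolved = resolve(item, references, visited)
            if is_ref(item) and isinstance(resolved, list):
                out.extend(resolved)  # splice list-valued references into the list
            else:
                out.append(resolved)
        return out
    if isinstance(value, dict):
        return {key: resolve(item, references, visited) for key, item in value.items()}
    return value


def resolve_references(references):
    """Resolve every entry, seeding visited with the entry's own path."""
    def walk(node, path):
        if isinstance(node, dict):
            return {key: walk(item, f"{path}/{key}") for key, item in node.items()}
        return resolve(node, references, (path,))
    return walk(references, "#")


nested = {
    "enums": {
        "allResponses": ["#/enums/commonResponses", "#/enums/uncommonResponses"],
        "commonResponses": ["#/enums/yesOrNo", "Maybe"],
        "uncommonResponses": ["Sure", "Whatever"],
        "yesOrNo": ["Yes", "No"],
    }
}
resolved = resolve_references(nested)

cyclic = {"enums": {"listA": ["A", "#/enums/listB"], "listB": ["B", "#/enums/listA"]}}
try:
    resolve_references(cyclic)
    cycle_detected = False
except CyclicReferenceError:
    cycle_detected = True
```

Because the visited list is passed down each branch rather than shared globally, the same reference can be used in several places without being mistaken for a cycle; only a path that returns to itself raises the error.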

joneubank commented 1 year ago

Just realised that we wrote all this in the lectern client, when these are concerns handled by the server itself. I'm going to look for a way to move the ticket to the other repo.

justincorrigible commented 1 year ago

monorepo all the things!?

joneubank commented 1 year ago

https://github.com/overture-stack/lectern/pull/183