Referencing and inheritance

jeroen commented 11 years ago

Does Concordia support labeling type definitions and referencing them throughout the schema? For example, if an element of type 'prompt_response' appears in many places in my data, do I need to redefine it at every occurrence in the schema? Or is there a way to have a global definition of 'prompt_response' and reference it throughout the schema?

If we do support this, we should probably also consider defining some simple inheritance, e.g. with an 'extends' property. That way you can easily define, e.g., prompt_singlechoice extends ohmage_prompt, and then refer to ohmage_prompt in the schema so that the field can contain prompt_singlechoice, prompt_multichoice, prompt_numeric, etc.

jojenki commented 11 years ago

For now, I am going to punt on this, and assign it to a future release. If it is decided that this is a worthwhile feature, it can be added without impacting existing definitions.

Pros:

Allows for complex schemas to be subdivided and worked on independently; however, this could also be achieved by an external tool.
If a code-generation tool is ever designed, which is part of the future plan, labeling portions of the schema would allow for a more robust, object-oriented design of the auto-generated classes.

Cons:

Makes it more difficult to have self-contained schemas.
Gives a bad code smell. Data may be repeated, but its format should not be, except in arrays, where the format is still defined only once. Why would the format of the data be exactly the same in two places? This could lead to people abusing an existing schema because they didn't want to create their own, even though the two represent different data that merely happen to be formatted the same way.

jeroen commented 11 years ago

Could you explain how to define a schema that does something like "an array (of arbitrary length) of prompt-types" if you cannot reference the schema for prompt-type? E.g., when I use survey_response.read with column_based output and prompt_id_list = "urn:ohmage:special:all"? Or similarly, for the json-rows output, how is the schema going to describe 'n records (rows)' when there is no reference to a schema for 'row'?

Does the schema need to re-define the concept of a 'row' for every row in the data? And are we going to need a different schema if the output contains 24 rows than when it contains 25 rows? Do you see the schema as a general description of the API I/O, or will the schema be re-generated for every output, describing the "current" data?

jojenki commented 11 years ago

First, the second part. There are two use-cases. The first is when you already know how data will be formatted and want to publish that definition. In this case, the definition will be static for each version, at least; this is generally the case when data is being read from an API, for example. The second is when the data is arbitrarily formatted, but you want to be able to define the data as-is for any consumer. In this case, the definition will be dynamic for each piece of data; this may be the case when data is being uploaded to an API, for example. So, for things like read APIs, the definitions would be static.

Now, the first part. Because Concordia doesn't allow multiple types for a single key, you would need to define unique keys for each type of data. This avoids the "problem" where the value of one key defines the type of another. So, if you have a prompt type that would return a number value, then a key, "numeric_value", could be added to the list of keys to handle that value. Likewise, if another prompt would return a string value, then a key, "string_value", could be added to the list of keys to handle that value. However, a given response may not include both values. By making both keys "optional", the appropriate key/value pair could be supplied and the other omitted. For example:

{
    "doc":"The root of this data set, which may belong to an encompassing definition, will be an array of prompt responses.",
    "type":"array",
    "schema":{
        "doc":"Each prompt response is an object with the following key/value pairs.",
        "type":"object",
        "schema":[
            {
                "doc":"The first key could be, for example, the prompt that it belongs to, which is a string.",
                "name":"prompt_id",
                "type":"string"
            },
            {
                "doc":"The second key could be, for example, the prompt type, which doesn't necessarily have any bearing on the data type.",
                "name":"prompt_type",
                "type":"string"
            },
            {
                "doc":"Now, add the number-valued key.",
                "name":"numeric_value",
                "type":"number",
                "optional":true
            },
            {
                "doc":"Now, add the string-valued key.",
                "name":"string_value",
                "type":"string",
                "optional":true
            }
        ]
    }
}

Now, all of the following data points would be valid for this schema:

[
    {
        "prompt_id":"age",
        "prompt_type":"number",
        "numeric_value":26
    },
    {
        "prompt_id":"first_name",
        "prompt_type":"text",
        "string_value":"John"
    },
    {
        "prompt_id":"favorite_color",
        "prompt_type":"single_choice",
        "numeric_value":3,
        "string_value":"Green"
    }
]

This is an example that was developed on the fly, but it could be far more involved and complex. The definition may be large and include many possible combinations, but it leads to a design that will always be very well defined and won't lead to ambiguity and unexpected changes when reading the data.
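
To make the point about array length concrete, here is a minimal Python sketch of how a checker might walk the schema above. It is not Concordia's actual API: the `conforms` helper is hypothetical, it only handles the "type", "schema", "name" and "optional" keys used in the example, and it assumes that keys not listed in the schema are simply ignored.

```python
# The schema from the example above, with the "doc" strings omitted for brevity.
SCHEMA = {
    "type": "array",
    "schema": {
        "type": "object",
        "schema": [
            {"name": "prompt_id", "type": "string"},
            {"name": "prompt_type", "type": "string"},
            {"name": "numeric_value", "type": "number", "optional": True},
            {"name": "string_value", "type": "string", "optional": True},
        ],
    },
}


def conforms(schema, value):
    """Return True if value matches schema (hypothetical helper, not Concordia's API)."""
    kind = schema["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "array":
        # The element format is defined once and applied to every element,
        # so the same schema covers 24 rows as well as 25.
        return isinstance(value, list) and all(
            conforms(schema["schema"], item) for item in value
        )
    if kind == "object":
        if not isinstance(value, dict):
            return False
        for field in schema["schema"]:
            if field["name"] not in value:
                # A missing key is only acceptable when it is marked optional.
                if not field.get("optional", False):
                    return False
                continue
            if not conforms(field, value[field["name"]]):
                return False
        return True
    return False


ROWS = [
    {"prompt_id": "age", "prompt_type": "number", "numeric_value": 26},
    {"prompt_id": "first_name", "prompt_type": "text", "string_value": "John"},
    {
        "prompt_id": "favorite_color",
        "prompt_type": "single_choice",
        "numeric_value": 3,
        "string_value": "Green",
    },
]
```

Under these assumptions, `conforms(SCHEMA, ROWS)` and `conforms(SCHEMA, ROWS * 8)` both hold, while a row that lacks "prompt_type" or carries a string in "numeric_value" is rejected.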

Thoughts?

jshslsky commented 11 years ago

For Open mHealth we have some fairly well-defined use cases for payload_id references.

E.g., omh:prescription could contain a reference to omh:drug

We can look at adopting or refining the JSON Schema approach.
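
For reference, JSON Schema expresses this kind of reuse with "$ref", which points at a definition elsewhere in the document. The Python sketch below is only an illustration: the property names ("drug", "dose_mg", "name") are invented rather than taken from any Open mHealth payload, and `resolve_local_ref` is a hypothetical helper that handles same-document references only.

```python
# Hypothetical JSON Schema document; the property names are illustrative only.
PRESCRIPTION_SCHEMA = {
    "definitions": {
        "drug": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        }
    },
    "type": "object",
    "properties": {
        # The drug format is defined once above and referenced here.
        "drug": {"$ref": "#/definitions/drug"},
        "dose_mg": {"type": "number"},
    },
    "required": ["drug"],
}


def resolve_local_ref(root, ref):
    """Follow a same-document reference such as '#/definitions/drug'."""
    if not ref.startswith("#/"):
        raise ValueError("only same-document references are handled in this sketch")
    node = root
    for token in ref[2:].split("/"):
        # Undo JSON Pointer escaping: ~1 stands for '/', ~0 stands for '~'.
        node = node[token.replace("~1", "/").replace("~0", "~")]
    return node


drug_schema = resolve_local_ref(
    PRESCRIPTION_SCHEMA, PRESCRIPTION_SCHEMA["properties"]["drug"]["$ref"]
)
```

Here `drug_schema` is the same object as the entry under "definitions", so any number of other properties could reference it without repeating the format.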

jojenki / Concordia

Referencing and inheritance #2