instedd / cdx

Connected Diagnostics Platform
https://cdx.io
9 stars 7 forks source link

Define global set of fields for CDX instance #159

Closed spalladino closed 8 years ago

spalladino commented 9 years ago

Expand the searchable fields YAML spec to also cover non-indexed fields. All manifests should be validated against this spec, and the fields schema for NNDD (and the default manifest) should be loaded from this file.

spalladino commented 9 years ago

The current searchable fields config has the following attributes:

Searchable Field:
  Name: field name in elasticsearch
  Type: string, integer, float, date, location
  Index: analyzed vs not analyzed
  Filters:
    Names: parameter name in query (two parameters for ranges, plus default if eq is supported)
    Type: match, range, wildcard (wildcard can be a performance issue)
  Grouping:
    Name: parameter name in query
    Type: range, date, flat, kind, nested
  Validations:
    Options

Whereas the manifest has the following ones:

Manifest:
  Field name
  Type: ‘integer’, ‘long’, ‘float’, ‘double’, ‘date’, ‘enum’, ‘location’, ‘boolean’ or ‘string’
  Section: event, sample, patient
  Flags:
    Core
    Indexed
    Pii
  Validations:
    Options
    Range
    Date format
  Source:
    Operations

The fields definition could be enhanced with further info from the manifest, such as validations, flags, and additional types, to be able to:

Field:
  Name: canonical field name
  Store: name in storage engine
  Type: as in manifest + *nested* (array), *object* (single)
  Section: event, sample patient
  Flags: core (all true), indexed, pii + *reserved*, *full-text-searchable*
  Validations: options, range, date format
  Filters: as in searchable fields
  Grouping: as in searchable fields
spalladino commented 9 years ago

@nekron thoughts?

macoca commented 9 years ago

I like it. I'm a bit puzzled by the implications of the different flags, they doesn't sound bad, but their meaning doesn't seem to be that obvious. They shall be supported with a vast documentation ;).

The validations were something that we've already introduced, in some part, into the field definitions. But we removed them (not yet everywhere) because it's impossible to know in advance all the possible values of a given field, e.g. the condition field. It's a tricky one, and I'm still not fully convinced of any of the decisions we made.

Let's discuss the possible scenarios in the blackboard.

macoca commented 9 years ago

But maybe we can know all the possible values of the condition field, I mean, if you want to support a different kind of condition, it's just a matter of adding it to the list.

When we decided to remove the validations, we were thinking that, for a new device to plug into cdx, the only thing that a manufacturer should do is to write a manifest. The platform will then compile the different manifests and build the system specification and schema dynamically only by extracting information from them. So the resulting system will be a sort of pluggable ecosystem depending on the manifests that were included in that instance. Elegant, wonderful and mostly utopic. For this to work, the manufacturer must know a lot of internal details of the cdx implementation, because the manifest specification is tightly coupled to them.

But for a new condition type to be reported in a cdx-p instance, maybe we can give us the luxury of having to write it in the configuration file of that instance. Or maybe we can have some kind of conditions crud, I don't know.

What I know is that no new condition comes without its particular exceptions to the rules. So it's very likely that we will have to modify some part of the code to support that new condition, and having to add it to the list of 'known-to-work' ones, doesn't sound that bad.

spalladino commented 9 years ago

Keeping the set of conditions restricted is a good way of avoiding the situation where a manufacturer reports "MTB" while another reports "Mycobacterium Tuberculosis". Also, the merge of fields is good as long as everyone agrees on their usage, type and restrictions. I think that restricting some flexibility is good for having a more unified data storage.

I think it would be good for @nditada to join the discussion as well.

spalladino commented 9 years ago

A possible design for keeping a unified set of core fields, plus custom fields per manufacturer, could be to keep the following hierarchy:

  1. A core global set of fields, defined by a CDX instance admin, including fields like sample_uid, results, timestamp, etc
  2. An extended core set of fields per assay (or condition), defined by a CDX instance admin, such as HIV for MTB
  3. An event type with custom fields defined implicitly by a manufacturer via its manifests
  4. A manifest per device type, defined explicitly by manufacturers, which defines how the fields in the previous instances are populated, and implicitly defines fields for (3)

The manufacturer event type (item 3) can be defined by specifying an event_type attribute in the manifest. This would impose the restriction that all custom fields shared between manifests with the same event_type must have the same type. This would allow a manufacturer to query on these fields when specifying the event_type as a context.

This separation would allow any client to query on:

  1. All events, using the global set of fields
  2. All events of a condition, using the core set of fields per assay
  3. All events of an event_type defined by a manufacturer, using the global set of fields plus the custom fields defined by that manufacturer
  4. All events of an event_type within a condition, using all the custom fields defined in the event_type, the core fields for the condition, and the global core fields
macoca commented 9 years ago

.5. All events for a given device model, and therefore that uses the same manifest, can be queried by any of the custom indexed fields specified in that manifest.

spalladino commented 9 years ago

Regarding (5), as that manifest defines an event_type, then the queries would be on the event_type itself, as its defined fields are the union of all the custom fields of its manifests.

macoca commented 9 years ago

Ok. For (4) I was assuming intersection, not union. Therefore, (5) would've been for the remaining fields. But union is ok too I think.

spalladino commented 9 years ago

I was thinking about a union, but validating that the intersection has the same configuration (type, event/sample/patient, indexed, etc)