mountetna / magma

Data server with friendly data loaders
GNU General Public License v2.0
5 stars 2 forks source link

Dictionary support #91

Closed graft closed 5 years ago

graft commented 5 years ago

This PR (built off #90) adds basic dictionary support to the backend. This is sparsely described in #62, so here is a summary of this addition.

1) The Magma::Validation class (previously Magma::Validator) was refactored to make space for the notion of multiple types of available validation. For example, the already-supported model#attribute validation, where attribute#match specifies how the field should be validated, was separated into a Magma::Validation::Attribute class, verifying the presence of an identifier token became Magma::Validation::Identifier. See specs/validation_spec.rb

2) The Magma::Model was extended to hold a Magma::Dictionary, set using Magma::Model#dictionary(dict_model, attribute_map), which allows Model A to be validated using the data in Model D. See specs/labors/models/aspect.rb

3) A Magma::Validation::Dictionary class was created, which uses the Magma::Dictionary from (2) to validate a document against the data from the dictionary model, i.e., for each attribute in the document there is a single matching entry in the dictionary approving this document. See specs/validation_spec.rb

4) Notably, if the dictionary class contains a 'match' attribute, the dictionary will expect the format { type, value }, where 'type' can be String, Numeric, Array, Range, Regexp, and value is JSON string, number or array (e.g. { 'type': 'Range', 'value': [ 200, 300 ] }). This allows a wide range of validations to be performed using only data. To aid in this type a new 'match' attribute was created. See specs/validation_spec.rb, lib/magma/dictionary.rb, spec/labors/models/codex.rb, lib/magma/attributes/match.rb

With this in place we can convert much of our attribute-level validation to using dictionaries:

class Ipi::Sample < Magma::Model
  dictionary Ipi::TumorType, tumor_type: :tumor_type
end
class Ipi::TumorType < Magma::Model
  attribute :tumor_type, type: :json
end

Then, Ipi::TumorType contains an entry like so: { 'tumor_type': { 'type': 'Array', 'value': [ 'tumor', 'normal', 'metastatic' ] } and a new Sample document can be validated against this Array, rather than against the attribute#match defined on the model. This allows us to change the validation simply by making an update to the dictionary record, rather than having to update the project model/sample.rb, deploying, etc.

Notable outstanding issues not resolved by this pull request.

dabrau commented 5 years ago

Looks good. I think it will work well with range and regex validation. My only concern is using the dictionary to encourage denormalizing the database. If certain attributes belong together and can be enumerated I think sometimes it would make more sense to use a foreign key constraint.

e.g. If the only aspects a monster can have is:

name, monster, source, value hide, lion, bullfinch, fur hide, lion, graves, leather hide, hydra, bullfinch, scales hide, hydra, graves, scales mass_in_stones, lion, bullfinch, 20000 mass_in_stones, lion, bullfinch, 120 mass_in_stones, hydra, bullfinch, 1000 mass_in_stones, hydra, graves, 1000

We can use a join table to reduce the overhead of redundant data with: monster_id, aspect_id

graft commented 5 years ago

I think I agree with your caveat about the dictionary encouraging denormalization - a dictionary implies a model => model relationship, which encourages flat data structures. Mostly this is by design, as I wanted to encourage flattening data validated this way to a few (<10) columns so searching is simpler - no one likes searching across joins.

However, not everything should be flattened. But trying to validate across a parent -> child relationship (or parent -> child -> child) adds quite a bit of complexity to the dictionary, I think.

In the specs example, Labors::Aspect has three attributes, :name, :source, :value, and parent :monster. The dictionary, however, validates an input record, where { monster: some-monster-identifier }. This is the extent to which the dictionary can reach across associations, but you could write a dictionary matching monster to /(lion|hydra)/ etc.

Thinking of a more complex example, we might want to have a dictionary that validates a patient's assay record based on the patient's cohort (i.e., cohort -> patient -> assay). The dictionary is defined on Assay but could reference data from it's parent's parent, which would not be present in the validated record. This seems hard to design and execute. Still, something to think about for the future.