graft opened this issue 5 years ago
Let's describe how we might go through one of the changes I would like to make: normalizing the "Flow" record out of the Sample model in the ipi project.
In the previous era we could write a migration with code to make whatever transformations we wanted; now we cannot do that, and must instead proceed through atomic operations.
The basic problem is that the Sample model has five fields for each of its five Flow Cytometry stains, one for each value associated with the stain (e.g., "treg_file", "treg_notes", "treg_flag", and so on). This is terrible and should be normalized away. What we would like instead is this:
class Ipi::Sample < Magma::Model
  collection :flow
end

class Ipi::Flow < Magma::Model
  parent :sample
  identifier :stain_name
  attribute :notes
  attribute :flag
  attribute :stain
  file :fcs
end
The first thing we need to do is add the new model. In order to do this we have to join it to the graph, and it should have an identifier. So the operation must be:
add_model(new_model, parent_model, link_type)
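For our example the call might look something like this; the literal arguments are illustrative only:

# hypothetical: join the new Flow model to the graph under Sample as a collection
add_model(:flow, :sample, :collection)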
Then we can fill in the model with attributes:
add_attribute(model, attribute_name, attribute_options)
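Again purely as a sketch, filling in the Flow model from the example above (the options hash is a guess at what attribute_options might carry):

# hypothetical attribute options
add_attribute(:flow, :notes, type: String)
add_attribute(:flow, :flag,  type: String)
add_attribute(:flow, :stain, type: String)
add_attribute(:flow, :fcs,   type: :file)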
We can't remove the old attributes yet, as this would destroy the old data, which would be very bad. If the user tried, this method:
remove_attribute(model, attribute_name)
could return a warning that the attribute contains N data items, and maybe require an approval hash to be returned as confirmation before proceeding.
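So a premature removal attempt might go something like this (the response shape is a sketch, not a settled format):

remove_attribute(:sample, :treg_file)
# => { warning: "Attribute :treg_file still contains N data items.",
#      confirmation: "ab12cd..." }   # resend with this hash to force the removal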
So far we've only created new data structures; how do we actually fill them with data? Currently the data we want already exists in the data graph, but it needs to be reshaped in order to move into its new position. Our alternatives are:
1) Provide API mechanisms for copying data from one part of the graph to another.
2) Leave the problem up to the user, who can use /retrieve, /query, /update and /load to move/copy the data.
The approach in (1) is probably extremely difficult - the operations might be complex (e.g., in the above example, I need to generate new identifiers) and not easy to define via a simple API. In general (2) seems like a better approach. The main problem is that data that is already in the system has to be copied out and back in. For many data items this is probably not too big a deal, as they are just text. For files, which could be gigabytes in size, this is probably less than ideal; a waste of bandwidth and a difficult problem to organize.
Therefore some API (probably /update) should provide facilities for transferring files from one model/attribute to another model/attribute.
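To make (2) concrete, the reshaping for the Flow example might be scripted against the existing endpoints roughly as follows; the host, payload shapes and stain list here are placeholders, not the real Magma wire format:

require 'json'
require 'net/http'

# Pull every Sample with its flat stain fields...
response = Net::HTTP.post_form(
  URI('https://magma.example.org/retrieve'),
  'project_name' => 'ipi', 'model_name' => 'sample',
  'record_names' => 'all', 'attribute_names' => 'all'
)
samples = JSON.parse(response.body)['models']['sample']['documents']

# ...and reshape them into Flow revisions, one per sample and stain.
revisions = {}
stains = %w[treg]   # plus the other four stains
samples.each do |sample_name, doc|
  stains.each do |stain|
    revisions["#{sample_name}.#{stain}"] = {
      'sample' => sample_name,
      'stain'  => stain,
      'notes'  => doc["#{stain}_notes"],
      'flag'   => doc["#{stain}_flag"]
      # the fcs file itself should be transferred server-side, not re-uploaded
    }
  end
end

Net::HTTP.post_form(
  URI('https://magma.example.org/update'),
  'project_name' => 'ipi', 'revisions' => JSON.dump('flow' => revisions)
)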
The first step to using a migration API is that the models must be mutable. Currently they are instantiated in code (.rb files). Instead, we want them to come from database records.
Let's begin with three basic models:
project: :short_name, has_many :models
model: :model_name, :identifier, :parent, :dictionary, has_many :attributes
attribute: :attribute_name, :type, :attribute_class, :description, :label, :match, :restricted, :format_hint, :read_only, :hidden
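A minimal sketch of what these tables might look like as a Sequel migration; the table names, column types and constraints are guesses based on the fields above:

Sequel.migration do
  change do
    create_table(:projects) do
      primary_key :id
      String :short_name, null: false, unique: true
    end

    create_table(:models) do
      primary_key :id
      foreign_key :project_id, :projects
      String :model_name
      String :identifier
      String :parent
      String :dictionary
    end

    create_table(:attributes) do
      primary_key :id
      foreign_key :model_id, :models
      String :attribute_name
      String :type
      String :attribute_class
      String :description
      String :label
      String :match
      TrueClass :restricted
      String :format_hint
      TrueClass :read_only
      TrueClass :hidden
    end
  end
end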
When Magma boots up, it uses the project, model and attribute tables to instantiate Magma::Models for each of the project models, e.g.:
def load_models
  Magma::Project.all.each do |project|
    project.models.each do |model|
      # 'class' is a reserved word in Ruby, so build the new model class into a local
      klass = Class.new(Magma::Model)
      model.attributes.each do |att|
        klass.attribute(att.attribute_name, ...)
      end
      klass.identifier(model.identifier)
      klass.parent(model.parent)
      klass.dictionary(model.dictionary)
      # presumably the class also gets assigned to a constant, e.g. Ipi::Patient
    end
  end
end
This will load up Ipi::Patient, etc., usable as usual within the console or wherever.
In order to complete this transition we must move from the first of the following states to the second:
Models and attributes are defined in Ruby code. Addition of attributes or models is accompanied by hand-written migrations.
Models and attributes are defined in database tables. Addition of attributes or models to this table is accompanied by automatic migrations.
So there are four necessary things that must happen, in some order:
1) Abandon the Ruby code that defines the models.
2) Abandon hand-written migrations.
3) Load models and attributes from database tables.
4) Generate migrations automatically.
Currently we have accomplished some of (3), without having done (1) at all. It seems clear that fully completing (3) will determine (1), i.e., if we are capable of loading models and attributes from the database we no longer have any use for the ruby code defining the models. Similarly completing (4) will fully determine (2), i.e., if we are using auto-migrations there is no need to use hand-written ones.
It also seems possible that we can accomplish the 1 => 3 transition without having to touch the 2 => 4 transition; since the "plan" command computes migrations from the difference between the model and the database schema, it should not matter whether the model comes from Ruby code or a database table. Therefore, we may abandon the use of the project-specific Ruby models (e.g. magma-ipi/models/sample.rb), but continue to use the "plan" command to generate "hand-written" migrations as before (only now, following update of the attribute table). It's also probable this isn't even an issue, as the single production project we need to maintain (ipi) rarely needs updating.
Therefore we can finish accomplishing (1 => 3) by fully abandoning the ruby classes next, and then work on (2 => 4) by replacing "plan" with automatic migrations.
Currently Magma models are defined in project repos, which also include a set of matching database migrations. This means in order to change a model in production (to add an attribute, say), we must open a pull request, review, merge, stage, and deploy. Needless to say this is a lot of process for what might be a very trivial change; this process will surely not scale well to dozens of projects, each of which might require dozens of changes in order to keep them up-to-date.
Instead we'd like something more instantaneous - to be able to amend models directly in production, without having to edit a bunch of code. It would also be nice to be able to create a project and define a set of models without having to create a new GitHub repo, etc. But if we continue to use a relational database (which it seems clear we should, since our data is highly relational), we're forced to write database migrations, since changes to the models must somehow propagate to the database itself.
Following #99, however, we might shift to a paradigm where models are no longer written as code, but are instead instantiated entirely from a database record. This move isn't too difficult, since Magma models contain very little actual code and can be almost completely defined by their template schema. When Magma boots, instead of instantiating the model from a class written in a .rb file, it can read entries from the database and instantiate them on the fly. E.g., if there is a 'rna_seq' model for the project 'ipi' defined in the magma_models table, Magma will connect this to the ipi.rna_seqs table at run-time.
Subsequently we can make edits to these models via a "migrations" api, the function of which will be to propagate changes to the model to the corresponding database table for the model. E.g., if I add a new attribute to the 'rna_seq' model for the project 'ipi' via the migrations api, which would simply create a new entry in the magma_attributes table, the API will also run a database migration to alter the ipi.rna_seqs table and add the corresponding column.
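Sketched out (every name here is hypothetical, from the lookup class to the database handle), that add-attribute path might look like:

# Hypothetical migrations-API handler: record the new attribute, then
# immediately propagate it to the model's backing table.
def add_attribute(project_name, model_name, attribute_name, type)
  model = Magma::ModelRecord.first(
    project_name: project_name, model_name: model_name
  )
  model.add_attribute(attribute_name: attribute_name, type: type.to_s)

  # e.g. alter ipi.rna_seqs to add the new column
  table = Sequel[project_name.to_sym][:"#{model_name}s"]   # naive pluralization
  Magma.instance.db.alter_table(table) do
    add_column attribute_name.to_sym, String   # map `type` to a real column type here
  end
end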
At first glance this sounds alarming, and rightly so - migrations have the possibility of destroying large portions of your dataset in one fell swoop, so having them applied at runtime seems very dangerous. To limit the danger we might do this:
1) Define a set of specific, atomic operations that are accessible via the migrations API. Off the top of my head, they are: create a column, remove a column, rename a column, add a new model to the model graph, remove a model from the graph.
2) Calculate the data changes resulting from the impending migration, display them to the user ("This change will remove all data from attribute X in model Y") and get their careful approval ("Please resend your request with the following hash to confirm") before proceeding.
3) Try to make the changes reversible.
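A rough server-side sketch of how (2) might work, with the confirmation hash derived from the operation and the number of affected records; the helper names and database handle are invented for illustration, and the closing comment gestures at (3):

require 'digest'

# Hypothetical guard for a destructive operation: report the damage and
# require the caller to echo back a confirmation hash before proceeding.
def remove_attribute(model, attribute_name, confirmation: nil)
  count = model.dataset.exclude(attribute_name => nil).count
  token = Digest::SHA256.hexdigest(
    "remove_attribute:#{model.table_name}:#{attribute_name}:#{count}"
  )

  unless confirmation == token
    return {
      warning: "This change will remove all data (#{count} items) from attribute #{attribute_name} in model #{model.table_name}.",
      confirmation: token   # resend the request with this hash to proceed
    }
  end

  # approved: drop the column (ideally after archiving it, to keep the change reversible)
  Magma.instance.db.alter_table(model.table_name) { drop_column attribute_name }
end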