girder / viime

https://viime.org
Apache License 2.0

Add mutex categories to transformation model #46

Closed. subdavis closed this issue 5 years ago.

subdavis commented 5 years ago

In order to be able to reload page state, there must be some way to tell which category each saved transformation belongs to.

For example, there's no way to know that a transformation saved on the server with type log_2 belongs to the "transformation" category (we also need distinct names for the generic category and the specific operation).

We can hard-code all the categories into the client (and I'm doing that now), but for each transformation I'd have to search every list to answer "is this type in this category's list?"

How do we signify that vast_scaling is a type of scaling, so that the vast radio button can be selected in the scaling group?

In a similar vein, should the server enforce mutual exclusion within these special categories? If I try to add vast and pareto scaling, should I (the web client) have to go to the trouble of deleting one and adding the other, or should the "radio button" mutex behavior happen on the server? Right now, I'm doing the client-side logic, but it isn't atomic so it would be easy to get into unsafe states.

Are mutually exclusive transformations the rule or the exception? (They seem like the rule so far.)

Proposal

I propose adding a new nullable field category to the transformation model and enforcing mutual exclusion for transformations where category is defined, using Postgres transactions.
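
A rough sketch of that shape, assuming SQLAlchemy on top of Postgres (table and column names here are illustrative, not the actual viime schema):

```python
# Hypothetical sketch: a nullable `category` column plus server-side
# "radio button" mutual exclusion done inside a single database transaction.
from typing import Optional

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Transformation(Base):
    __tablename__ = 'transformation'

    id = Column(Integer, primary_key=True)
    csv_file_id = Column(String, nullable=False)
    transform_type = Column(String, nullable=False)  # e.g. 'log_2', 'vast_scaling'
    category = Column(String, nullable=True)         # e.g. 'scaling'; NULL = no mutex


def set_transformation(session: Session, csv_file_id: str,
                       transform_type: str, category: Optional[str]) -> None:
    """Add a transformation; if it has a category, replace any existing one."""
    with session.begin():  # the delete and insert commit (or roll back) together
        if category is not None:
            session.query(Transformation).filter_by(
                csv_file_id=csv_file_id, category=category).delete()
        session.add(Transformation(csv_file_id=csv_file_id,
                                   transform_type=transform_type,
                                   category=category))
```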

jbeezley commented 5 years ago

I think now that we understand better what the UI is going to look like, it would be best to pull back on the idea of an abstract "transformation" model. I'm considering putting the information directly into the csv_file model with new nullable columns:
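
A plausible sketch of those columns, with names inferred from the normalize/transform/scale endpoints below rather than taken from the real model:

```python
# Assumed sketch of the proposed csv_file columns; names are inferred from the
# endpoint paths below and are not the actual viime model.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class CSVFile(Base):
    __tablename__ = 'csv_file'

    id = Column(Integer, primary_key=True)
    normalization = Column(String, nullable=True)   # name of a normalization method
    transformation = Column(String, nullable=True)  # e.g. 'log_2'
    scaling = Column(String, nullable=True)         # e.g. 'pareto', 'vast_scaling'
```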

I can't think of any reason why you would need more than one of each of these categories. I don't think there is a particular need for user customization of the order in which they are applied either. If there are more categories of operations, then we can add more columns.

For the moment, I think I'll just make the columns strings denoting the type of operation. At some point, we will need to store arguments, but I'll put off deciding how to store that until we need it.

As for a public API, what do you think of something like the following? (Not sure if it should be PUT or POST, :thinking:)

PUT /csv/<id>/<normalize|transform|scale> {"type": <type>, "args": {<args>}}

Removing the operation could be either

DELETE /csv/<id>/<normalize|transform|scale>

or

PUT /csv/<id>/<normalize|transform|scale> {"type": null}
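
Either way the handler can stay small. A hedged sketch of the PUT variant using a plain Flask app, with an in-memory dict standing in for the csv_file table so the example is self-contained (none of this is the existing viime code):

```python
# Hypothetical endpoint sketch; a dict stands in for the database.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

OPERATIONS = ('normalize', 'transform', 'scale')

# {csv_id: {'normalize': ..., 'transform': ..., 'scale': ...}}
csv_files = {1: {op: None for op in OPERATIONS}}


@app.route('/csv/<int:csv_id>/<operation>', methods=['PUT'])
def set_operation(csv_id, operation):
    if operation not in OPERATIONS or csv_id not in csv_files:
        abort(404)

    body = request.get_json(force=True)
    # {"type": null} clears the operation, so no separate DELETE route is needed
    csv_files[csv_id][operation] = body.get('type')
    return jsonify({operation: csv_files[csv_id][operation]})
```
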
subdavis commented 5 years ago

I can't think of any reason why you would need more than one of each of these categories.

There are other types of transformations, like filters, infill, etc., that you could potentially stack.

The main reason I like the transformation model is that it's well defined and separated. If some developer wants to add new transformations later, they only need to implement a function that takes and returns a data frame. You can add entire categories of transformations without having to know or care what's in models.py or perform any database migrations.
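
For illustration only, that contract could be as small as the following (these are not the project's actual implementations):

```python
# Illustrative "DataFrame in, DataFrame out" contract for a new transformation.
import numpy as np
import pandas as pd


def log_2(table: pd.DataFrame) -> pd.DataFrame:
    """log_2 transformation: same shape in, same shape out."""
    return np.log2(table)


def vast_scaling(table: pd.DataFrame) -> pd.DataFrame:
    """VAST scaling: autoscale each column, then weight by its mean/std ratio."""
    mean, std = table.mean(), table.std()
    return (table - mean) / std * (mean / std)
```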

Is there something problematic about the current API, or are you trying to reduce code? The vuex store is pretty tied to the current implementation, and I (personally) don't understand enough about the client past the cleanup table to conclude that either way is better.

jeffbaumes commented 5 years ago

I like the reasoning @jbeezley, and I'd think of these as properties of the csv that start as null and get modified, so PUT makes sense to me, as does not using DELETE.

If we need tons more of these, or need it to be extensible, we could organize the code such that you register a name and associated processing function with the csv model (keeping the general REST API structure as @jbeezley proposes). But it seems architecturally heavy to do that now. There will indeed be a fixed set of transformations done in a fixed order for the foreseeable future.
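
If that extensibility ever does become necessary, a registry along these lines (purely illustrative, not existing code) would keep the REST shape the same while letting new operations be added without touching models.py:

```python
# Illustrative registry: operations are looked up by the name stored on the
# csv model, so adding one means writing a function rather than a migration.
from typing import Callable, Dict

import pandas as pd

SCALING_METHODS: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {}


def register_scaling(name: str):
    """Decorator that registers a scaling function under a public name."""
    def wrap(func: Callable[[pd.DataFrame], pd.DataFrame]):
        SCALING_METHODS[name] = func
        return func
    return wrap


@register_scaling('pareto')
def pareto(table: pd.DataFrame) -> pd.DataFrame:
    return (table - table.mean()) / table.std() ** 0.5


def apply_scaling(table: pd.DataFrame, name: str) -> pd.DataFrame:
    # The REST layer would look up whatever name is stored, e.g. csv_file.scaling
    return SCALING_METHODS[name](table) if name else table
```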

jeffbaumes commented 5 years ago

FWIW this is how I see the processing steps right now. Everything is fixed and linear up to normalized data, then things get a bit more complex. There is fan-in for integration and then fan-out for various chosen analyses.

  raw data
     |
     V
classify rows/columns
     |
     V
imputation/filtering
     |
     V
normalize
     |
     V
transform
     |
     V
   scale
     |
     V
normalized data -----> perform live PCA and/or other plots to vet normalization
     |
     +<--------------- (optional) other normalized data for integration
     |
     V
integrated data
     |
 +---+---+
 |   |   |
 V   V   V           
analyses (PCA, fold change, volcano, t-test, etc.)
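
The fixed, linear part of that diagram boils down to applying whichever operations are set, in order. A placeholder sketch (not the real implementation):

```python
# Placeholder sketch of the linear portion of the pipeline above.
import pandas as pd


def run_pipeline(raw: pd.DataFrame, normalize=None, transform=None, scale=None) -> pd.DataFrame:
    """Apply the chosen steps in the fixed order normalize -> transform -> scale."""
    table = raw
    for step in (normalize, transform, scale):
        if step is not None:   # a null column means "skip this step"
            table = step(table)
    return table               # the "normalized data" node in the diagram
```
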
subdavis commented 5 years ago

this is how I see the processing steps right now

That looks like what we've talked about so far, but I'm curious about the data fusion step after normalization.

Most normalization and scaling operations rely on some combination of σ, μ, min, and max. Is it statistically valid to merge data that have been independently normalized or scaled? Wouldn't the raw data need to be merged first before applying such transformations?

My recollection of stats is fuzzy at best, so if this isn't something I need to worry about, that's fine with me. This is more me thinking aloud than anything.

jbeezley commented 5 years ago

Is there something problematic about the current API, or are you trying to reduce code? The vuex store is pretty tied to the current implementation, and I (personally) don't understand enough about the client past the cleanup table to conclude that either way is better.

I would characterize the current API as overly generic. I made it that way because I didn't yet know the details of what we needed. There isn't currently anything wrong with it--I just think it is starting to smell like Stumpf "facts", which were created for similar reasons. I don't think adding new categories of transformations will occur often enough (or ever) to justify the additional burden it adds in generality.

Most normalization and scaling operations rely on some combination of σ, μ, min, and max. Is it statistically valid to merge data that have been independently normalized or scaled? Wouldn't the raw data need to be merged first before applying such transformations?

If I understand correctly, the normalization is done because the datasets are collected with different methods, possibly different units, etc. Joining them together without normalization wouldn't make any sense. The normalization should (in theory) transform them into identically distributed probability spaces where they can be directly compared.
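
A toy illustration of that point (not viime code): two datasets measured on very different scales become directly comparable once each is autoscaled within itself, which is why the fan-in for integration comes after normalization.

```python
# Toy example: independently autoscaling (z-scoring) two datasets puts them on
# the same unitless scale, after which joining them is meaningful.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Same kind of signal, but dataset B was measured in different units
a = pd.DataFrame({'metabolite_1': rng.normal(100, 10, 50)})
b = pd.DataFrame({'metabolite_2': rng.normal(0.5, 0.05, 50)})


def autoscale(table: pd.DataFrame) -> pd.DataFrame:
    return (table - table.mean()) / table.std()


# Normalize each dataset on its own, then fan in for integration
integrated = pd.concat([autoscale(a), autoscale(b)], axis=1)
print(integrated.describe().loc[['mean', 'std']])  # both columns: ~0 mean, ~1 std
```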