Closed · matsen closed this issue 3 years ago
Note that these aren't really sequential. Step 2 could be done first, as we could re-implement what we already have in a more general framework.
In other news, we need to include the HOC variance as a component of the model:
Vampire model config file: https://github.com/matsengrp/vampire/blob/master/vampire/demo/model_params.json
Alright, I'll explain more in depth tomorrow, but I think I've got #1 and #2 finished up in https://github.com/matsengrp/torchdms/tree/14-gaussian-sketch. I've generalized everything and added a model to predict two targets. Take a look at notebooks/demo.ipynb to see it in action.
Note: The CLI has NOT been finished yet, I'll do that tomorrow.
So I've been thinking about an easy way to implement the Gaussian likelihood loss, and probably the best thing I can think of (in terms of not ripping up things like `analysis.py` to be compatible with variance measurements) is to create a special case of `VanillaGGE` such that when Gaussian or Cauchy loss is used, an extra input node is added for the independent measurement error of each variant (this would be `sigma_yv`). If there are no error estimates for the variants in the dataset, we simply pass a zero into this node for the variant. This node would either (a) run to an intermediate node where we sort of "fit the bias" (theoretically the house-of-cards epistasis variance) during training, and then pass this through to the output node (weights would have to be frozen at 1 if we wanted to keep the same formulation for total variance as the Otwin models), or (b) run the edge from the input node straight to the output node and add a bias to the output node to represent the house-of-cards epistasis.

I think this would make coding this up easier, at least: just a few changes to `model.py` to incorporate this architecture, plus adding an extra parameter to `loss_of_targets_and_prediction` and `complete_loss` in `analysis.py`. We would also get to avoid fitting a least-squares additive model and the subsequent residual analysis to fit HOC epistasis.
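A rough sketch of option (b) in PyTorch (the class and method names here are mine, not existing torchdms code, and the architecture is simplified to a single additive layer):

```python
import torch

class SketchGaussianModel(torch.nn.Module):
    """Hypothetical sketch of option (b): the per-variant measurement
    variance bypasses the network and feeds the loss directly, while a
    single learned scalar plays the role of house-of-cards variance."""

    def __init__(self, n_features):
        super().__init__()
        self.additive = torch.nn.Linear(n_features, 1)
        # log-parameterized so the HOC variance contribution stays positive
        self.log_hoc_var = torch.nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # predicted phenotype per variant
        return self.additive(x).squeeze(-1)

    def total_variance(self, var_y):
        # sigma^2_total = sigma^2_yv + sigma^2_HOC; var_y is a zero
        # tensor when no per-variant error estimate is available
        return var_y + self.log_hoc_var.exp()
```

The `total_variance` output would then be what the Gaussian (or Cauchy) loss uses as its scale, so variants without error estimates fall back to the fitted HOC variance alone.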
Thoughts?
This has several components.
1. Generalize data loading and processing
Measurements sometimes come with an estimate of their variance. For example if we are interested in a functional score, we would have an equal-sized vector describing the variance of each functional score measurement.
I should also mention that we are going to have stability data soon as well, along with its measurements of variance.
How shall we store this sometimes-present information? We are already abusing BinaryMap by stuffing extra things in it. We could either add extra columns to BinaryMap, or we could come up with our own data structure.
@jgallowa07 , if we were going to use an xarray Dataset, would we come up with an object or just use a Dataset on its own? I assume we could store the one-hot matrix, various floating point data, and a string vector (the mutations themselves)?
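If we went the xarray route, a bare `Dataset` holding everything we currently stuff into BinaryMap might look like this (variable names and dimensions are illustrative assumptions, with toy data):

```python
import numpy as np
import xarray as xr

# Hypothetical layout: one Dataset carrying the one-hot matrix, the
# functional scores, their sometimes-present variances, and the
# mutation strings, all aligned along a shared "variant" dimension.
n_variants, n_features = 3, 4
ds = xr.Dataset(
    {
        "one_hot": (("variant", "feature"), np.eye(n_variants, n_features)),
        "func_score": ("variant", np.array([0.1, -0.5, 0.3])),
        "func_score_var": ("variant", np.array([0.01, 0.02, 0.01])),
        "aa_substitutions": ("variant", np.array(["M1A", "G2C", "K3R"])),
    }
)
```

The appeal is that optional variables like `func_score_var` can simply be absent, and everything stays index-aligned without a wrapper class; whether we'd still want a thin object around it is the open question.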
2. Generalize training setup
Right now, the data expected for the training algorithm is hard-coded into the `train` method. I think that we should be able to replace the `criterion` argument with a function that takes a `batch` and a `device` and produces a loss. Then we can call `train` with any objective on whatever data is supplied in the batch.

We also need to generalize the current `BinarymapDataset` to several versions that supply different versions of the data as needed. I guess we can have an ABC and then a bunch of subclasses defining different `__getitem__` methods?

3. Actually implement a Gaussian likelihood
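A sketch of the generalized criterion signature from step 2 (the closure pattern, batch layout, and names are assumptions for illustration, not existing torchdms API):

```python
import torch

def make_criterion(model):
    """Hypothetical factory: the returned criterion takes only a batch
    and a device, so `train` never needs to know which data the
    objective consumes -- it just forwards whatever the Dataset yields."""

    def criterion(batch, device):
        # (features, targets) batch layout is assumed for this sketch;
        # a variance-aware Dataset would yield a third tensor here.
        x, y = (t.to(device) for t in batch)
        return torch.nn.functional.mse_loss(model(x).squeeze(-1), y)

    return criterion
```

Swapping objectives then means swapping the factory, with `train` itself unchanged.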
@ZorianThornton5 implemented this like so:
Note that in our case, `mu` is the observation we want to fit, and `var` is the variance (also supplied by the data).

This may be the way to go, though if it works I would prefer to use PyTorch's `Normal` distribution, which is equipped with a `log_prob` method that I can only imagine is back-proppable and vectorized.
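For reference, a minimal version of that loss using `torch.distributions.Normal` (the function name is mine):

```python
import torch
from torch.distributions import Normal

def gaussian_nll(y_obs, y_pred, var):
    """Mean negative log likelihood of the observations under
    N(y_pred, var). Normal takes a standard deviation, hence the sqrt;
    log_prob is vectorized and differentiable, so this back-props."""
    return -Normal(y_pred, var.sqrt()).log_prob(y_obs).mean()
```

Since `var` would be the total variance (per-variant measurement variance plus the fitted HOC term), this drops straight into the criterion interface from step 2.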
4. Work on language
We are going to have "variants" and "variance" 😱. What to do?