matsengrp / torchdms

Analyze deep mutational scanning data with PyTorch
https://matsengrp.github.io/torchdms/

(Re-)implement Gaussian likelihood #14

Closed · matsen closed this issue 3 years ago

matsen commented 4 years ago

This has several components.

1. Generalize data loading and processing

Measurements sometimes come with an estimate of their variance. For example, if we are working with functional scores, we would also have an equal-length vector giving the variance of each functional score measurement.

I should also mention that we are going to have stability data soon as well, along with its measurements of variance.

How shall we store this sometimes-present information? We are already abusing BinaryMap by stuffing extra things in it. We could either add extra columns to BinaryMap, or we could come up with our own data structure.

@jgallowa07 , if we were going to use an xarray Dataset, would we come up with an object or just use a Dataset on its own? I assume we could store the one-hot matrix, various floating point data, and a string vector (the mutations themselves)?
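For concreteness, here is a minimal sketch of what such a Dataset might hold; the variable and dimension names are hypothetical, not a committed design:

import numpy as np
import xarray as xr

n_variants, n_features = 100, 20

ds = xr.Dataset(
    {
        # one-hot encoding of each variant
        "one_hot": (("variant", "feature"), np.zeros((n_variants, n_features), dtype=np.int8)),
        # functional score and its per-variant variance
        "func_score": ("variant", np.random.randn(n_variants)),
        "func_score_var": ("variant", np.abs(np.random.randn(n_variants))),
        # the mutations themselves, as strings
        "substitutions": ("variant", np.array(["M1A"] * n_variants, dtype=object)),
    }
)

Optional data (e.g. stability scores and their variances) would just be additional variables on the variant dimension, present or absent per dataset.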

2. Generalize training setup

Right now, the data expected by the training algorithm is hard-coded into the train method. I think that we should be able to replace the criterion argument with a function that takes a batch and a device and produces a loss. Then we could call train with any objective on whatever data is supplied in the batch.
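As a rough sketch of that idea (the names here are made up, not existing torchdms functions):

import torch

model = torch.nn.Linear(20, 1)  # stand-in for whatever model is being trained

def mse_loss_of_batch(batch, device):
    """What we currently hard-code, recast as a (batch, device) -> loss callable."""
    x, y = batch
    pred = model(x.to(device)).squeeze(-1)
    return torch.nn.functional.mse_loss(pred, y.to(device))

# train() would then only need: loss = loss_of_batch(batch, device); loss.backward()

A Gaussian objective would be another such callable that also unpacks the per-variant variances from the batch.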

We also need to generalize the current BinarymapDataset into several dataset classes that supply different forms of the data as needed. I guess we can have an ABC and then a bunch of subclasses defining different __getitem__ methods (sketched below)?
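Something like the following, with hypothetical class names:

import abc
from torch.utils.data import Dataset

class DmsDataset(Dataset, abc.ABC):
    """Hypothetical base class holding the pieces shared by all datasets."""

    def __init__(self, one_hot, func_score):
        self.one_hot = one_hot
        self.func_score = func_score

    def __len__(self):
        return self.one_hot.shape[0]

    @abc.abstractmethod
    def __getitem__(self, idx):
        ...

class ScoreDataset(DmsDataset):
    """Batches of (encoding, score): the current behavior."""

    def __getitem__(self, idx):
        return self.one_hot[idx], self.func_score[idx]

class ScoreVarianceDataset(DmsDataset):
    """Batches that also carry a per-variant variance, for the Gaussian loss."""

    def __init__(self, one_hot, func_score, func_score_var):
        super().__init__(one_hot, func_score)
        self.func_score_var = func_score_var

    def __getitem__(self, idx):
        return self.one_hot[idx], self.func_score[idx], self.func_score_var[idx]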

3. Actually implement a Gaussian likelihood

@ZorianThornton5 implemented this like so:


import math

import torch


def gaus_loglik(inputs, mu, var):
    """Log likelihood of the observations under a Gaussian, returned negated for use as a loss.

    Args:
        - inputs (tensor): Input labels (n x 1 tensor)
        - mu (tensor): Target to estimate (n x 1 tensor)
        - var (tensor): Variance(s) of distribution/variants (scalar or n x 1 tensor)

    Returns:
        - (tensor): Negative of the summed log likelihood
    """
    pi = torch.tensor(math.pi)
    # log N(inputs | mu, var) = -log(2*pi)/2 - log(sqrt(var)) - (inputs - mu)^2 / (2 * var)
    loglik = torch.sum(
        -torch.log(2 * pi) / 2
        - torch.log(torch.sqrt(var))
        - (inputs - mu) ** 2 / (2 * var)
    )
    return -loglik

Note that in our case, mu is the observation we want to fit, and var is the variance (also supplied by the data).

This may be the way to go, though if it works I would prefer to use PyTorch's Normal distribution, which comes equipped with a log_prob method that I can only imagine is back-proppable and vectorized.
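For reference, a minimal sketch of what that would look like (assuming per-variant variances are available as a tensor alongside the observations):

import torch
from torch.distributions import Normal

predictions = torch.randn(5)      # model output for 5 variants
observations = torch.randn(5)     # measured functional scores
variances = torch.rand(5) + 0.1   # per-variant measurement variances

# Normal is parameterized by the standard deviation; log_prob is vectorized
# and differentiable, so its negated sum can serve directly as the loss.
neg_loglik = -Normal(predictions, torch.sqrt(variances)).log_prob(observations).sum()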

4. Work on language

We are going to have "variants" and "variance" 😱. What to do?

matsen commented 4 years ago

Note that these aren't really sequential. Step 2 could be done first, as we could re-implement what we already have in a more general framework.

matsen commented 4 years ago

In other news, we need to include the HOC variance as a component of the model:

[image: model formulation including the HOC variance]
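Presumably this means each variant's total variance is its measured variance plus a shared, learned house-of-cards term, i.e. something like sigma^2_total,v = sigma^2_y,v + sigma^2_HOC (my paraphrase of the intent; the original image is not reproduced here).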

matsen commented 4 years ago

Vampire model config file: https://github.com/matsengrp/vampire/blob/master/vampire/demo/model_params.json

jgallowa07 commented 4 years ago

Alright, I'll explain more in depth tomorrow, but I think I've got #1 and #2 finished up in https://github.com/matsengrp/torchdms/tree/14-gaussian-sketch. I've generalized everything and added a model that predicts two targets. Take a look at notebooks/demo.ipynb to see it in action.

Note: The CLI has NOT been finished yet; I'll do that tomorrow.

zorian15 commented 4 years ago

So I've been thinking about an easy way to implement the Gaussian likelihood loss. The best option I can come up with (in terms of not ripping up things like analysis.py to be compatible with variance measurements) is to create a special case of VanillaGGE such that when a Gaussian or Cauchy loss is used, an extra input node is added for the independent measurement error of each variant (this would be sigma_yv). If there are no error estimates for the variants in the dataset, we simply pass a zero into this node for the variant.

This node would either:

a.) run to an intermediate node where we "fit the bias" (theoretically the house-of-cards epistasis variance) during training and then pass it through to the output node (the weight would have to be frozen at 1 if we wanted to keep the same formulation for total variance as the Otwin models); or

b.) run an edge from the input node straight to the output node and add a bias to the output node to represent the house-of-cards epistasis.

I think this would make coding this up easier, at least: just a few changes to model.py to incorporate this architecture, plus an extra parameter for loss_of_targets_and_prediction and complete_loss in analysis.py. We would also get to avoid fitting a least-squares additive model and the subsequent residual analysis to fit HOC epistasis. Thoughts?
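For what it's worth, here is a rough sketch of how option b.) could look, with all names hypothetical (this is not the existing VanillaGGE code):

import torch
from torch.distributions import Normal

class GaussianErrorOutput(torch.nn.Module):
    """Sketch of option b.): per-variant measurement variance runs straight to the
    output, plus a learned bias representing the house-of-cards epistasis variance."""

    def __init__(self):
        super().__init__()
        # Unconstrained parameter; softplus keeps the HOC variance positive.
        self.hoc_var_raw = torch.nn.Parameter(torch.zeros(1))

    def total_variance(self, measurement_var):
        # A variant with no error estimate (measurement_var == 0) keeps just the HOC term.
        return measurement_var + torch.nn.functional.softplus(self.hoc_var_raw)

    def loss(self, predictions, targets, measurement_var):
        scale = torch.sqrt(self.total_variance(measurement_var))
        return -Normal(predictions, scale).log_prob(targets).sum()

Whether this lives in the model or in the loss function is exactly the design question above; the sketch just shows that the HOC term can be a single trainable parameter added to each variant's measurement variance.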