Enable variant effect prediction for models with multidimensional output

kipoi / kipoi

Kipoi's model zoo API

https://kipoi.org/docs/

MIT License


Enable variant effect prediction for models with multidimensional output #282

Open krrome opened 6 years ago

krrome commented 6 years ago

At the moment the scoring function can be user-defined, but its input already has to be 1D. The 1D requirement is really only needed for VCF file annotation, so it probably makes sense to drop it in the general case and enforce it only when a VCF file is to be annotated. In that case an nD -> 1D transformation could be applied either to the model outputs (before the scoring function) or to the scores (after the scoring function). We could define a set of "useful" nD -> 1D output conversions for these kinds of models.
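
A minimal sketch of what such a set of nD -> 1D conversions could look like. The function name `reduce_to_1d`, the `how` argument and the choice of reducers are hypothetical and not part of Kipoi; the sketch also assumes that the last axis of the prediction holds the model's output columns and that every other axis should be collapsed.

```python
import numpy as np

# Hypothetical helper, not part of the Kipoi API.
def reduce_to_1d(pred, how="mean"):
    """Collapse every axis except the last: shape (..., n_outputs) -> (n_outputs,)."""
    reducers = {"mean": np.mean, "max": np.max, "sum": np.sum}
    if how not in reducers:
        raise ValueError(f"unknown reduction: {how!r}")
    # Reduce over all leading axes, keeping one value per output column.
    axes = tuple(range(pred.ndim - 1))
    return reducers[how](pred, axis=axes)

# Toy prediction for one variant: 3 bins x 2 output columns.
pred = np.arange(6, dtype=float).reshape(3, 2)
per_column_mean = reduce_to_1d(pred, "mean")
```

The same helper could be applied before or after the scoring function, since it only depends on the array shape.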

Avsecz commented 6 years ago

Maybe it would be useful to let the user define any numpy expression, e.g. 'sum(x, axis=-1)'. We say that x is the variable to transform and then use the following to run the function:


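def preproc(x, user_fn):
    import numpy as np
    # `from numpy import *` is a SyntaxError inside a function in Python 3,
    # so expose numpy's names and `x` to the expression explicitly instead.
    namespace = {**vars(np), "x": x}
    return eval(user_fn, namespace)  # only evaluate trusted expressions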

krrome commented 6 years ago

That would work. The only "problem" I see is that I am currently trying to load "column_labels" for the 1D model output so that the VCF entries can be labelled nicely.

At the moment "column_labels" are not mandatory in model.yaml anyway, so I am already generating a numerical index if they are missing. This would also work for your suggestion. It would just never be possible in that case to assign "column_labels" properly, as I can't know how the output of the user-defined function relates to the original model output.
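
Roughly, the fallback could look like the sketch below. The helper name `output_labels` is hypothetical and this is not the actual Kipoi code; it assumes "column_labels" arrives as a list (or None when model.yaml does not define it) and that labels are only trusted when their count matches the 1D output length.

```python
# Hypothetical helper, not the actual Kipoi implementation.
def output_labels(column_labels, n_outputs):
    """Return one label per output column, falling back to a numerical index."""
    if column_labels is not None and len(column_labels) == n_outputs:
        return list(column_labels)
    # Labels are missing, or a user-defined transform changed the output
    # length so they can no longer be mapped: use "0", "1", ... instead.
    return [str(i) for i in range(n_outputs)]

labels = output_labels(None, 3)
```

A transform that happens to keep the output length would still pass the length check, so the check cannot prove that the labels are still meaningful.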

jeffmylife commented 6 years ago

If you add a function that converts nD -> 1D (multiple rows to one row), wouldn't you lose all resolution? The input size is 131 kb and the output separates this into bins (the rows), each with multiple predicted scores (the columns). If you reduce the rows to one row, you lose too much resolution. If you reduce all the column values (by some aggregation function) to one number per row (bin), there is no way to interpret that number. So there should probably be a separate way to deal with multidimensionality that doesn't involve a reduction in dimension. Could you do a .vcf section for each row of the output? Each section (there would be 960) would correspond to one 128 bp region. It would be a huge file, though.
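
One way to keep every bin without aggregating is to flatten the (bins x columns) output into a single long row whose labels record both the bin and the column. This is only a sketch of that idea: the function name `flatten_with_labels` and the label format are made up for illustration, and it assumes a 2D prediction per variant.

```python
import numpy as np

# Hypothetical helper illustrating the "one entry per bin and column" idea.
def flatten_with_labels(pred, column_labels):
    """Flatten a (n_bins, n_cols) array to 1D and build matching labels."""
    n_bins, n_cols = pred.shape
    if len(column_labels) != n_cols:
        raise ValueError("need exactly one label per output column")
    # Row-major flattening: all columns of bin 0, then all columns of bin 1, ...
    labels = [f"bin{b}:{c}" for b in range(n_bins) for c in column_labels]
    return pred.reshape(-1), labels

# Toy example: 3 bins x 2 columns instead of 960 bins.
flat, labels = flatten_with_labels(np.arange(6).reshape(3, 2), ["a", "b"])
```

With 960 bins the number of annotation fields per variant is 960 times the number of columns, which is the file-size concern raised above.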

Avsecz commented 5 years ago

@krrome is this still relevant?