openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0
193 stars 58 forks

Explore alternatives to matrix completion for lifting performance on alleles with few examples #7

Closed hammer closed 7 years ago

hammer commented 8 years ago

@arahuja has some ideas on this one.

hammer commented 8 years ago

As I understood it, the idea was to train a net on IEDB, use that net to hallucinate examples for alleles that lack training data, then continue to tweak the net until its hallucinated examples can't be discriminated from real ones.
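The "can't be discriminated from real examples" criterion can be illustrated with a deliberately tiny toy (everything here is invented for illustration and is not code from this repo): real and hallucinated affinities are 1-D numbers, and a logistic-regression "discriminator" is trained to tell them apart. High discriminator accuracy means the hallucinations are still easy to spot and the generating net needs more tweaking; accuracy near chance is the stopping condition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: "real" measured affinities for a data-rich allele,
# and "hallucinated" affinities for a data-poor allele (invented numbers)
real = rng.normal(0.7, 0.05, 500)
hallucinated = rng.normal(0.4, 0.05, 500)

def discriminator_accuracy(real, fake, steps=500, lr=0.5):
    """Train a tiny logistic-regression discriminator (real=1, fake=0)
    by gradient descent and report its training accuracy. Accuracy near
    0.5 means the fake examples are hard to tell apart from real ones."""
    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w -= lr * np.mean((p - y) * x)
        b -= lr * np.mean(p - y)
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return np.mean((p > 0.5) == y)

# Easy to discriminate -> the hallucinating net needs more refinement
print(discriminator_accuracy(real, hallucinated))
# If the hallucinations matched the real distribution, accuracy drops to chance
print(discriminator_accuracy(real, rng.normal(0.7, 0.05, 500)))
```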

hammer commented 8 years ago

Putting a few links up here:

arahuja commented 8 years ago

Another link in that space: Adversarial Autoencoders, and a blog post on it by Dustin Tran: http://dustintran.com/blog/adversarial-autoencoders/

arahuja commented 8 years ago

Continuing the discussion, here is what I was originally proposing to try out. I don't have a strong reason to believe this is better, but it would be interesting to try.

So for part 1), my original question: there seem to be two methods so far:

  1. Train a network per allele, using however many samples are available.

or

  2. Take a matrix of allele/peptide pairs and "fill in" the binding values for allele/peptide pairs that don't have one (essentially learning the similarity of alleles based on which peptides they bind to, right?), then use that data to train a net, individually for each allele, but augmented with the "filled-in" values. Maybe these are sampled in some way based on how similar or different they were to the original alleles, or on how confident the original values were? @iskandr is this about right? I don't know the exact details, but the process is similar to this?

I would think that a network that takes two inputs, the peptide and a one-hot encoding of the allele, could perform decently and share data across peptides. It would use two nets that each take one of those inputs and learn an embedding, then combine those embeddings to make a prediction.

I would think that in doing so, the final classification would be made on a latent representation of the peptide and a latent representation of the allele, which would hopefully incorporate, in the allele representation, each allele's similarity to other alleles and, in the peptide representation, the similarity of the peptides.

Getting the network to learn this may be difficult, and combining the two input types might take something smarter than what I'm imagining, but I think this could make a decent net that allows alleles to 'borrow' data on peptides that we don't have for them as input.
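A minimal forward-pass sketch of the two-branch idea, in plain NumPy with invented dimensions (a real implementation would learn the weights by training; nothing here is code from the repo). Note that multiplying a one-hot allele vector by a weight matrix is equivalent to an embedding row lookup, which is how the allele branch is written:

```python
import numpy as np

rng = np.random.default_rng(0)

N_ALLELES = 100        # assumed number of alleles in the training data
PEPTIDE_DIM = 9 * 20   # e.g. a 9-mer, one-hot over 20 amino acids, flattened
EMBED_DIM = 32         # assumed embedding size for both branches

# Each branch maps its input into a shared-size embedding space; the two
# embeddings are concatenated and fed to a final layer predicting binding.
W_pep = rng.normal(0, 0.1, (PEPTIDE_DIM, EMBED_DIM))
W_allele = rng.normal(0, 0.1, (N_ALLELES, EMBED_DIM))  # one-hot @ W == row lookup
W_out = rng.normal(0, 0.1, (2 * EMBED_DIM, 1))

def predict(peptide_onehot, allele_index):
    pep_embed = np.tanh(peptide_onehot @ W_pep)     # peptide branch
    allele_embed = np.tanh(W_allele[allele_index])  # allele branch (embedding lookup)
    combined = np.concatenate([pep_embed, allele_embed])
    return 1.0 / (1.0 + np.exp(-(combined @ W_out)))  # sigmoid -> value in [0, 1]

peptide = rng.integers(0, 2, PEPTIDE_DIM).astype(float)
print(float(predict(peptide, allele_index=7)[0]))
```

Because all alleles share the peptide branch, every training example improves the peptide representation, which is the sense in which a rare allele "borrows" peptide data from the common ones.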

I've prototyped a network like this here

Not sure how to benchmark it. @iskandr, is there a common benchmarking approach I could test against? I've used the validation scores as a marker, but I'm not sure what range is expected.

The network coded there is: [network architecture diagram attachment]

Part 2) If the above net performs reasonably well, we could then use one of the semi-supervised net techniques to improve it: essentially feeding in additional peptide/allele pairs where we don't know the binding affinity, and refining the latent representations.

iskandr commented 8 years ago

@arahuja This looks like a great idea. One possible extension is to replace the one-hot encoding with a dense vector encoding of the MHC sequence (so that it could generalize to out-of-sample alleles).
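For concreteness, one form such a dense sequence encoding could take is a per-residue one-hot encoding of an MHC pseudosequence, flattened to a fixed-length vector; the 34-residue pseudosequence below is an invented example for illustration, not a real allele's sequence:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_mhc_pseudosequence(seq):
    """One-hot encode each residue of an MHC pseudosequence and flatten,
    giving a dense fixed-length vector instead of a one-hot allele index.
    Two alleles with similar sequences get similar vectors, so the model
    can generalize to alleles unseen in training."""
    vec = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

# Hypothetical 34-residue pseudosequence (illustrative only)
example = "YFAMYQENMAHTDANTLYIIYRDYTWVARVYRGY"
v = encode_mhc_pseudosequence(example)
print(v.shape)  # (34 * 20,) = (680,)
```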

All the evaluation metrics I've been tracking are averaged across alleles -- in the case where you're training a multi-allele model and holding out a small portion of a single allele's data, evaluation takes a really long time. Example: https://github.com/hammerlab/mhcflurry/blob/soft-pretraining/experiments/best-synthetic-data-hyperparams.py#L144

arahuja commented 8 years ago

@iskandr I'm thinking of making a (literal) last-minute attempt at this for an ICLR workshop paper: http://beta.openreview.net/group?id=ICLR.cc/2016/workshop It would be nice to summarize what I'm thinking here with a few experiments - what do you think?

iskandr commented 8 years ago

Last minute indeed :-)

We could benchmark using the Kim 2014 BD2009 dataset as training and the BLIND dataset as test, and compare matrix completion = {none, KNN, softImpute, svdImpute, MICE} vs. your predictor vs. NetMHC 3.x
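As an illustration of what the KNN matrix-completion variant does, scikit-learn's KNNImputer fills each missing cell from the most similar rows; this is a stand-in sketch on a toy matrix with invented values, not necessarily the imputer used in these experiments:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy peptide x allele affinity matrix with missing entries (np.nan).
# Rows: peptides, columns: alleles; values on the 0-1 affinity scale.
affinities = np.array([
    [0.90, 0.80, np.nan],
    [0.85, np.nan, 0.20],
    [np.nan, 0.75, 0.25],
    [0.10, 0.15, 0.90],
])

# Fill in each missing cell from the k most similar peptides (rows)
imputer = KNNImputer(n_neighbors=2)
completed = imputer.fit_transform(affinities)
print(completed)  # same shape, no NaNs remaining
```

The "none" condition in the comparison would simply train on the observed cells; softImpute, svdImpute, and MICE are alternative completion strategies applied to the same matrix.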

arahuja commented 8 years ago

@iskandr Could you point me towards these datasets?

iskandr commented 8 years ago

Here's a CSV of the BD2009 training set. Each row is a distinct peptide (~19k), each column is an allele (~100). The entries are max(0, 1 - log_50000(ic50)) of each pMHC IC50 affinity value.
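A small helper reproducing that transform (assuming, as is conventional for these datasets, IC50 values in nM with a 50000 nM cap):

```python
import math

def ic50_to_regression_target(ic50_nM, max_ic50=50000.0):
    """Map an IC50 affinity (nM) to the 0-1 scale used in the CSV:
    max(0, 1 - log_50000(ic50)). Strong binders (low IC50) map near 1;
    values at or above 50000 nM map to 0."""
    return max(0.0, 1.0 - math.log(ic50_nM, max_ic50))

print(ic50_to_regression_target(1.0))      # 1.0
print(ic50_to_regression_target(50000.0))  # 0.0
print(ic50_to_regression_target(500.0))    # ~0.43
```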

incomplete-bd2009.csv.gz

timodonnell commented 7 years ago

No longer doing imputation, so closing for now.