Closed hammer closed 7 years ago
As I understood it the idea was to train a net on IEDB, use that net to hallucinate examples for alleles that lack training data, then continue to tweak the net until it can hallucinate examples that can't be discriminated from real examples.
Putting a few links up here:
Another link in that space: Adversarial Autoencoders, and a blog post by Dustin Tran on it: http://dustintran.com/blog/adversarial-autoencoders/
Continuing the discussion, here is what I was originally proposing to try out. I don't have strong reason to believe this is better, but it would be interesting to try.
So part 1) my original question: there seem to be two methods so far:
or
I would think that a network that takes in two inputs, the peptide and a 1-hot encoding of the allele, could perform decently and share data across peptides. It would use two nets that each take one of those inputs, learn an embedding, and then combine those embeddings to make a prediction.
I would think that in doing so, the final classification would be on a latent representation of the peptide and a latent representation of the allele, which would hopefully incorporate, in the allele representation, each allele's similarity to other alleles, and, in the peptide representation, the similarity between peptides.
Getting the network to learn this may be difficult, and combining the two input types might take something smarter than what I'm imagining, but I think this could make a decent net that allows alleles to 'borrow' data from peptides they don't have measurements for.
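To make the two-branch idea concrete, here is a minimal NumPy sketch of the forward pass: each branch embeds its input separately, the embeddings are concatenated, and a sigmoid output predicts binding. All the dimensions and weights here are hypothetical placeholders (random, untrained), just to show the dataflow; a real version would learn these weights.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ALLELES = 100        # hypothetical number of alleles
PEP_LEN, N_AA = 9, 20  # 9-mer peptides, 20 amino acids
EMBED_DIM = 32         # hypothetical embedding size

# Random stand-ins for learned weights, to illustrate shapes only
W_pep = rng.normal(size=(PEP_LEN * N_AA, EMBED_DIM))
W_allele = rng.normal(size=(N_ALLELES, EMBED_DIM))
W_out = rng.normal(size=(2 * EMBED_DIM,))

def predict(peptide_onehot, allele_index):
    """Embed peptide and allele separately, concatenate, predict binding."""
    pep_embed = np.tanh(peptide_onehot.ravel() @ W_pep)   # peptide branch
    allele_embed = np.tanh(W_allele[allele_index])        # allele branch (1-hot lookup)
    combined = np.concatenate([pep_embed, allele_embed])
    return 1.0 / (1.0 + np.exp(-(combined @ W_out)))      # sigmoid output

# A random one-hot-encoded 9-mer peptide
peptide = np.zeros((PEP_LEN, N_AA))
peptide[np.arange(PEP_LEN), rng.integers(0, N_AA, PEP_LEN)] = 1.0
score = predict(peptide, allele_index=5)
```

Because the allele branch is just a row lookup into `W_allele`, alleles with few measurements still share the peptide branch's weights, which is where the 'borrowing' would come from.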
I've prototyped a network like this here
Not sure how to benchmark it. @iskandr, is there a common benchmarking approach I could test against? I've used the validation scores as a marker, but I'm not sure what range is expected.
The network coded there is:
Part 2) If the above net has reasonable performance, we could then use one of the semi-supervised techniques to improve it: essentially feeding in additional peptide/allele pairs where we don't know the binding affinity, and refining the latent representations.
@arahuja This looks like a great idea. One possible extension is to replace the one-hot encoding with a dense vector encoding of the MHC sequence (so that it could generalize to out-of-sample alleles).
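One simple way to build such a dense vector, sketched below, is to one-hot encode the allele's pseudo-sequence position by position and flatten it (a real version might use BLOSUM rows or a learned embedding instead). The example pseudo-sequence string here is made up for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_mhc_sequence(pseudo_sequence):
    """Encode an MHC pseudo-sequence as a flat dense vector.

    One-hot per position, then flattened; unlike a 1-hot allele
    encoding, this lets the model score alleles never seen in training,
    as long as their sequence is known.
    """
    onehot = np.zeros((len(pseudo_sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(pseudo_sequence):
        onehot[pos, AA_INDEX[aa]] = 1.0
    return onehot.ravel()

vec = encode_mhc_sequence("YFAMYQENMAHTDANTLYII")  # hypothetical 20-residue pseudo-sequence
```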
All the evaluation metrics I've been tracking are averaged across alleles -- in the case where you're training a multi-allele model and holding out some small portion of a single allele's data the evaluation time gets really long. Example: https://github.com/hammerlab/mhcflurry/blob/soft-pretraining/experiments/best-synthetic-data-hyperparams.py#L144
@iskandr Thinking of doing a (literal) last-minute attempt at this for an ICLR workshop paper: http://beta.openreview.net/group?id=ICLR.cc/2016/workshop It would be nice to summarize what I'm thinking here with a few experiments - what do you think?
Last minute indeed :-)
We could benchmark using the Kim 2014 BD2009 dataset as training and the BLIND dataset as test, and compare matrix-completion = {none, KNN, softImpute, svdImpute, MICE} vs. your predictor vs. NetMHC 3.x.
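For reference, one of those matrix-completion baselines can be sketched in a few lines of NumPy. Below is an iterative low-rank imputer in the spirit of svdImpute/softImpute (not the actual R implementations): fill missing entries with column means, take a rank-k SVD, re-fill the missing entries from the low-rank approximation, and repeat. The rank and iteration count are arbitrary defaults.

```python
import numpy as np

def svd_impute(X, rank=5, n_iter=50):
    """Iterative low-rank SVD imputation (svdImpute-style sketch)."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initialize missing entries with column means
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries fixed; update only the missing ones
        filled = np.where(missing, approx, X)
    return filled
```

On a peptide-by-allele affinity matrix this would fill in the unmeasured pMHC entries, which could then be compared against the held-out BLIND measurements.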
@iskandr Could you point me towards these datasets?
Here's a CSV of the BD2009 training set. Each row is a distinct peptide (~19k), each column is an allele (~100). The entries are max(0, 1 - log_50000(ic50)) of each pMHC IC50 affinity value.
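The transform above maps IC50 affinities onto a 0-1 regression target (1 nM or tighter maps to 1.0, 50000 nM to 0.0). A minimal sketch:

```python
import math

def ic50_to_regression_target(ic50_nM):
    """max(0, 1 - log_50000(ic50)): strong binders near 1, non-binders at 0."""
    return max(0.0, 1.0 - math.log(ic50_nM, 50000))
```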
No longer doing imputation so closing for now
@arahuja has some ideas on this one.