openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

IC50 vs percentile rank #132

Closed saskra closed 5 years ago

saskra commented 6 years ago

If I understand correctly, you are using IC50 values. I would rather train and test on percentile ranks. Is this possible somehow? It seems that simply training on them still yields IC50-like values in the evaluation, which lowers the correlation and other quality measures considerably.

timodonnell commented 6 years ago

Yes, we're using IC50 values for training and prediction. The percentile ranks are calibrated separately for each allele after training by generating a histogram of IC50 values across a large number of random peptides.
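To illustrate the calibration step described above, here is a minimal sketch. It is not mhcflurry's implementation: it uses a sorted array where the comment mentions a histogram, the function name `calibrate_percentile_ranks` is hypothetical, and the "predictions for random peptides" are made-up log-uniform numbers standing in for real model output.

```python
import numpy as np


def calibrate_percentile_ranks(random_peptide_ic50s):
    """Return a function mapping an IC50 to a percentile rank in [0, 100].

    Hypothetical helper: the reference distribution is the set of predicted
    IC50 values for a large number of random peptides for one allele.
    """
    reference = np.sort(np.asarray(random_peptide_ic50s, dtype=float))

    def percentile_rank(ic50):
        # Percentage of random peptides with a strictly lower (stronger)
        # predicted IC50 than the query value.
        position = np.searchsorted(reference, ic50, side="left")
        return 100.0 * position / len(reference)

    return percentile_rank


# Stand-in for model predictions on random peptides (assumed, not real data).
rng = np.random.default_rng(0)
fake_ic50s = np.exp(rng.uniform(np.log(1.0), np.log(50000.0), size=100_000))
rank = calibrate_percentile_ranks(fake_ic50s)
```

Because the ranks come from the distribution of predictions for one allele, they have to be recomputed whenever the model or the allele changes.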

If you want to train and test on values with a different scale, it should be an easy change, though. I'd suggest editing https://github.com/openvax/mhcflurry/blob/master/mhcflurry/regression_target.py to define the to_ic50 and from_ic50 functions as needed: from_ic50 should transform your raw values into numbers in the range 0.0-1.0, and to_ic50 should be its inverse. If you are working with raw percentile ranks already in the range 0.0-1.0, these could just be the identity function. Note, however, that the resulting predictions won't necessarily be calibrated (i.e. there is no guarantee that 20% of random peptides will have predictions < 0.2).
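As a rough sketch of what such a pair of functions can look like: the first pair below is a log-scale transform with a `max_ic50` parameter, written from the description in this thread and best checked against the actual file before relying on it. The second pair, `from_rank` and `to_rank`, uses hypothetical names for the identity variant suggested for percentile ranks already in 0.0-1.0.

```python
import numpy as np


def from_ic50(ic50, max_ic50=50000.0):
    """Map IC50 values to [0, 1] on a log scale (stronger binder -> closer to 1)."""
    ic50 = np.maximum(np.asarray(ic50, dtype=float), 1e-12)  # avoid log(0)
    x = 1.0 - np.log(ic50) / np.log(max_ic50)
    # Anything below 1 or above max_ic50 is clipped to the ends of the range.
    return np.clip(x, 0.0, 1.0)


def to_ic50(x, max_ic50=50000.0):
    """Inverse of from_ic50 for values that were not clipped."""
    return max_ic50 ** (1.0 - np.asarray(x, dtype=float))


def from_rank(rank):
    """Identity variant (hypothetical name): ranks are already in [0, 1]."""
    return np.clip(np.asarray(rank, dtype=float), 0.0, 1.0)


def to_rank(x):
    """Inverse of from_rank (hypothetical name)."""
    return np.asarray(x, dtype=float)
```

One thing to keep in mind with the identity variant: the log transform maps strong binders to values near 1, whereas a low percentile rank (a strong binder) stays near 0, so the direction of the target is reversed.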

Curious how this goes for you, let me know if you run into trouble.


saskra commented 6 years ago

Thanks! My values originally range from 0.1 to 100, so as a workaround I multiplied them by 10 (to keep the percentile rank values <1) and changed max_ic50=50000.0 to max_ic50=1000.0. Let's see how this works out.

I have other subsets with other ranges of values, e.g. one that contains only good binders with values <1. Do I have to adapt the source code for each of them, or is there a way to pass the limits as parameters somewhere?
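The thread does not say whether the limits can be passed in without editing the source. Purely as an illustration of the idea, a transform can take both limits as arguments; `make_transform` below is a hypothetical helper, not part of mhcflurry, and it assumes a log scale with strictly positive limits.

```python
import math


def make_transform(min_value, max_value):
    """Build (forward, inverse) functions for a given value range.

    Hypothetical helper: forward maps [min_value, max_value] to [0, 1] on a
    log scale, with min_value -> 1.0 and max_value -> 0.0.
    """
    if not 0.0 < min_value < max_value:
        raise ValueError("limits must satisfy 0 < min_value < max_value")
    log_min = math.log(min_value)
    span = math.log(max_value) - log_min

    def forward(value):
        # Clamp into the configured range so the result stays in [0, 1].
        value = min(max(value, min_value), max_value)
        return 1.0 - (math.log(value) - log_min) / span

    def inverse(x):
        return math.exp(log_min + (1.0 - x) * span)

    return forward, inverse


# Example: the 0.1 to 100 range mentioned above, with no rescaling by 10.
forward, inverse = make_transform(0.1, 100.0)
```

A subset with a different range would then only need different arguments, e.g. `make_transform(0.01, 1.0)` for values below 1.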

timodonnell commented 5 years ago

We're focused on training on publicly available datasets for now but curious to hear how this went for you. Closing for now.