Possible inconsistencies in the dataset.

grayfall commented 5 years ago

Dear MHCflurry team,

We are currently reviewing several open-source MHC-peptide-binding affinity predictors. Our routine includes retraining, independent benchmarking and an in-depth analysis of the training and testing datasets provided by the authors. While working with MHCflurry, we've found two potential issues in the 'curated_training_data.with_mass_spec.csv.bz2' dataset you've used to train MHCflurry:

1. An unreasonably large fraction of exact IC50 measurements (i.e. '='-labeled entries) are equal to 20000nM. As far as we know, most of these data come from Buus lab assays with affinity values capped at 20000nM (see, for example, this paper by Peters et al. https://www.ncbi.nlm.nih.gov/pubmed/16789818). Is it possible that you have accidentally mislabeled these inexact measurements, i.e. assigned '=' instead of '>'?
2. We've noticed that quite a few quantitative entries present in both 'curated_training_data.with_mass_spec.csv.bz2' and IEDB lack any inequality labels in the latter. Have you used a different source of information to infer inequalities for such entries, or do you consider all unlabelled quantitative entries in IEDB exact (i.e. '=')?
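A check along the lines of the first point could look like the minimal sketch below. It assumes the rows have already been read into dictionaries (for example with `csv.DictReader`) and that the value and inequality columns are called `measurement_value` and `measurement_inequality`; both names are assumptions and may need adjusting to the actual file.

```python
def fraction_at_cap(rows, cap=20000.0,
                    value_key="measurement_value",
                    ineq_key="measurement_inequality"):
    """Fraction of '='-labeled rows whose value equals the suspected assay cap.

    The column names are assumed, not taken from the dataset itself.
    """
    exact = [r for r in rows if r[ineq_key] == "="]
    if not exact:
        return 0.0
    at_cap = sum(1 for r in exact if float(r[value_key]) == cap)
    return at_cap / len(exact)


# Toy rows for illustration only; these are not real measurements.
toy_rows = [
    {"measurement_value": "20000", "measurement_inequality": "="},
    {"measurement_value": "20000.0", "measurement_inequality": "="},
    {"measurement_value": "35.2", "measurement_inequality": "="},
    {"measurement_value": "500", "measurement_inequality": "<"},
]
toy_fraction = fraction_at_cap(toy_rows)
```

A noticeably high fraction at a single round value would suggest a capped assay rather than exact measurements.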

We are looking forward to your response.

Best Regards, Ilia Korvigo.

timodonnell commented 5 years ago

Hi Ilia,

Thanks for taking a look at this. It would be great to find ways to improve the training data curation.

There are three ways that a measurement can have an inequality other than (=) in our training data:

1. It was derived from the Kim 2014 benchmark, in which case we use the inequality given in that dataset.
2. It's in IEDB as a 'qualitative' measurement, which we map to quantitative + inequality using the dict here.
3. It's a mass spec hit (only applicable to the curated_training_data.with_mass_spec.csv.bz2 training set), in which case we set its value to be < 500nM.

For quantitative measurements in IEDB (your question 2), we do set an (=) inequality, see: code here
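The rules described above can be summarized in a rough sketch. This is not the project's curation code: the record layout, the `source` labels, the function name and the values in the qualitative mapping are all hypothetical and only illustrate the decision order. The `< 500nM` value for mass spec hits is the one stated above.

```python
# Hypothetical qualitative-to-quantitative mapping. The values are placeholders
# for illustration, not the dict the project actually uses.
QUALITATIVE_TO_VALUE_AND_INEQUALITY = {
    "Positive": (500.0, "<"),
    "Negative": (5000.0, ">"),
}

MASS_SPEC_VALUE_NM = 500.0  # from the thread: mass spec hits are set to < 500nM


def assign_measurement(record):
    """Return (value_nM, inequality) for one record (hypothetical layout)."""
    source = record["source"]
    if source == "kim2014":
        # Benchmark rows keep the inequality given in that dataset.
        return float(record["value"]), record["inequality"]
    if source == "mass_spec":
        # Mass spec hits get a fixed upper bound.
        return MASS_SPEC_VALUE_NM, "<"
    if source == "iedb":
        label = record.get("qualitative")
        if label is not None:
            # Qualitative IEDB rows are mapped through the lookup table.
            return QUALITATIVE_TO_VALUE_AND_INEQUALITY[label]
        # Quantitative IEDB rows are treated as exact.
        return float(record["value"]), "="
    raise ValueError(f"unknown source: {source!r}")
```

Under these rules a capped Buus lab value that arrives through the quantitative IEDB path ends up labeled '=', which is the case raised in question 1.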

I think this answers both your questions, since the = 20000 you referred to in question 1 are coming from IEDB (spot checked a few but let me know if you disagree).

Does IEDB expose inequalities for e.g. these Buus lab 20,000 nM measurements? If there's a way to incorporate inequalities for these measurements, could be worth a try to see if it improves the models. However, in practice I don't think a measurement of = 20000 nM vs > 20000nM would likely affect the predictors significantly.

Hope this helps, let me know if anything is still unclear.

Tim

grayfall commented 5 years ago

Thank you very much for the quick reply.

84% of Buus records from your dataset are present in IEDB, though inequality labels do not match in most cases. You consider them exact (i.e. '=') across the board, while most of them are inexact in IEDB. Here is the inequality label distribution for these records as per IEDB:

timodonnell commented 5 years ago

I think I know what happened. The IEDB database export we used for the most recent training run (available here, which we downloaded from IEDB in Jan 2018) did not include measurement inequalities. We therefore used (=) for all quantitative measurements. But the IEDB database export feature ('mhc_ligand_full.zip' from here) seems to now have been updated to include measurement inequalities.

I agree using these inequalities next time we train mhcflurry predictors would make sense. Thanks for pointing this out!
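One way this could look, as a sketch only: read the inequality from the export row when it is present and fall back to (=) otherwise, which matches the earlier behavior. The column name `Measurement Inequality` is an assumption about the export header and should be checked against the actual file.

```python
VALID_INEQUALITIES = {"=", "<", ">"}


def inequality_from_export(row, column="Measurement Inequality"):
    """Return the inequality for one export row, defaulting to '='.

    `column` is an assumed header name; blank or unrecognized values
    fall back to '=' so rows without a label behave as before.
    """
    raw = (row.get(column) or "").strip()
    return raw if raw in VALID_INEQUALITIES else "="
```

For example, `inequality_from_export({"Measurement Inequality": ">"})` gives `">"`, while a row without that column gives `"="`.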

grayfall commented 5 years ago

@timodonnell I'm glad this might help you improve MHCflurry. Thank you for taking the time to consider this issue.

openvax / mhcflurry, issue #138