Hi Ilia,
Thanks for taking a look at this. It would be great to find ways to improve the training data curation.
There are three ways that a measurement can have an inequality other than (=) in our training data:
1. It was derived from the Kim 2014 benchmark, in which case we use the inequality given in that dataset.
2. It's in IEDB as a 'qualitative' measurement, which we map to quantitative + inequality using the dict here.
3. It's a mass spec hit (only applicable to the 'curated_training_data.with_mass_spec.csv.bz2' training set), in which case we set its value to be < 500 nM.
For quantitative measurements in IEDB (your question 2), we do set an (=) inequality; see the code here.
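To make the logic concrete, here is a minimal sketch of those rules. The mapping values and the column names ('source', 'qualitative_label', 'value', 'inequality') are illustrative assumptions, not the actual dict or schema from the curation code linked above:

```python
# Illustrative sketch of how a value + inequality could be assigned per record.
# Mapping values and column names are assumptions; the real dict and schema
# live in the mhcflurry curation code referenced above.
QUALITATIVE_TO_AFFINITY_AND_INEQUALITY = {
    "Positive-High": (100.0, "<"),
    "Positive-Intermediate": (1000.0, "<"),
    "Positive-Low": (5000.0, "<"),
    "Positive": (500.0, "<"),
    "Negative": (5000.0, ">"),
}

MASS_SPEC_AFFINITY_NM = 500.0  # mass spec hits are treated as "< 500 nM"


def assign_value_and_inequality(row):
    """Return (measurement_value in nM, inequality) for one training record."""
    if row["source"] == "mass_spec":
        return MASS_SPEC_AFFINITY_NM, "<"
    if row["source"] == "kim2014":
        # Keep whatever inequality the Kim 2014 benchmark already provides.
        return row["value"], row["inequality"]
    if row["source"] == "iedb_qualitative":
        return QUALITATIVE_TO_AFFINITY_AND_INEQUALITY[row["qualitative_label"]]
    # Quantitative IEDB measurements: value as given, inequality "=".
    return row["value"], "="
```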
I think this answers both your questions, since the = 20000 values you referred to in question 1 are coming from IEDB (I spot-checked a few, but let me know if you disagree).
Does IEDB expose inequalities for e.g. these Buus lab 20,000 nM measurements? If there's a way to incorporate inequalities for these measurements, it could be worth a try to see if it improves the models. In practice, however, I don't think a measurement of = 20000 nM vs. > 20000 nM would affect the predictors significantly.
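As a rough illustration of why: on a transformed 0-1 regression scale, a 20,000 nM measurement already sits near the bottom, so treating it as an upper bound rather than an exact value changes the error only slightly. The snippet below is a simplified sketch of inequality-aware squared error, not the exact loss mhcflurry uses:

```python
import numpy as np

MAX_IC50 = 50000.0  # common cap when mapping IC50 (nM) to a 0-1 regression target


def to_regression_target(ic50_nm):
    """Map IC50 in nM to [0, 1]; higher means stronger binding."""
    return 1.0 - np.log(np.clip(ic50_nm, 1.0, MAX_IC50)) / np.log(MAX_IC50)


def censored_squared_error(pred, target, inequality):
    """Squared error that only penalizes predictions violating the inequality."""
    if inequality == "=":
        return (pred - target) ** 2
    if inequality == "<":  # true IC50 below target nM: target is a lower bound on the 0-1 scale
        return max(0.0, target - pred) ** 2
    if inequality == ">":  # true IC50 above target nM: target is an upper bound on the 0-1 scale
        return max(0.0, pred - target) ** 2
    raise ValueError(inequality)


t = to_regression_target(20000.0)  # about 0.085
for pred in (0.3, 0.05):
    # Identical error for predictions stronger than 20,000 nM; at most a
    # tiny (< 0.01) difference for predictions weaker than it.
    print(pred, censored_squared_error(pred, t, "="), censored_squared_error(pred, t, ">"))
```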
Hope this helps; let me know if anything is still unclear.
Tim
Thank you very much for the quick reply.
84% of Buus records from your dataset are present in IEDB, though inequality labels do not match in most cases. You consider them exact (i.e. '=') across the board, while most of them are inexact in IEDB. Here is the inequality label distribution for these records as per IEDB:
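A comparison of this kind can be sketched with pandas roughly as follows; the merge keys and the column names on both sides are assumptions, not the actual schemas of either file:

```python
import pandas as pd

# Both schemas below are assumptions for illustration and may need adjusting.
curated = pd.read_csv("curated_training_data.with_mass_spec.csv.bz2")
iedb = pd.read_csv("mhc_ligand_full.csv", skiprows=1, low_memory=False)

# Restrict to Buus lab records in the curated training set (assumed source label).
buus = curated[curated["measurement_source"].str.contains("Buus", case=False, na=False)]

# Align IEDB column names with the curated ones for the merge.
iedb_subset = iedb.rename(
    columns={"Description": "peptide", "Allele Name": "allele"}
)[["peptide", "allele", "Measurement Inequality"]].drop_duplicates(["peptide", "allele"])

merged = buus.merge(iedb_subset, on=["peptide", "allele"], how="left", indicator=True)

print("fraction of Buus records matched in IEDB:", (merged["_merge"] == "both").mean())
print(merged.loc[merged["_merge"] == "both", "Measurement Inequality"]
      .value_counts(dropna=False))
```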
I think I know what happened. The IEDB database export we used for the most recent training run (available here; downloaded from IEDB in Jan 2018) did not include measurement inequalities. We therefore used (=) for all quantitative measurements. But the IEDB database export feature ('mhc_ligand_full.zip' from here) appears to have since been updated to include measurement inequalities.
I agree using these inequalities next time we train mhcflurry predictors would make sense. Thanks for pointing this out!
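For the next curation run, something along these lines might work; the header layout and the column names ('Quantitative measurement', 'Measurement Inequality') are assumptions based on the current mhc_ligand_full export and may need adjusting:

```python
import pandas as pd

# Sketch of pulling inequalities from the updated IEDB export during curation.
iedb = pd.read_csv("mhc_ligand_full.csv", skiprows=1, low_memory=False)

# Keep only rows with a quantitative measurement.
quantitative = iedb[iedb["Quantitative measurement"].notna()].copy()

# Use the inequality IEDB reports, falling back to "=" where it is missing.
quantitative["measurement_inequality"] = (
    quantitative["Measurement Inequality"].fillna("=").str.strip()
)
print(quantitative["measurement_inequality"].value_counts())
```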
@timodonnell I'm glad this might help you improve MHCflurry. Thank you for taking the time to consider this issue.
Dear MHCflurry team,
We are currently reviewing several open-source MHC-peptide-binding affinity predictors. Our routine includes retraining, independent benchmarking and an in-depth analysis of the training and testing datasets provided by the authors. While working with MHCflurry, we've found two potential issues in the 'curated_training_data.with_mass_spec.csv.bz2' dataset you've used to train MHCflurry:
We are looking forward to your response.
Best Regards, Ilia Korvigo.