openvax / mhcflurry

Peptide-MHC I binding affinity prediction
http://openvax.github.io/mhcflurry/
Apache License 2.0

Data issues in Data_S3.csv (BA predictor training data) #238

Open elonp opened 1 week ago

elonp commented 1 week ago

Hello!

I analysed Data_S3.csv, described in your publication as the training data for your BA predictor, and found some data errors. Here's my notebook: https://gist.github.com/elonp/bcfbc4b417552d01b2b3d11896a19129 I also attach it as a PDF: inspect-mhcflurry-ba-training-data.pdf

Hopefully the analysis is useful for others. It would be wonderful if you could corroborate my analysis, guesswork and conclusions!

timodonnell commented 1 week ago

Thanks for reporting this!

Haven't gone through it in detail, but one thought: the "measurement_source" column doesn't really indicate an exact study or sample ID, unfortunately. It's something I did quickly to track where measurements come from, and it's pretty inexact - I believe it's just the last author plus the measurement type (what IEDB calls "Method"). So one possible contributor here is that we may have separate measurements of the same peptide/HLA from different studies by the same last author that end up with an identical measurement_source. In general my approach has been to show the predictor conflicting values for the same peptide/HLA whenever those occur in the training data, i.e. I have not tried to collapse duplicates or find consensus values.

Curious if this makes sense to you and, if so, how many of your observations it might explain.
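
If it's helpful, something like this (an untested sketch; I'm assuming the Data_S3.csv columns are named allele, peptide, measurement_source, and measurement_value) would surface the cases where one measurement_source label covers conflicting values for the same peptide/allele:

```python
import pandas as pd

# Untested sketch: the allele/peptide column names are my assumption about Data_S3.csv.
df = pd.read_csv("Data_S3.csv")

# Count distinct measurement_value entries per peptide/allele/measurement_source group.
counts = (
    df.groupby(["allele", "peptide", "measurement_source"])["measurement_value"]
    .nunique()
)

# Groups with more than one distinct value are candidates for separate studies
# from the same last author collapsed under a single measurement_source label.
conflicting = counts[counts > 1]
print(f"{len(conflicting)} peptide/allele/source groups with conflicting values")
print(conflicting.sort_values(ascending=False).head())
```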

timodonnell commented 1 week ago

Also just to note, all of the curation code is on GitHub (see the dirs starting with data_ in downloads-generation), and I believe the bulk of it is in this file:

https://github.com/openvax/mhcflurry/blob/master/downloads-generation/data_curated/curate.py

If you are able to identify places where we are introducing any of the issues you are seeing please let me know 🙏

The MS measurements with the wrong inequality seem pretty important to fix if the issue is common.
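
A rough way to gauge how common it is (untested; the measurement_kind column and its mass_spec value are my guess at how the file labels mass spec rows) would be something like:

```python
import pandas as pd

df = pd.read_csv("Data_S3.csv")

# Guessing that mass spec rows are labeled via a measurement_kind column;
# adjust the column/value names to whatever Data_S3.csv actually uses.
ms = df[df["measurement_kind"] == "mass_spec"]

# Assuming mass spec hits are qualitative binders, I'd expect them to carry a "<"
# inequality (true affinity at or below the assigned value), so ">" is the suspicious case.
suspicious = ms[ms["measurement_inequality"] == ">"]
print(f"{len(suspicious)} of {len(ms)} mass spec rows have a '>' inequality")
print(suspicious[["allele", "peptide", "measurement_source", "measurement_value"]].head())
```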

elonp commented 1 week ago

I think your approach of allowing conflicting measurements in the training data makes sense.

Thanks for the pointer to the curation code. I will try to review it to see if it explains some of my findings.

I wonder if some of the issues are actually upstream of your curation. I find the definition of measurement_inequality a bit confusing: does < mean measurement_value is an upper bound (i.e. real_value < measurement_value) or a lower bound (i.e. measurement_value < real_value)? I had to go over your loss code to verify it's the former, and I'm still not sure. So I wonder whether some of the upstream data authors interpreted it the wrong way. There is evidence for this in duplicate sets of affinity measurements reported with both < and >, and my concern is that the same thing may have happened in non-duplicate entries, where we cannot detect it.
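
For concreteness, here is a toy version of an inequality-aware squared error with the semantics I think your loss implements (this is my reading, not your actual code, and it works in raw nM purely for illustration):

```python
def toy_inequality_squared_error(measurement_value, prediction, inequality):
    """Toy loss illustrating the semantics I'm assuming:
    '<' : the true value is at or below measurement_value, so predictions at or
          below the bound incur no penalty;
    '>' : the mirror image;
    '=' : plain squared error."""
    diff = prediction - measurement_value
    if inequality == "<":
        diff = max(diff, 0.0)   # penalize only predictions above the upper bound
    elif inequality == ">":
        diff = min(diff, 0.0)   # penalize only predictions below the lower bound
    return diff ** 2

print(toy_inequality_squared_error(500.0, 100.0, "<"))  # 0.0 -- consistent with "< 500"
print(toy_inequality_squared_error(500.0, 900.0, "<"))  # 160000.0 -- violates the bound
```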

elonp commented 5 days ago

I updated the notebook with: (a) the number of records affected by each of the data issues reported in the summary, and (b) further analysis leading to the retraction of one of the minor issues.