How to pre-process duplicate peptides in datasets?

openvax / mhcflurry

Peptide-MHC I binding affinity prediction

http://openvax.github.io/mhcflurry/

Apache License 2.0

How to pre-process duplicate peptides in datasets? #129

Closed: 0x1orz closed this issue 5 years ago

0x1orz commented 6 years ago

The curated dataset contains many duplicate peptides; they make up more than half of the entries. The duplicated peptides have differing measurement_values.

timodonnell commented 6 years ago

Yes, the uncertainty in the measurement values can be quite high. In the current version we train directly on all the peptides, duplicates and all. When computing validation accuracy on held-out data, this requires being careful to remove any peptides that appear in both the train and test sets.
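For readers who want to try this, here is a minimal pandas sketch of removing the train/test overlap. It is not mhcflurry's own code: the helper name `drop_train_test_overlap` is hypothetical, and the column name `peptide` is an assumption about how the data frame is laid out.

```python
import pandas as pd


def drop_train_test_overlap(train, test, keys=("peptide",)):
    """Return a copy of `test` without rows whose key also occurs in `train`.

    Hypothetical helper; `keys` names the columns that identify a peptide
    (assumed here to be a single "peptide" column).
    """
    keys = list(keys)
    # De-duplicate the training keys so the left merge cannot multiply test rows.
    train_keys = train[keys].drop_duplicates()
    merged = test.merge(train_keys, on=keys, how="left", indicator=True)
    # "left_only" marks test rows whose key was not found in the training set.
    keep = merged["_merge"] == "left_only"
    return merged.loc[keep, test.columns].reset_index(drop=True)


# Toy data, invented purely for illustration.
train = pd.DataFrame({
    "peptide": ["SIINFEKL", "SIINFEKL", "GILGFVFTL"],
    "measurement_value": [100.0, 150.0, 20.0],
})
test = pd.DataFrame({
    "peptide": ["SIINFEKL", "NLVPMVATV"],
    "measurement_value": [120.0, 500.0],
})
clean_test = drop_train_test_overlap(train, test)
```

Passing `keys=("allele", "peptide")` would instead treat a peptide as overlapping only when it was measured against the same allele; which definition is appropriate depends on the evaluation being run.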

In earlier versions we've tried grouping by peptide and taking the geometric mean or median, but anecdotally I haven't seen that make a big difference.
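If you want to experiment with collapsing duplicates yourself, the grouping described above could be sketched roughly as follows. This is an illustration rather than the code used in earlier mhcflurry versions: the helper name `collapse_duplicates` is hypothetical, the column names `allele`, `peptide`, and `measurement_value` are assumptions, and grouping by allele as well as peptide is a choice made here, not something stated in the thread.

```python
import numpy as np
import pandas as pd


def collapse_duplicates(df, how="geometric_mean", keys=("allele", "peptide")):
    """Collapse repeated measurements into one row per key.

    Hypothetical helper. Assumes all measurement values are strictly
    positive, since the geometric mean is computed through logarithms.
    """
    # Geometric mean = exp(mean(log(x))), so work on a log-transformed column.
    work = df.assign(log_value=np.log(df["measurement_value"]))
    grouped = work.groupby(list(keys), sort=False)
    if how == "geometric_mean":
        collapsed = np.exp(grouped["log_value"].mean())
    elif how == "median":
        collapsed = grouped["measurement_value"].median()
    else:
        raise ValueError(f"unknown method: {how!r}")
    return collapsed.rename("measurement_value").reset_index()


# Toy data, invented purely for illustration.
data = pd.DataFrame({
    "allele": ["HLA-A*02:01"] * 3,
    "peptide": ["SIINFEKL", "SIINFEKL", "GILGFVFTL"],
    "measurement_value": [100.0, 400.0, 20.0],
})
by_geomean = collapse_duplicates(data, how="geometric_mean")
by_median = collapse_duplicates(data, how="median")
```

The geometric mean is the natural average when measurements span several orders of magnitude, because it averages on a log scale; the median is less sensitive to a single outlying measurement.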