rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/

Feature: automatic identification/labelling of contaminant peptides #191

Open cvanderaa opened 1 year ago

cvanderaa commented 1 year ago

Data sets may lack information about contaminant peptides when the user did not provide a contaminant database during raw data identification. We could provide functionality to automatically label peptides that map to a contaminant protein.

Contaminant proteins could be retrieved as described here. Once the function gets the list of contaminants, there could be two options (we could implement one of the two or both):

  1. Use the protein ID (UniProt ID?) to match peptides that are mapped to these proteins. Drawback: UniProt IDs (or I guess any ID/naming system) are subject to change, which may compromise the matching between the ID in the data set and the ID in the contaminant database.
  2. Retrieve the contaminant protein sequences and perform peptide alignment on these sequences. Drawback: it's a bit more complicated to implement. Also, should we consider polymorphisms during alignment?
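To make the two options concrete, here is a minimal sketch in Python, for illustration only (an implementation in QFeatures would be written in R). The function names, the ";" separator for protein groups, and the toy IDs and sequences are all assumptions, not anything from an existing API. Option 2 is reduced to exact substring search, so it does not handle polymorphisms.

```python
def match_by_id(peptide_proteins, contaminant_ids, sep=";"):
    """Option 1: flag a peptide if any protein ID it maps to is a contaminant.

    `sep` is an assumed separator for peptides mapped to several proteins.
    """
    contaminants = set(contaminant_ids)
    return [
        any(pid.strip() in contaminants for pid in ids.split(sep))
        for ids in peptide_proteins
    ]


def match_by_sequence(peptide_seqs, contaminant_seqs):
    """Option 2: flag a peptide if it occurs verbatim in a contaminant sequence.

    Exact matching only; sequence variants would need a real aligner.
    """
    return [
        any(pep in prot for prot in contaminant_seqs)
        for pep in peptide_seqs
    ]


# Toy data (placeholder IDs and sequences, not real database entries).
id_flags = match_by_id(["P1;P2", "P3", "P4"], ["P2", "P4"])
seq_flags = match_by_sequence(["ABC", "XYZ"], ["QQABCQQ", "DEFGH"])
```

Option 1 only needs the protein column already present in the data, while option 2 needs the peptide sequences plus the contaminant sequences, but does not depend on the ID system staying stable.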

Once the contaminant peptides are identified, we could add a logical column (e.g. isContaminant) to the rowData.
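A rough sketch of what the labelling step could look like, again in Python purely for illustration (the real thing would operate on the rowData in R). The row table is modelled as a list of dicts; the "protein" column name, the function name, and the toy rows are assumptions.

```python
def add_contaminant_flag(rows, contaminant_ids,
                         protein_col="protein", flag_col="isContaminant"):
    """Return a copy of `rows` with a logical contaminant column added."""
    contaminants = set(contaminant_ids)
    return [
        {**row, flag_col: row[protein_col] in contaminants}
        for row in rows
    ]


# Toy feature annotations (placeholder IDs and sequences).
rows = [
    {"peptide": "ABC", "protein": "P1"},
    {"peptide": "DEF", "protein": "P2"},
]
labelled = add_contaminant_flag(rows, ["P2"])

# Downstream, the flag can be used to drop contaminant features.
kept = [row for row in labelled if not row["isContaminant"]]
```

Storing the result as a column rather than filtering right away leaves the decision to remove contaminants to the user.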