lgatto / Pbase

Manipluating and exploring protein and proteomics data
8 stars 3 forks source link

Calculate heavy labeled peptides #6

Closed sgibb closed 9 years ago

sgibb commented 10 years ago

This PR implements @pavel-shliaha's feature request https://github.com/sgibb/cleaver/issues/5 in Pbase. It is WIP and I want to discuss some points before thinking about merging this feature into the master branch.

The workhorse function .addOverhang works quite well. Based on @pavel-shliaha's testfiles: proteins_of_interest.fasta and result_26_09_13.csv I compared the output of .addOverhang (using test_heavylabels.R; I will write a proper unit test if all questions are answered). I get an overlap of 99%. For only 5 peptides (out of 389) I get different results. But IMHO it is because @pavel-shliaha looks for K or R to cleave and I am looking for K or R not followed by P (I used this simpler rule instead of the default cleaver rule because it seems that the input was cleaved with this rule but the implementation of .addOverhang is totally independent of any cleaving rule). @pavel-shliaha could you please verify that these 5 peptides are correct:

 FVTTDTR NOT equal
Pavel's list:
    Peptide N_overhang C_overhang          spikeTideResult        spikeTide
283 FVTTDTR GPAKIVRPTK       RTTQ both_overhangs_shortened RPTK.FVTTDTR.RTT

Pbase result:
          Peptide N_overhang C_overhang      spikeTideResult         spikeTide
AT5G60920 FVTTDTR       RPTK       RTTQ fully_representative RPTK.FVTTDTR.RTTQ

###############################################################################

 IDLVIAK NOT equal
Pavel's list:
   Peptide N_overhang C_overhang      spikeTideResult          spikeTide
57 IDLVIAK     IPGKPR        LPC fully_representative IPGKPR.IDLVIAK.LPC

Pbase result:
          Peptide N_overhang C_overhang      spikeTideResult        spikeTide
AT1G77130 IDLVIAK       GKPR        LPC fully_representative GKPR.IDLVIAK.LPC

###############################################################################

 VLTAYLK NOT equal
Pavel's list:
   Peptide N_overhang C_overhang      spikeTideResult          spikeTide
37 VLTAYLK       EAGK      GRPAF fully_representative EAGK.VLTAYLK.GRPAF

Pbase result:
          Peptide N_overhang C_overhang      spikeTideResult        spikeTide
AT1G74380 VLTAYLK       EAGK        GRP fully_representative EAGK.VLTAYLK.GRP

###############################################################################

 YFNPAFYLR NOT equal
Pavel's list:
      Peptide N_overhang C_overhang      spikeTideResult              spikeTide
350 YFNPAFYLR       GIWK    RPRRLAL fully_representative GIWK.YFNPAFYLR.RPRRLAL

Pbase result:
            Peptide N_overhang C_overhang      spikeTideResult          spikeTide
AT3G21160 YFNPAFYLR       GIWK        RPR fully_representative GIWK.YFNPAFYLR.RPR

###############################################################################

 IGGGGEGSWFR NOT equal
Pavel's list:
        Peptide N_overhang C_overhang      spikeTideResult              spikeTide
388 IGGGGEGSWFR     DFLKPR        CFY fully_representative DFLKPR.IGGGGEGSWFR.CFY

Pbase result:
              Peptide N_overhang C_overhang      spikeTideResult            spikeTide
AT5G55500 IGGGGEGSWFR       LKPR        CFY fully_representative LKPR.IGGGGEGSWFR.CFY

All but the workhorse function .addOverhang is just crap. dummyAddOverhang is just for testing purposes and should be replaced by a vectorized wrapper function. But therefore we need to fix #5 first. To summarize it: the logic is done but the interface is missing.

How to create the user interface/API?

  1. What is the input of this function? A Pbase object (containing the fasta file) and a named list of target peptides? It is not very intuitive to call something like calculateHeavyLabels(pbaseObject, peptides=c(AT5G55500="ABCDEF", AT5G55500="GHIJKLM")).
  2. What is the output of this function? Do we store it in the Pbase object? Do we just return a data.frame similar to @pavel-shliaha's result_26_09_13.csv?
lgatto commented 10 years ago

How to create the user interface/API? What is the input of this function? A Pbase object (containing the fasta file) and a named list of target peptides? It is not very intuitive to call something like calculateHeavyLabels(pbaseObject, peptides=c(AT5G55500="ABCDEF", AT5G55500="GHIJKLM")). What is the output of this function? Do we store it in the Pbase object? Do we just return a data.frame similar to @pavel-shliaha's result_26_09_13.csv?

What will the function be used for? If it is only to get the sequences, not need for a Proteins object. If there is a need for visualisation, comparing against other references or data sets, ... Proteins is probably an option to consider. But I would suggest a data.frame for now.

pavel-shliaha commented 10 years ago

Sebastian is right that he should be looking for R or K, but not followed by P to extend the overhangs. For example

AAAK.AAAAK.AKA

C overhang needs to be extended while

AAAK.AAAAK.AKP

C overhang does not need to be extended. This indeed explains all the differences.

unfortunately I would like to suggest additional small change (sorry about that). One of my colleagues has asked me to perform PRM on non-proteotypic peptide. The current implementation of .addOverhang will report is as non-prototypic. Could the function be altered slightly to have additional parameter: protein = NULL. So if the user leaves this as NULL and there are two proteins in the database sharing this peptide the function will return "non-proteotypic", but if the user specifies protein 1 for example the function will just work with protein 1, and not even consider the second one. I know it is quite strange to do an MRM on non-proteotypic peptide, but sometimes the database has lots of entries for the same protein (Swiss-Prot and Trembl) or several isoforms or closely related proteins. . The problem is how to reference a protein. Since protein names are usually complicated, having a sort of functional description in them:

sp|B1X797|6PGL_ECODH 6-phosphogluconolactonase OS=Escherichia coli (strain K12 / DH10B) GN=pgl PE=3 SV=1

I suggest that the protein is referenced as protein = "B1X797" here and the function looks for a protein which name has a pattern of "B1X797" using for example grep ()