Closed sgibb closed 9 years ago
How to create the user interface/API? What is the input of this function? A Pbase object (containing the fasta file) and a named list of target peptides? It is not very intuitive to call something like calculateHeavyLabels(pbaseObject, peptides=c(AT5G55500="ABCDEF", AT5G55500="GHIJKLM")). What is the output of this function? Do we store it in the Pbase object? Do we just return a data.frame similar to @pavel-shliaha's result_26_09_13.csv?
What will the function be used for? If it is only to get the sequences, not need for a Proteins
object. If there is a need for visualisation, comparing against other references or data sets, ... Proteins
is probably an option to consider. But I would suggest a data.frame
for now.
Sebastian is right that he should be looking for R or K, but not followed by P to extend the overhangs. For example
AAAK.AAAAK.AKA
C overhang needs to be extended while
AAAK.AAAAK.AKP
C overhang does not need to be extended. This indeed explains all the differences.
unfortunately I would like to suggest additional small change (sorry about that). One of my colleagues has asked me to perform PRM on non-proteotypic peptide. The current implementation of .addOverhang will report is as non-prototypic. Could the function be altered slightly to have additional parameter: protein = NULL. So if the user leaves this as NULL and there are two proteins in the database sharing this peptide the function will return "non-proteotypic", but if the user specifies protein 1 for example the function will just work with protein 1, and not even consider the second one. I know it is quite strange to do an MRM on non-proteotypic peptide, but sometimes the database has lots of entries for the same protein (Swiss-Prot and Trembl) or several isoforms or closely related proteins. . The problem is how to reference a protein. Since protein names are usually complicated, having a sort of functional description in them:
sp|B1X797|6PGL_ECODH 6-phosphogluconolactonase OS=Escherichia coli (strain K12 / DH10B) GN=pgl PE=3 SV=1
I suggest that the protein is referenced as protein = "B1X797" here and the function looks for a protein which name has a pattern of "B1X797" using for example grep ()
This PR implements @pavel-shliaha's feature request https://github.com/sgibb/cleaver/issues/5 in Pbase. It is WIP and I want to discuss some points before thinking about merging this feature into the master branch.
The workhorse function
.addOverhang
works quite well. Based on @pavel-shliaha's testfiles: proteins_of_interest.fasta and result_26_09_13.csv I compared the output of.addOverhang
(using test_heavylabels.R; I will write a proper unit test if all questions are answered). I get an overlap of 99%. For only 5 peptides (out of 389) I get different results. But IMHO it is because @pavel-shliaha looks forK
orR
to cleave and I am looking forK
orR
not followed byP
(I used this simpler rule instead of the defaultcleaver
rule because it seems that the input was cleaved with this rule but the implementation of.addOverhang
is totally independent of any cleaving rule). @pavel-shliaha could you please verify that these 5 peptides are correct:All but the workhorse function
.addOverhang
is just crap.dummyAddOverhang
is just for testing purposes and should be replaced by a vectorized wrapper function. But therefore we need to fix #5 first. To summarize it: the logic is done but the interface is missing.How to create the user interface/API?
calculateHeavyLabels(pbaseObject, peptides=c(AT5G55500="ABCDEF", AT5G55500="GHIJKLM"))
.data.frame
similar to @pavel-shliaha's result_26_09_13.csv?