statisticalbiotechnology / diffacto


Enzyme cleavage rules for peptide-to-protein mapping #7

Closed markmipt closed 6 years ago

markmipt commented 6 years ago

Hi all,

I've noticed that the mapping of peptides to protein sequences (the _map_seq function) is done in a simple way, without taking any cleavage rules into account. I ran a theoretical calculation of this mapping on the SwissProt database and found that ~10% of the matches for peptides of length >= 6 are sequences that occur within a protein but are not in that protein's list of tryptic peptides. This means that some "false" peptides may be used for quantitation.

Also, such simple mapping increases calculation time exponentially as the number of peptides in the analysis grows.

So, I'm not sure, but it seems that diffacto's efficiency and performance could be improved by adding cleavage rules to _map_seq. Of course, all of this affects only the cases where the user does not have a "protein ID(s)" column in the input file.
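For illustration only, here is a minimal sketch of what cleavage-aware mapping could look like. It is not diffacto's _map_seq and not the code mentioned in this thread. The function names (cleavage_sites, digest, build_index, map_peptides), the trypsin rule used (cleave after K or R unless the next residue is P), and the default parameters are all assumptions made for the example. Digesting each protein once and storing the peptides in a dictionary also means each observed peptide is resolved by a single lookup rather than by scanning every protein sequence.

```python
from collections import defaultdict


def cleavage_sites(sequence):
    """Return cut positions for an assumed trypsin rule:
    cleave after K or R, except when the next residue is P."""
    sites = [0]
    for i in range(1, len(sequence)):
        if sequence[i - 1] in "KR" and sequence[i] != "P":
            sites.append(i)
    sites.append(len(sequence))
    return sites


def digest(sequence, missed_cleavages=0, min_length=6):
    """Yield in-silico tryptic peptides, allowing up to
    `missed_cleavages` uncut sites inside each peptide."""
    sites = cleavage_sites(sequence)
    for i in range(len(sites) - 1):
        last = min(i + 2 + missed_cleavages, len(sites))
        for j in range(i + 1, last):
            peptide = sequence[sites[i]:sites[j]]
            if len(peptide) >= min_length:
                yield peptide


def build_index(proteins, missed_cleavages=2, min_length=6):
    """Map each theoretical peptide to the set of protein IDs producing it.
    `proteins` is a dict of {protein_id: sequence}."""
    index = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in digest(sequence, missed_cleavages, min_length):
            index[peptide].add(protein_id)
    return index


def map_peptides(peptides, index):
    """Look up each observed peptide; unmatched peptides map to []."""
    return {pep: sorted(index.get(pep, ())) for pep in peptides}


# Made-up sequence, used only to show the behaviour.
proteins = {"P1": "MAAAAAKGGGGGGRPLLLLLLKTTTTTT"}
index = build_index(proteins, missed_cleavages=0)
print(map_peptides(["MAAAAAK", "AAAAAKG"], index))
```

In this toy example, MAAAAAK is a tryptic peptide of P1 and is mapped to it, while AAAAAKG occurs in the sequence as a substring but spans a cleavage site, so it is left unmapped. A plain substring search would have accepted both.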

I have Python code for enzyme mapping in my own projects, so it would be easy to implement here if you think it would be useful.

Regards, Mark

userbz commented 6 years ago

Hi Mark, you raise an important question about protein inference, which is an interesting topic in itself. Indeed, the _map_seq function is oversimplified. As you mentioned, it is only applied when protein identity information is missing, for example for an unknown mixture of proteomes. On the other hand, my thinking was that the original purpose of Diffacto is, to some extent, to tolerate incorrect peptide identities. One demonstration, described in the original paper, is that even an approach combining de novo sequencing with BLAST can still yield reasonable results.

Of course, we would be grateful if you could suggest a revision of the function for better performance and robustness.

Best regards, Bo