Support for re-mapping of peptides to proteins based on FASTA file

jgriss commented 7 years ago

Hi David,

Is there a method in maspy to map identified peptides to proteins based on a FASTA file? Ie. redo the mapping done by the search engine?

Thanks!

Johannes

hollenstein commented 7 years ago

Hi, yes this is actually quite simple.

You can use the function maspy.proteindb.importProteinDatabase() to import a protein database from a fasta and perform an in silico digestion. You can specify enzyme specificity with regex patterns (most are predefined and can be found in maspy.constants.expasy_rules). In addition you should specify some of the search settings; whether protein n-terminal methionine removal should be considered, min and max length of peptides, number of allowed missed cleavage events and if leucine and isoleucine should be treated as indistinguishable.

Then you can access an peptide entry by the sequence and retrieve the protein ids. The peptide entries are instances of the maspy.proteindb.PeptideSequence class.

For example:

import maspy.proteindb
proteindb = maspy.proteindb.importProteinDatabase(
    fastaPath, minLength=6, maxLength=50, contaminationTag='[CONT]'
)

peptideEntry = proteindb.peptides['PEPTIDEK']
proteinIds = peptideEntry.proteins

jgriss commented 7 years ago

Hi,

Thanks a lot for the quick answer!

One additional question: the "contaminationTag" parameter: Is this parameter always expected to preceed the protein description line or can it occur anywhere in it?

For example:

>CONTAM_sp|P1234|Some contaminant

vs.

>sp|P12324|CONTAM_ Some contaminat

Thanks!

Johannes

hollenstein commented 7 years ago

Hi, the main purpose of the "contaminationTag" parameter is to allow proper parsing of the fasta header line. Fasta headers are parsed automatically using a function from Pyteomics, which recognizes and parses fasta headers in the formats UniProtKB, UniRef, UniParc and UniMES. This does not work if the whole header is preceded by a tag, like in your example "CONTAM_". Thus, if a header starts with this tag, it is removed for parsing and then added back to the id string. Your second example should be no problem for parsing, because it only changes the protein gene id, but not the overall structure of the header.

It is still possible that parsing does not work and an exception is thrown, if some of the fasta file entries have an unrecognized fasta header. In this case you have two options. First you can simple set the parameter "forceId" to true, which uses the whole header string as proteinId, if it can't be parsed. Or second you can define your own fasta parsing function, which could look something like the function maspy.proteindb.fastaParseSgd().

I'm currently refactoring this module, and I realized that even if a protein entry is recognized as a contaminant this information is not stored anywhere.

Maybe I will add a variable to the ProteinSequence class or define a set of contaminant proteinIds in the ProteinDatabse class in the future.

Best regards David

jgriss commented 7 years ago

Hi,

Thanks a lot! This also answers my second question whether this information (that the protein is a contaminant) is stored somewhere.

Hopefully last question: Do you store the "decoy" status of proteins somewhere?

Reason for my question is that my fasta header contains the decoy string as part of the accession. Otherwise, many other tools that expect a certain header style as you described break.

>tr|REVERSED_Q3AR72|Some protein name

hollenstein commented 7 years ago

Hi, at the moment the decoyTag parameter is treated exactly as the contaminationTag parameter. Actually the decoy tag is expected to preceed a contamination tag... So, no it is not stored and not even recognized.

I think your decoy tag already breaks the pyteomics parsing, altough it does not throw an exception =)

I think that after the second "|" it expects an entry with an underline character. This is how your header is parsed:

maspy.proteindb._extractFastaHeader('>tr|REVERSED_Q3AR72|Some protein name')
{'description': 'Some protein name',
 'gene': 'REVERSED_Q3AR72',
 'gene_id': 'REVERSED',
 'id': 'tr',
 'taxon': 'Q3AR72'}

Which is actually a problem, because the "id" key is used to store the protein, i.e. in your case there will be multiple entries with "tr" as proteinId that overwrite each other...

Since there is no common way of indicating decoy sequences it is difficult to come up with a common parsing function. You could use a format like this, which does not break the parsing (it seems that the "OS=something" is necessary for proper parsing).

maspy.proteindb._extractFastaHeader('>tr|REVERSED_Q3AR72|GeneId_Taxon Some protein name OS=Escherichia')
{u'OS': u'Escherichia',
 'db': u'tr',
 'entry': u'GeneId_Taxon',
 'gene_id': u'GeneId',
 'id': u'REVERSED_Q3AR72',
 'name': u'Some protein name',
 'taxon': u'Taxon'}

Anyway, if the parsing works you could just set a flag manually like this:

for proteinId in proteindb.proteins:
    proteinEntry = proteindb.proteins[proteinId]
    if proteinId.startswith('REVERSED_'):
        proteinEntry.isDecoy = True
    else:
        proteinEntry.isDecoy = False

jgriss commented 7 years ago

Thanks! In this case I'll stick with my own FASTA parser for now :-)

hollenstein commented 7 years ago

Hi Johannes!

As I mentioned before I'm currently refactoring this module. At the moment the new version of the module is called "_proteindb_refactoring.py".

I've just updated the function importProteinDatabase() to support identifying decoy and contaminants by looking for the specified tags in the whole fasta header sequence. If a decoy or contamination tag is found, the ProteinEntry variable ".isDecoy" or ".isContaminant" is set to True.

I guess this should provide the functionality you were looking for. I've also added another custom fasta header parser (adapted from your function from spectra-cluster-py), in case the pyteomics standard parser does not work.

I would be happy if you could give it a try and provide some feedback if this works as intended.

Best regards, David

jgriss commented 7 years ago

Hi David,

I'm now testing your _proteindb_reactoring package and am getting this exception:

  File "/home/jg/Projects/maspy/maspy/_proteindb_refactoring.py", line 318, in importProteinDatabase
    for fastaHeader, sequence in _readFastaFile(filePath):
  File "/home/jg/Projects/maspy/maspy/_proteindb_refactoring.py", line 427, in _readFastaFile
    line = openfile.next()
AttributeError: '_io.TextIOWrapper' object has no attribute 'next'

The reason for this error seems to be that the next method was removed in Python 3.x

jgriss commented 7 years ago

This seems to fix it for me:

diff --git a/maspy/_proteindb_refactoring.py b/maspy/_proteindb_refactoring.py
index 52d33af..f524049 100644
--- a/maspy/_proteindb_refactoring.py
+++ b/maspy/_proteindb_refactoring.py
@@ -421,15 +421,19 @@ def _readFastaFile(filepath):
     """
     processSequences = lambda i: ''.join([s.rstrip() for s in i]).rstrip('*')
     processHeaderLine = lambda line: line[1:].rstrip()
-    with io.open(filepath, 'r') as openfile:
+
+    with open(filepath, "r") as openfile:
         #Iterate through lines until the first header is encountered
-        try:
-            line = openfile.next()
-            while line[0] != '>':
-                line = openfile.next()
-            header = processHeaderLine(line)
-            sequences = list()
-        except StopIteration:
+        header_found = False
+
+        for line in openfile:
+            if line[0] == '>':
+                header_found = True
+                header = processHeaderLine(line)
+                sequences = list()
+                break
+
+        if not header_found:
             errorText = 'File does not contain fasta entries.'
             raise maspy.errors.FileFormatError(errorText)
         #Read remaining lines

hollenstein commented 7 years ago

Hi, sorry for this stupid mistake, thanks for the report.

I completely forgot to test the module with python 3, I'll do this later.

For now I've updated to code for Python 2 and 3 compatibility from

openfile.next()

to

next(openfile)

jgriss commented 7 years ago

Hi,

Everything working perfectly now. Together with your answer in #2 everything's settled for me.

hollenstein / maspy

Support for re-mapping of peptides to proteins based on FASTA file #1