levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

issues in mzTab when description of protein contains [] #3

Closed ypriverol closed 4 years ago

ypriverol commented 4 years ago

@levitsky:

We are using pyteomics in some of our pipelines. I have seen that when a protein description contains [ ], the current mzTab parser fails to read the description field.

['PRT', 'sp|Q15119|PDK2_HUMAN', '[Pyruvate dehydrogenase (acetyl-transferring)] kinase isozyme 2, mitochondrial OS=Homo sapiens OX=9606 GN=PDK2 PE=1 SV=2', 'null', 'null', 'uniprot-homo-sapiens_decoy.fasta', 'null', 'null', '0.038215846688334', 'sp|Q15119|PDK2_HUMAN', '33-UNIMOD:35', '0.208845208845209', '9.72674e06', '8.77109875e05', '0.0', '0.0', '0.0', '7.632018e06', '0.0', '0.0', '0.0', '9.72674e06', 'null', 'null', '8.77109875e05', 'null', 'null', '0.0', 'null', 'null', '0.0', 'null', 'null', '0.0', 'null', 'null', '7.632018e06', 'null', 'null', '0.0', 'null', 'null', '0.0', 'null', 'null', '0.0', 'null', 'null', 'indistinguishable_proteins']

It fails on the following line:

cv, acc, name, value = re.split(r"\s*,\s*", tuplet[1:-1])

levitsky commented 4 years ago

Hi Yasset,

I am not sure that the code you are showing applies to that line. Would you be able to share a minimal example of a problematic file (or not minimal, depending on how sensitive the data are)?

mobiusklein commented 4 years ago

I think I see the failure path. In _cast_value (https://github.com/levitsky/pyteomics/blob/master/pyteomics/mztab.py#L81-L82), there is an indiscriminate decision that if something starts with a "[", it must be a param group. One solution is to try to parse the value as a parameter and, if that fails, return the string unparsed. Another is to somehow inject the name of the column being parsed, so we know whether that column can hold a param in that context. The former is easier to do for now.
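
A rough sketch of the "try, then fall back" option, for illustration only. This is not the actual pyteomics code: `parse_param` and `cast_value` are hypothetical names, and the four-field `[cv, accession, name, value]` layout is assumed from the split line quoted in the report.

```python
import re


def parse_param(text):
    """Parse '[cv, accession, name, value]' into a dict (hypothetical helper).

    Raises ValueError if the text is not a well-formed four-field param.
    """
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("not a bracket-delimited param: %r" % text)
    fields = re.split(r"\s*,\s*", text[1:-1])
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    cv, acc, name, value = fields
    return {"cv": cv, "accession": acc, "name": name, "value": value}


def cast_value(text):
    """Return a parsed param when possible, otherwise the original string."""
    text = text.strip()
    if text.startswith("["):
        try:
            return parse_param(text)
        except ValueError:
            # Looks like a param but is not one (e.g. a protein description
            # that merely begins with "["), so keep it as plain text.
            return text
    return text


description = (
    "[Pyruvate dehydrogenase (acetyl-transferring)] kinase isozyme 2, "
    "mitochondrial OS=Homo sapiens OX=9606 GN=PDK2 PE=1 SV=2"
)
print(cast_value(description))                      # description comes back unchanged
print(cast_value("[MS, MS:1001207, Mascot, 2.3]"))  # parsed into a dict
```

With this approach the description from the report survives as a plain string, while a well-formed param is still parsed. The cost is that a malformed param is silently kept as text instead of raising an error.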

levitsky commented 4 years ago

@ypriverol This issue got auto-closed by merging #4, but please drop a comment on whether it's actually fixed for you now.

@mobiusklein Thank you for stepping in!