mzident parser - Githubissues

jschmacht commented 3 years ago

Hi, we would like to use the pyteomics mzident parser for our metaproteomic software Prophane. We require the following infomation about protein groups (“ProteinAmbigousGroup”):

the protein accessions per protein group
the number and spectra identifiers of PSMs assigned to each accession

This info is in the following mzIdent sections:

the “DBSequence.accession” of the “ProteinDetectionHypothesis” from the “ProteinAmbgiousGroup”
the spectrumID of the “SpectrumIdentificationResult” belonging to the “SpectrumIdentificationItem”
- ProteinDetectionHypothesis.passThreshold, cvParam "protein group passes threshold" from “ProteinAmbigousGroup”, and “SpectrumIdentificationItem.passThreshold”

So far, we tried using the iterfind("ProteinAmbgiousGroup") function of the pyteomics.mzid.MzIdentML class but ran into the following issues:

it takes too long for large mzident files (>2GB) (several hours), probably due to a lot of not needed information being extracted
it is impossible to get the ‘spectrumID’ of the “SpectrumIdentificationResults” parent of the “SpectrumIdentificationItems”.

Any ideas how to address this are highly appreciated. Is it possible to adapt the pyteomics.mzid.MzIdentML for our task?

mobiusklein commented 3 years ago

Are you instantiating your MzIdentML object with mzid.read? If so, you need to explicitly pass use_index=True. This will perform a fast pre-scan of the XML file, building a byte offset index to speed up future random access reads. This primarily matters for retrieving referenced entities, but should recently have also improved iterfind's speed too. I'm not sure if that iterfind optimization has been released yet though it is already in master.

If you don't want to expand all referenced tags, you can pass retrieve_refs=False to reduce the number of extra queries the reader performs. You're correct, it's impossible to reach "up" the tree when retrieve_refs pulls in a reference to another tag.

The best approach would be to perform two passes:

from pyteomics import mzid

# Set retrieve_refs to False to reduce extraneous work by not retrieving all references automatically.
reader = mzid.MzIdentML("path/to/mzid", retrieve_refs=False)
# This should be True by default.
assert reader._use_index

spectrum_id_from_sii = {}
for ident_result in reader.iterfind("SpectrumIdentificationResult"):
    for sii in ident_result['SpectrumIdentificationItem']:
        spectrum_id_from_sii[sii['id']] = ident_result['spectrumID']

# Reset the iterator, returning to the start of the file
reader.reset()
for pag in reader.iterfind("ProteinAmbiguityGroup"):
    prot_to_peptides = {}
    prot_pass_threshold = {}
    for pdh in pag['ProteinDetectionHypothesis']:
        accession = reader.get_by_id(pdh['dBSequence_ref']).get("accession")
        prot_pass_threshold[accession] = pdh['passThreshold']
        # Map peptide evidence to the spectra that they are assigned to
        peptide_to_spectra = {}
        for pep in pdh['PeptideHypothesis']:
            spec_passed = {}
            for ref in pep['SpectrumIdentificationItemRef']:
                sii = reader.get_by_id(ref['spectrumIdentificationItem_ref'])
                # Store spectrumID -> passThreshold for each peptide PSM
                spec_passed[spectrum_id_from_sii[sii['id']] = sii['passThreshold']
            peptide_to_spectra[pep['peptideEvidence_ref']] = spec_passed
        prot_to_peptides[accession] = peptide_to_spectra)
    # Do something with prot_to_peptides and prot_pass_threshold

mobiusklein commented 3 years ago

Did this solve the problem, @jschmacht?

jschmacht commented 3 years ago

Hello, yes that solved my problem, thank you very much. The missing information for me was the 'retrieve_refs=False'. I was wondering how to use the get_by_id function without getting the IDs. It would be great if you can add this in the documentation.

levitsky / pyteomics

mzident parser #44