Closed jschmacht closed 3 years ago
Are you instantiating your MzIdentML
object with mzid.read
? If so, you need to explicitly pass use_index=True
. This will perform a fast pre-scan of the XML file, building a byte offset index to speed up future random access reads. This primarily matters for retrieving referenced entities, but should recently have also improved iterfind
's speed too. I'm not sure if that iterfind
optimization has been released yet though it is already in master
.
If you don't want to expand all referenced tags, you can pass retrieve_refs=False
to reduce the number of extra queries the reader performs. You're correct, it's impossible to reach "up" the tree when retrieve_refs
pulls in a reference to another tag.
The best approach would be to perform two passes:
from pyteomics import mzid
# Set retrieve_refs to False to reduce extraneous work by not retrieving all references automatically.
reader = mzid.MzIdentML("path/to/mzid", retrieve_refs=False)
# This should be True by default.
assert reader._use_index
spectrum_id_from_sii = {}
for ident_result in reader.iterfind("SpectrumIdentificationResult"):
for sii in ident_result['SpectrumIdentificationItem']:
spectrum_id_from_sii[sii['id']] = ident_result['spectrumID']
# Reset the iterator, returning to the start of the file
reader.reset()
for pag in reader.iterfind("ProteinAmbiguityGroup"):
prot_to_peptides = {}
prot_pass_threshold = {}
for pdh in pag['ProteinDetectionHypothesis']:
accession = reader.get_by_id(pdh['dBSequence_ref']).get("accession")
prot_pass_threshold[accession] = pdh['passThreshold']
# Map peptide evidence to the spectra that they are assigned to
peptide_to_spectra = {}
for pep in pdh['PeptideHypothesis']:
spec_passed = {}
for ref in pep['SpectrumIdentificationItemRef']:
sii = reader.get_by_id(ref['spectrumIdentificationItem_ref'])
# Store spectrumID -> passThreshold for each peptide PSM
spec_passed[spectrum_id_from_sii[sii['id']] = sii['passThreshold']
peptide_to_spectra[pep['peptideEvidence_ref']] = spec_passed
prot_to_peptides[accession] = peptide_to_spectra)
# Do something with prot_to_peptides and prot_pass_threshold
Did this solve the problem, @jschmacht?
Hello, yes that solved my problem, thank you very much. The missing information for me was the 'retrieve_refs=False'. I was wondering how to use the get_by_id function without getting the IDs. It would be great if you can add this in the documentation.
Hi, we would like to use the pyteomics mzident parser for our metaproteomic software Prophane. We require the following infomation about protein groups (“ProteinAmbigousGroup”):
This info is in the following mzIdent sections:
So far, we tried using the iterfind("ProteinAmbgiousGroup") function of the pyteomics.mzid.MzIdentML class but ran into the following issues:
Any ideas how to address this are highly appreciated. Is it possible to adapt the pyteomics.mzid.MzIdentML for our task?