statisticalbiotechnology / maracluster

Matthew The's implementation of MaRaCluster
Apache License 2.0

maracluster consensus produced more consensus spectra than expected #29

Open JannikSchneider12 opened 5 months ago

JannikSchneider12 commented 5 months ago

Hey,

I have a problem with consensus spectrum generation. I took the output of maracluster batch, filtered it based on my criteria, and wrote a tsv file containing only the remaining spectra for which I want to create consensus spectra. I passed this file to maracluster consensus, but the resulting file (mgf in my case) contains more spectra than I would expect.

This is my code for reading in the input files, filtering based on my criteria, and writing the resulting tsv files back:

image

And here are the differences in length:

image

I also have another question. The title of each consensus spectrum is just "scan=XXX", with no parameter for the cluster index. So is the value of "scan=" the cluster index?

Thanks for your time and help

MatthewThe commented 5 months ago

One thing that could play a role here is that the consensus mgf output contains a separate copy of each consensus spectrum for every precursor it could correspond to. This is because the mgf format does not allow specifying multiple precursors per scan. This usually only happens if the input spectrum file was peak picked, so I'm not sure if it applies here.
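A rough way to check this on an existing output file is to group mgf entries that share an identical peak list and report the groups with more than one entry, which is what you would see if one consensus spectrum was written once per candidate precursor. The sketch below is plain Python with no dependencies. The helper names (`read_mgf_entries`, `group_by_peaks`) and the sample data are made up for illustration and are not part of MaRaCluster, and it assumes a plain `BEGIN IONS` / `END IONS` layout.

```python
from collections import defaultdict


def read_mgf_entries(lines):
    """Yield (headers, peaks) for each BEGIN IONS / END IONS block."""
    headers, peaks, inside = {}, [], False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            headers, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield headers, tuple(peaks)
            inside = False
        elif inside and line:
            # Header lines look like KEY=VALUE; peak lines start with a digit.
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                headers[key] = value
            else:
                peaks.append(line)


def group_by_peaks(entries):
    """Map each distinct peak list to the headers of all entries that use it."""
    groups = defaultdict(list)
    for headers, peaks in entries:
        groups[peaks].append(headers)
    return groups


# Invented sample data: the first two entries share one peak list.
example = """\
BEGIN IONS
TITLE=scan=101
PEPMASS=500.25
CHARGE=2+
100.1 10.0
200.2 20.0
END IONS
BEGIN IONS
TITLE=scan=102
PEPMASS=333.83
CHARGE=3+
100.1 10.0
200.2 20.0
END IONS
BEGIN IONS
TITLE=scan=201
PEPMASS=610.30
CHARGE=2+
150.0 5.0
END IONS
""".splitlines()

groups = group_by_peaks(read_mgf_entries(example))
duplicated = {p: h for p, h in groups.items() if len(h) > 1}
print(len(groups), "distinct peak lists,", len(duplicated), "written more than once")
for headers_list in duplicated.values():
    print([(h["TITLE"], h["CHARGE"]) for h in headers_list])
```

For a real file, pass an open file handle instead of `example`, e.g. `group_by_peaks(read_mgf_entries(open("consensus.mgf")))`, where `consensus.mgf` is a placeholder name.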

The scan number of consensus spectra is documented here: https://github.com/statisticalbiotechnology/maracluster/wiki/FAQ

JannikSchneider12 commented 5 months ago

Hey,

Thank you for the explanation. And yes, I converted the raw files with peak picking. Is there a way to avoid creating multiple consensus spectra for the same spectrum? Or does it make sense to keep them, since there could still be different precursor charges even though this is not "shown" in the mgf? (Sorry, I am new to mass spectrometry.)

Thanks again for your time and help.

MatthewThe commented 5 months ago

Sorry, I misspoke. I actually meant precursor detection rather than peak picking. Some search engines (e.g. MaxQuant and FragPipe) do this automatically, but many others don't. For search engines that don't do this (e.g. Comet and MSGF+), performing precursor detection (e.g. with Bullseye or Dinosaur/Biosaur) prior to clustering+searching can drastically increase the identification rate.

Can you send me the first 100 scan numbers of your consensus mgf? That would quickly tell me whether we indeed have multiple precursors per spectrum.

JannikSchneider12 commented 5 months ago

Here are the first scan numbers of the maracluster output:

image

MatthewThe commented 5 months ago

Yes, you do indeed have multiple precursors per consensus spectrum, as some scan numbers end with 02.
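To check this across the whole file rather than the first 100 scans, a small sketch like the following could tally how many entries each consensus spectrum received. It assumes the last two digits of the scan number are a precursor counter and the leading digits identify the consensus spectrum. That layout is only inferred from the "end with 02" observation and is not confirmed in this thread, so verify it against the FAQ first. The function name and sample titles are made up.

```python
import re
from collections import Counter

SCAN_RE = re.compile(r"scan=(\d+)")


def precursors_per_cluster(title_lines):
    """Count mgf entries per consensus spectrum from 'scan=XXX' titles."""
    counts = Counter()
    for line in title_lines:
        match = SCAN_RE.search(line)
        if match is None:
            continue
        scan = int(match.group(1))
        # ASSUMPTION (not confirmed): scan = spectrum_id * 100 + precursor counter
        spectrum_id, _precursor = divmod(scan, 100)
        counts[spectrum_id] += 1
    return counts


# Invented titles for illustration.
titles = [
    "TITLE=scan=101",
    "TITLE=scan=102",
    "TITLE=scan=201",
    "TITLE=scan=301",
    "TITLE=scan=302",
]
counts = precursors_per_cluster(titles)
multi = sorted(c for c, n in counts.items() if n > 1)
print(len(titles), "entries for", len(counts), "consensus spectra")
print("spectra with multiple precursors:", multi)
```

If the assumption holds, the number of keys in `counts` is the figure to compare against the number of clusters in the filtered tsv, not the raw number of mgf entries.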

If you're fine with using a different spectrum format, e.g. .mzML or .ms2, you can get a single spectrum (with multiple precursors) per consensus spectrum.

JannikSchneider12 commented 5 months ago

Thanks again. Just out of curiosity: since MaRaCluster only gets the mgf files in my case, I was wondering where it gets the information that there might be multiple precursors. And did I understand correctly that when multiple consensus spectra are created for the same cluster, they differ only by precursor charge?

MatthewThe commented 4 months ago

It's hard to say without seeing the input files. The most likely reason is indeed that some spectra get assigned multiple potential charge states, either by the mass spec or by a preprocessing step. This would then lead to the corresponding consensus spectra having multiple potential charge states as well.
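One way to look for this in the input mgf files is to collect the charge states listed for each spectrum title and flag titles with more than one. The sketch below is a guess at how that might look, not a description of what MaRaCluster does internally. It handles two cases: a single `CHARGE=` line listing several charges (the `2+ and 3+` form from the Mascot mgf conventions), and the same title appearing in several blocks with different charges. Whether your converter writes either form is an assumption to verify. The function name and sample data are made up.

```python
import re
from collections import defaultdict

CHARGE_RE = re.compile(r"\d+[+-]?")


def charges_by_title(lines):
    """Collect all charge states seen for each TITLE across mgf blocks."""
    charges = defaultdict(set)
    title, block_charges = None, set()
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            title, block_charges = None, set()
        elif line.startswith("TITLE="):
            title = line[len("TITLE="):]
        elif line.startswith("CHARGE="):
            # "2+ and 3+" yields ["2+", "3+"]; "2+" yields ["2+"].
            block_charges.update(CHARGE_RE.findall(line[len("CHARGE="):]))
        elif line == "END IONS" and title is not None:
            charges[title] |= block_charges
    return charges


# Invented sample data for illustration.
example = """\
BEGIN IONS
TITLE=run1.1234
PEPMASS=500.25
CHARGE=2+ and 3+
100.1 10.0
END IONS
BEGIN IONS
TITLE=run1.1235
PEPMASS=610.30
CHARGE=2+
150.0 5.0
END IONS
""".splitlines()

charges = charges_by_title(example)
ambiguous = {t: sorted(c) for t, c in charges.items() if len(c) > 1}
print(len(ambiguous), "of", len(charges), "spectra list more than one charge state")
print(ambiguous)
```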

JannikSchneider12 commented 4 months ago

Alright, then I will try to have a deeper look. Thanks again for your help!