smith-chem-wisc / mzLib

Library for mass spectrometry projects
GNU Lesser General Public License v3.0

Deconvolution #105

Closed stefanks closed 6 years ago

stefanks commented 7 years ago

In the imzspectrum interface, add a deconvolution method that returns a list of masses. Parameters could specify the maximum possible charge (useful for MS2, where precursor charges are known) and a confidence level (for intact we only want confidently identified masses; for MS2 we are OK with a lot of low-confidence IDs that might even correspond to a single isotope peak). Another parameter could be the deconvolution result of a neighboring spectrum, which would increase confidence in matched masses (this is useful for intact, but useless for MS2).
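To make the role of the max-charge parameter concrete, here is a minimal sketch of the arithmetic behind it, assuming positive-mode protonated ions. The function names are hypothetical and are not part of mzLib; the point is only that a single peak implies one candidate mass per allowed charge, so capping the charge shrinks the search space.

```python
from typing import List, Tuple

PROTON_MASS = 1.007276466  # Da, mass of a proton


def neutral_mass(mz: float, charge: int) -> float:
    """Neutral mass implied by a peak at `mz` carrying `charge` protons."""
    return (mz - PROTON_MASS) * charge


def candidate_masses(mz: float, max_charge: int) -> List[Tuple[int, float]]:
    """One (charge, mass) candidate for every charge from 1 to max_charge."""
    return [(z, neutral_mass(mz, z)) for z in range(1, max_charge + 1)]


# A peak at m/z 501.007276466 could be a 500 Da ion at 1+, 1000 Da at 2+, ...
for z, mass in candidate_masses(501.007276466, max_charge=3):
    print(z, round(mass, 4))
```

For MS2, passing the known precursor charge as `max_charge` rules out the higher-charge candidates before any isotope-pattern matching is attempted.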

acesnik commented 7 years ago

It should also be possible to validate deconvolution results by comparing the spectrum to a theoretical isotopic distribution generated for the atomic composition of the molecule (protein in this case).
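One possible way to score such a comparison is a cosine similarity between the observed peaks and the theoretical envelope. This is only a sketch under assumptions: the theoretical distribution is supplied by the caller as (m/z, relative intensity) pairs, peaks are matched within a ppm tolerance, and the function name and default tolerance are made up for illustration.

```python
import math


def envelope_similarity(observed, theoretical, tol_ppm=10.0):
    """Cosine similarity between observed and theoretical isotope envelopes.

    Both arguments are lists of (mz, intensity). For each theoretical peak the
    most intense observed peak within tol_ppm is used, or 0 if none matches.
    """
    obs_vec, theo_vec = [], []
    for t_mz, t_int in theoretical:
        tol = t_mz * tol_ppm * 1e-6
        matched = [inten for mz, inten in observed if abs(mz - t_mz) <= tol]
        obs_vec.append(max(matched) if matched else 0.0)
        theo_vec.append(t_int)
    dot = sum(o * t for o, t in zip(obs_vec, theo_vec))
    norm = math.sqrt(sum(o * o for o in obs_vec)) * math.sqrt(sum(t * t for t in theo_vec))
    return dot / norm if norm > 0 else 0.0
```

A score near 1 means the observed envelope has the same shape as the theoretical one regardless of absolute intensity; a score of 0 means none of the expected peaks were found.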

stefanks commented 7 years ago

So another method in imzspectrum that takes in a molecule and outputs the confidence that it is present in the scan? What would this be useful for?

stefanks commented 7 years ago

Thermo sometimes provides a charge state guess for some peaks; this knowledge could be leveraged to get masses as well.

acesnik commented 7 years ago

We have done this before manually to validate intact protein identifications. Visually, it is clear in the few cases I've seen when a match is incorrect. We could use it to assign confidence in an intact MS (no fragmentation) identification. It might even be useful for calibrating intact files, since we could filter out low quality identifications.

acesnik commented 7 years ago

We could also do rank analysis like FDR on proteoform identifications from MS1 only with a confidence score. We don't have a metric like that as it stands.

stefanks commented 7 years ago

I see. This is a much easier task than deconvolution, since looking for a known mass is easier than looking for unknown masses.
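A rough sketch of why the known-mass case is simpler: the expected m/z for each charge can be computed directly, so the search reduces to a tolerance lookup. The function name, the ppm tolerance, and the use of the monoisotopic peak only are assumptions made for illustration.

```python
PROTON_MASS = 1.007276466  # Da, mass of a proton


def find_known_mass(peaks_mz, mass, charges, tol_ppm=10.0):
    """Return {charge: closest observed m/z} for charges where the mass is seen."""
    hits = {}
    for z in charges:
        target = mass / z + PROTON_MASS  # expected m/z at this charge
        tol = target * tol_ppm * 1e-6
        close = [mz for mz in peaks_mz if abs(mz - target) <= tol]
        if close:
            hits[z] = min(close, key=lambda mz: abs(mz - target))
    return hits


peaks = [334.34061, 501.007276466, 700.0]
print(find_known_mass(peaks, mass=1000.0, charges=range(1, 4)))
```

The number of charge states with a hit could feed into a confidence value, in contrast to full deconvolution, where neither the mass nor the charge is known up front.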

leahvschaffer commented 7 years ago

A few weeks ago, when I tried using the Thermo charge state guess, it didn't work for intact proteins (it returned 0). I don't know if that's because the peaks were so close together at higher charge states. We could test the confidence method on label-free top-down data, since we know what the right answer is: it should return a high score for the correct protein and a low score for others of a similar mass but different sequence. For label-free, I think some sort of filter step like this will be necessary because of the sheer volume of masses present.

stefanks commented 7 years ago

Moved my old attempt to mzLib https://github.com/smith-chem-wisc/mzLib/commit/ced4db7f04072a720d83053ffafbbffec5671edf

The Deconvolute method is in ThermoSpectrum, and it relies on charge guesses provided in Thermo raw files.

stefanks commented 7 years ago

Removed that old attempt; there is new code sitting in a pull request: https://github.com/smith-chem-wisc/mzLib/pull/183 What would be some good tests to validate the code? Let's think of a rigorous way to validate it, and if it passes I will include deconvolution in mzLib.

acesnik commented 7 years ago

Here are some thoughts on possible tests --

Generate a few theoretical isotopic distributions from a sample of ~100,000 molecules:

  1. Peptide (low mass)
  2. Protein (high mass)
  3. NeuCode-lysine protein (high mass with isotopic labeling, which can lead to monoisotopic errors)
  4. NeuCode-lysine protein mixture (difficult to distinguish distributions)

Checks:

  1. Check that the monoisotopic masses are correct.
  2. Check that the deconvolution software gets two monoisotopic masses for overlapping NeuCode-lysine distributions.
  3. Check that the method gets the correct integrated intensity.
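
As a starting point for checks 1 and 3, a round-trip test on synthetic data might look like the sketch below. Everything here is an assumption for illustration: the envelope is built from hand-picked relative intensities rather than a real isotopic distribution, the charge is inferred from the peak spacing, and the monoisotopic peak is assumed to be present (which is often not true for large proteins). The real deconvolution code would replace `recovered_mono_mass`.

```python
PROTON_MASS = 1.007276466      # Da, mass of a proton
ISOTOPE_SPACING = 1.0033548    # Da, approximate 13C - 12C mass difference


def synthetic_envelope(mono_mass, charge, rel_intensities):
    """List of (mz, intensity) isotope peaks for one species at one charge."""
    return [((mono_mass + k * ISOTOPE_SPACING) / charge + PROTON_MASS, inten)
            for k, inten in enumerate(rel_intensities)]


def charge_from_spacing(envelope):
    """Infer the charge from the mean m/z gap between adjacent isotope peaks."""
    mzs = sorted(mz for mz, _ in envelope)
    gaps = [b - a for a, b in zip(mzs, mzs[1:])]
    return round(ISOTOPE_SPACING / (sum(gaps) / len(gaps)))


def recovered_mono_mass(envelope):
    """Toy stand-in for deconvolution: lowest peak is taken as monoisotopic."""
    z = charge_from_spacing(envelope)
    return (min(mz for mz, _ in envelope) - PROTON_MASS) * z


env = synthetic_envelope(10000.0, 8, [0.2, 0.6, 1.0, 0.7, 0.3])
assert charge_from_spacing(env) == 8                      # charge recovered
assert abs(recovered_mono_mass(env) - 10000.0) < 1e-6     # check 1
assert abs(sum(i for _, i in env) - 2.8) < 1e-9           # check 3
```

Check 2 would need two overlapping envelopes in the same spectrum, which this toy stand-in cannot separate.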

stefanks commented 7 years ago

Say you have 1e6 proteoforms A and 1e7 proteoforms B, injected in a single scan.

What are the intensities of the peaks in the mass spectrum? In theory? I know the m/z values for each isotope for each charge, but what about the intensities? We want to reconstruct the number of proteoforms using intensity measurements, right?

Since there are 10 times more proteoform B than A, the ratios of some aggregated intensity measurements should be 1:10. Is it the ratio of the sum of all relevant peak intensities across all charge states? Or the ratio of the summed peak intensities of the most abundant charges? Or the ratio of the most abundant peak intensities?
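The three candidate measures can be written down side by side. This is a sketch with made-up intensities, assuming the input is already grouped by charge state and that "most abundant charges" means the single charge state with the largest summed intensity; the function name is hypothetical.

```python
def intensity_summaries(peaks_by_charge):
    """peaks_by_charge maps charge -> list of isotope peak intensities."""
    per_charge = {z: sum(v) for z, v in peaks_by_charge.items()}
    top_charge = max(per_charge, key=per_charge.get)
    return {
        "sum_all_charges": sum(per_charge.values()),   # every relevant peak
        "sum_top_charge": per_charge[top_charge],      # most abundant charge only
        "max_peak": max(max(v) for v in peaks_by_charge.values()),
    }


# Hypothetical intensities: B has the same envelope shape as A, 10x stronger.
a = {10: [1, 2, 1], 11: [2, 4, 2]}
b = {z: [10 * i for i in v] for z, v in a.items()}
sa, sb = intensity_summaries(a), intensity_summaries(b)
for name in sa:
    print(name, sb[name] / sa[name])
```

When the two species have identically shaped envelopes and charge distributions, all three ratios agree. They diverge as soon as the shapes differ, which is where the choice of measure starts to matter.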

Or maybe this can/should be relegated to FlashLFQ, and not be a part of deconvolution at all?

stefanks commented 7 years ago

Or maybe they ionize differently, and the ratios of amounts have nothing to do with intensities?

Then, say we have another condition with 2e6 of A and 1e7 of B. What's the formula to compute the 1:2 ratio of A in condition 1 vs condition 2, if the inputs to the formula are peak intensities?
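One way to reason about this is a simple linear model, which is an assumption and not something measured here: intensity = response factor x amount x run-level loading factor, with a species-specific response factor that stays constant between runs. Under that model the unknown ionization efficiency of A cancels when A is compared with itself across conditions, and dividing by B (whose amount is unchanged) also cancels run-to-run loading differences. All numbers below are hypothetical.

```python
def fold_change(intensity_cond1, intensity_cond2):
    """Ratio of one species' intensity between two conditions."""
    return intensity_cond2 / intensity_cond1


def normalized_fold_change(a1, b1, a2, b2):
    """Fold change of A after normalizing each run to reference species B."""
    return (a2 / b2) / (a1 / b1)


# Hypothetical response factors (ionization efficiency) and loading factors.
k_a, k_b = 0.3, 0.8
load1, load2 = 1.0, 1.5

a1, b1 = k_a * 1e6 * load1, k_b * 1e7 * load1   # condition 1
a2, b2 = k_a * 2e6 * load2, k_b * 1e7 * load2   # condition 2

print(fold_change(a1, a2))                  # distorted by the loading factor
print(normalized_fold_change(a1, b1, a2, b2))
```

The model says nothing about comparing A with B inside one scan: that ratio still contains k_a / k_b, so differing ionization does break the within-scan 1:10 expectation even though the across-condition ratio survives.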

stefanks commented 7 years ago

I guess any of the three methods should give the correct fraction...

So Anthony, what do you mean by correct integrated intensity?

rmillikin commented 7 years ago

I think that deconvolution could be input to FlashLFQ. Sort of treat each mass ID in each spectrum as a PSM. FlashLFQ would do peakfinding and aggregate the intensities together. It would take some effort to get everything to communicate together, but I think it would be a good division of labor. I have not looked at any NeuCode data, though, so I'm not sure how that would interface with ProteoformSuite's current quantification system.

stefanks commented 6 years ago

Done