MS2 generation in simulator

sdrogers commented 4 years ago

At the moment the code can generate MS2 spectra from either:

A kernel density
A CRP

To be able to run realistic experiments (e.g. DIA vs DDA) we need to be able to assign a Chemical a real MS2 spectrum. This could come from either:

an mzML file (i.e. choosing a random MS2 scan from an mzML file)
an mgf file (i.e. choosing a random spectrum from an mgf file)
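
A minimal sketch of the mgf option, assuming the usual BEGIN IONS / END IONS layout with KEY=VALUE headers followed by "m/z intensity" peak lines. The names parse_mgf, sample_ms2_spectrum and EXAMPLE_MGF are hypothetical and not part of the existing code; the same random-choice step would apply to MS2 scans once they have been read from an mzML file.

```python
import random
from typing import Dict, List

# Tiny illustrative MGF text (made-up values, for the sketch only).
EXAMPLE_MGF = """BEGIN IONS
TITLE=chem_A
PEPMASS=181.07
85.03 120.0
163.06 900.0
END IONS
BEGIN IONS
TITLE=chem_B
PEPMASS=301.14
121.05 50.0
283.13 400.0
301.14 1000.0
END IONS
"""


def parse_mgf(text: str) -> List[Dict]:
    """Parse MGF-formatted text into a list of {'params', 'peaks'} dicts."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=... or PEPMASS=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z and intensity separated by whitespace
                parts = line.split()
                if len(parts) >= 2:
                    current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


def sample_ms2_spectrum(spectra: List[Dict], rng: random.Random):
    """Return all peaks of one randomly chosen spectrum (a whole scan)."""
    return rng.choice(spectra)["peaks"]
```

Passing the random.Random instance in explicitly keeps the choice reproducible across simulator runs.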

For either of these, it looks straightforward: we need a new peak_sampler object. Or the ability in a peak sampler object to return spectra like the above.

Note: I think the peak sampler object is just used to initialise Chemicals? Is that correct? In which case, I'm not sure why it has (un-implemented) noise methods? Noise is an artefact of the data sampling process.

There is an additional use case (for DIA vs DDA experiments):

We have an mzML containing a known mixture, i.e. we know the molecules that are in there (there will also be noise).
We have an mgf holding the spectra of these known chemicals.
When we seed the simulator with this data, we want to be able to assign the correct MS2 spectrum to the correct Chemical (and perhaps noisy ones to the other chromatograms).

Maybe this is best done using the known chemicals? We have a DB of known chemicals, and we have a DB of their known spectra. In that setting, can we add "noisy" chemicals? I.e. random extra UnknownChemical objects to test the acquisition / analysis more?
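
A rough sketch of that idea, assuming known chemicals and known spectra can be matched by a shared name. SketchChemical, assign_known_spectra and make_noise_chemicals are hypothetical stand-ins for illustration, not the simulator's own classes; matching by precursor m/z within a tolerance would be an alternative.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


@dataclass
class SketchChemical:
    """Hypothetical stand-in for a simulator chemical."""
    name: Optional[str]  # None marks an unknown ("noisy") chemical
    mz: float
    ms2_peaks: List[Peak] = field(default_factory=list)


def assign_known_spectra(chemicals: List[SketchChemical],
                         spectrum_db: Dict[str, List[Peak]]) -> List[Optional[str]]:
    """Attach each chemical's own spectrum by name; return names without a match."""
    missing = []
    for chem in chemicals:
        if chem.name in spectrum_db:
            chem.ms2_peaks = list(spectrum_db[chem.name])
        else:
            missing.append(chem.name)
    return missing


def make_noise_chemicals(n: int, mz_range: Tuple[float, float],
                         spectra_pool: List[List[Peak]],
                         rng: random.Random) -> List[SketchChemical]:
    """Create n unnamed chemicals with a random m/z and a randomly chosen spectrum."""
    lo, hi = mz_range
    return [
        SketchChemical(None, rng.uniform(lo, hi), list(rng.choice(spectra_pool)))
        for _ in range(n)
    ]
```

Returning the unmatched names makes it easy to see which known chemicals still lack a spectrum before the simulator is seeded.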

joewandy commented 4 years ago

For either of these, it looks straightforward: we need a new peak_sampler object. Or the ability in a peak sampler object to return spectra like the above. Note: I think the peak sampler object is just used to initialise Chemicals? Is that correct? In which case, I'm not sure why it has (un-implemented) noise methods? Noise is an artefact of the data sampling process.

The peak sampler (misleading name?) is actually what the paper calls the 'database'. It stores the trained KDEs, all the scan data extracted from one or several mzML files, and also the scan duration information, i.e. the (1,1), (1,2), (2,1), and (2,2) information. We've already added a method to return ms2 spectra in the peak sampler. This is used by the ChemicalCreator object used to initialise chemicals. ChemicalCreator has two modes:

generate ms2 spectra following the CRP or
generate ms2 spectra by sampling a random spectrum from the data.
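
The two modes can be sketched as below, assuming CRP here means a Chinese restaurant process over fragment m/z values. This is an illustration of the general technique, not the ChemicalCreator implementation; crp_fragments, generate_ms2, alpha and base_draw are hypothetical names.

```python
import random
from typing import Callable, Dict, List


def crp_fragments(n_frags: int, alpha: float,
                  base_draw: Callable[[random.Random], float],
                  counts: Dict[float, int],
                  rng: random.Random) -> List[float]:
    """Draw fragment m/z values with a Chinese restaurant process.

    counts maps previously used fragment m/z -> number of uses and is
    updated in place, so popular fragments become more likely over time.
    alpha must be positive.
    """
    drawn = []
    for _ in range(n_frags):
        total = sum(counts.values())
        if rng.random() < alpha / (total + alpha):
            # Open a new "table": draw a fresh m/z from the base distribution
            mz = base_draw(rng)
        else:
            # Reuse an existing fragment with probability proportional to its count
            mzs = list(counts)
            mz = rng.choices(mzs, weights=[counts[m] for m in mzs])[0]
        counts[mz] = counts.get(mz, 0) + 1
        drawn.append(mz)
    return drawn


def generate_ms2(mode: str, rng: random.Random, *, spectra=None,
                 n_frags: int = 3, alpha: float = 1.0, counts=None):
    """Dispatch between the two modes: 'crp' or 'sample'."""
    if mode == "crp":
        counts = {} if counts is None else counts
        return crp_fragments(n_frags, alpha,
                             lambda r: round(r.uniform(50.0, 500.0), 2),
                             counts, rng)
    if mode == "sample":
        # Return every peak of one randomly chosen real spectrum
        return list(rng.choice(spectra))
    raise ValueError(f"unknown mode: {mode}")
```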

In the current skeleton code, that unimplemented noise method is to be called by the Mass Spec when generating scans from chemicals. I forgot why it's there. Maybe the noise should be added as part of the ChemicalCreator instead? @vinnydavies

sdrogers commented 4 years ago

So, we need new methods to be able to get spectra (MS2 spectra, that is) from .mgf or .mzML files.

Ok for noise. Maybe the MS should pass some kind of noise generator? The noise generator ought to be an attribute of the MS, right?

Dr Simon Rogers Senior lecturer, School of Computing Science, University of Glasgow


vinnydavies commented 4 years ago

By "generate ms2 spectra by sampling a random spectra from the data.", does that mean sampling N individual fragments? The new method would need to get all the frags that come from one real ms2 scan.

The mass spec already has some noise generation for the intensity. We can add more functionality there.
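
One way the "noise generator as an attribute of the MS" idea could look, as a sketch only. SketchMassSpec, GaussianIntensityNoise and NoNoise are hypothetical names and do not describe the existing mass spec class; Gaussian noise clipped at zero is just an assumed example of an intensity noise model.

```python
import random
from typing import List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


class NoNoise:
    """Default noise model: leave intensities untouched."""

    def apply(self, intensity: float) -> float:
        return intensity


class GaussianIntensityNoise:
    """Add zero-mean Gaussian noise to an intensity, clipped at zero."""

    def __init__(self, sigma: float, rng: Optional[random.Random] = None):
        self.sigma = sigma
        self.rng = rng if rng is not None else random.Random()

    def apply(self, intensity: float) -> float:
        return max(0.0, intensity + self.rng.gauss(0.0, self.sigma))


class SketchMassSpec:
    """Hypothetical mass spec that owns its noise generator."""

    def __init__(self, intensity_noise=None):
        self.intensity_noise = intensity_noise if intensity_noise is not None else NoNoise()

    def scan(self, peaks: List[Peak]) -> List[Peak]:
        # Noise is applied at scan time, so Chemicals stay noise-free
        return [(mz, self.intensity_noise.apply(i)) for mz, i in peaks]
```

Keeping the noise object on the mass spec means the peak sampler and the Chemicals only ever hold clean spectra, and different noise models can be swapped in per experiment.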

joewandy commented 4 years ago

The new method would need to get all the frags that come from one real ms2 scan

Yeah this here already gets all the frags that come from one real ms2 scan.

sdrogers commented 4 years ago

@joewandy what about the final part of my comment? When we have a real molecule and real spectrum?

joewandy commented 4 years ago

Continue in issue #102

glasgowcompbio / vimms

MS2 generation in simulator #13
