rformassspectrometry / CompoundDb

Creating and using (chemical) compound databases
https://rformassspectrometry.github.io/CompoundDb/index.html

Use case: find compounds with a certain MS2 peak #28

Open jorainer opened 6 years ago

jorainer commented 6 years ago

Retrieve compounds (and/or spectra) that have an MS2 peak with a certain m/z.

The query would be something like:

compounds(compdb, filter = MSnMzFilter(mz = 123.345, ppm = 10))

In the current database layout we cannot query on the m/z of individual peaks, because the m/z and intensity values are stored as a blob (for performance reasons; discussed in issue #26). We thus have to:

1) get all MS2 spectra that could potentially have a peak at that position,
2) for each MS2 spectrum, check whether any of its peaks matches the query m/z.

To speed up point 1) (i.e. to avoid retrieving all MS2 spectra): add msms_mz_range_min and msms_mz_range_max columns to the spectrum table, so that only spectra whose m/z range overlaps the input m/z are retrieved. This also picks up the idea from @SiggiSmara to speed up queries based on m/z ranges (https://github.com/EuracBiomedicalResearch/CompoundDb/issues/26#issuecomment-412004756).
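A minimal base R sketch of such a range pre-filter, for illustration only and not the CompoundDb implementation. The data frame spectra_tbl, the helper overlapsMz and the choice to widen the query m/z by a symmetric ppm tolerance are all assumptions; only the two range columns are modeled on the proposal.

```r
## Hypothetical stand-in for the spectrum table; only the two proposed
## range columns matter here.
spectra_tbl <- data.frame(
    spectrum_id = c("sp1", "sp2", "sp3"),
    msms_mz_range_min = c(23.231, 123.099, 200.5),
    msms_mz_range_max = c(432.0952, 453.2313, 432.0921),
    stringsAsFactors = FALSE
)

## Hypothetical helper: TRUE for rows whose stored m/z range overlaps the
## query m/z, widened by a symmetric ppm tolerance (an assumption).
overlapsMz <- function(tbl, mz, ppm = 10) {
    tol <- mz * ppm / 1e6
    tbl$msms_mz_range_min <= mz + tol & tbl$msms_mz_range_max >= mz - tol
}

## Keep only the spectra that could contain a peak at the query m/z.
candidates <- spectra_tbl[overlapsMz(spectra_tbl, mz = 123.345), ]
candidates$spectrum_id
```

Only the rows kept by this filter would then need their peak blobs unpacked and checked in step 2); in the database the same condition would go into the WHERE clause of the query.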

For point 2): implement a hasPeak(x, mz, ppm = 10, which = c("any", "all")) method for Spectra that returns TRUE or FALSE depending on whether the spectrum has a peak at the given m/z value(s). mz can have length > 1.
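The matching logic could look roughly like the following base R sketch. It works on a plain numeric vector of peak m/z values rather than on Spectrum objects, and it assumes a symmetric tolerance of +/- ppm around each query m/z; how the actual method applies ppm may differ, so results are not guaranteed to agree with the package.

```r
## Hypothetical helper on a numeric vector of peak m/z values.
## Assumes a symmetric tolerance of +/- ppm around each query m/z.
hasPeak <- function(x, mz, ppm = 10, which = c("any", "all")) {
    which <- match.arg(which)
    ## For each query m/z: is there at least one peak within tolerance?
    found <- vapply(mz, function(z) any(abs(x - z) <= z * ppm / 1e6),
                    logical(1))
    if (which == "any") any(found) else all(found)
}

peaks <- c(100.0005, 250.2, 300.1)
hasPeak(peaks, mz = c(100, 275))                 # one of the two queries matches
hasPeak(peaks, mz = c(100, 275), which = "all")  # 275 has no matching peak
```

With which = "any" a single matching query m/z is enough, whereas which = "all" requires every query m/z to have a matching peak.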

jorainer commented 6 years ago

Just added hasMz,Spectrum and hasMz,Spectra methods that allow testing whether peak(s) with a certain m/z are present in a spectrum:

library(CompoundDb)
sp1 <- new("Spectrum2", mz = c(23.231, 123.43, 255.231, 432.0952),
           intensity = c(123, 3432, 45432, 423))
sp2 <- new("Spectrum2", mz = c(123.099, 344.531, 453.2313),
           intensity = c(231, 431, 413))
sp3 <- new("Spectrum2", mz = c(123.1001, 343.4321, 432.0921),
           intensity = c(542, 4524, 32))
spl <- Spectra(sp1, sp2, sp3)

mzs <- c(123.1, 432.0931)

## Is any of the mzs present in a spectrum?
hasMz(spl, mzs)
[1]  TRUE FALSE  TRUE

## For each spectrum: does it contain peaks matching all input m/z values?
hasMz(spl, mzs, which = "all")
[1] FALSE FALSE  TRUE

The ppm parameter defines the acceptable difference between the query and the target m/z.

Would this possibly also be of interest for MSnbase, @lgatto?