rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

compareSpectra limits using joinPeaksGnps #239

Open LiesaSalzer opened 2 years ago

LiesaSalzer commented 2 years ago

Hi, I was testing the GNPS functionality from MsCoreUtils with compareSpectra. However, when I have a lot of MS2 spectra to compare (1603 MS2), it seems that computational limits are reached (the computations also take very long), and I get the following error message:

> GNPS_score <- compareSpectra(ms2_spectra_comb,
+                              MAPFUN = joinPeaksGnps,
+                              FUN = gnps,
+                              tolerance = tolerance,
+                              ppm = ppm, 
+                              type = "inner")
Error in solve_LSAP(score_mat, maximum = TRUE) : 
  long vectors (argument 1) are not supported in .C

I have no idea whether this can be solved somehow, but I just wanted to let you know.

LiesaSalzer commented 2 years ago

Also, it might be useful to have some kind of progress bar that shows whether the code is still running or is stuck somewhere.

jorainer commented 2 years ago

Hm, solve_LSAP is called in gnps to find the best match. From the error message, it seems to complain that the score matrix (score_mat) is too large. Can you check with max(lengths(ms2_spectra_comb)) what the largest number of peaks in a spectrum is for your dataset?
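For context on the error message: R's .C interface does not accept "long vectors", i.e. vectors with more than 2^31 - 1 elements, and a matrix is stored as a single vector of nrow * ncol elements. The sketch below shows that limit in base R. The helper name exceeds_c_limit is hypothetical, and the actual dimensions of score_mat inside gnps are not given in this thread, so the numbers are only illustrative.

```r
# Hypothetical helper (not part of Spectra or MsCoreUtils): would a dense
# matrix with these dimensions be a "long vector" that .C refuses to accept?
exceeds_c_limit <- function(nrow, ncol) {
    # Multiply as doubles to avoid integer overflow in the check itself.
    as.numeric(nrow) * as.numeric(ncol) > .Machine$integer.max
}

# Approximate memory footprint of a dense double matrix, in GiB.
matrix_gib <- function(nrow, ncol) {
    as.numeric(nrow) * as.numeric(ncol) * 8 / 1024^3
}

below <- exceeds_c_limit(46340, 46340)  # just under 2^31 - 1 elements
above <- exceeds_c_limit(46341, 46341)  # just over 2^31 - 1 elements
size  <- matrix_gib(46341, 46341)       # roughly 16 GiB of doubles
```

Under these assumptions, a square score matrix with more than 46340 rows would trigger the error, and even somewhat smaller matrices need many gigabytes of memory.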

Regarding a progress bar - yes, agree that that might be helpful - I'm just a little afraid this will slow down calculations even more ... I'll have a look into the function to see what we can do there...
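One way to limit the overhead of a progress bar is to refresh it once per row of the similarity matrix instead of once per comparison. The following is a minimal base R sketch of that idea, not how compareSpectra is implemented: pairwise_with_progress, the toy input and the shared function are all hypothetical stand-ins.

```r
# Hypothetical helper: apply `fun` to every pair of elements of `x` and
# show a text progress bar that is refreshed once per row.
pairwise_with_progress <- function(x, fun) {
    n <- length(x)
    res <- matrix(NA_real_, nrow = n, ncol = n)
    if (n == 0)
        return(res)
    pb <- utils::txtProgressBar(min = 0, max = n, style = 3)
    on.exit(close(pb))
    for (i in seq_len(n)) {
        for (j in seq_len(n))
            res[i, j] <- fun(x[[i]], x[[j]])
        # One update per row: n updates in total instead of n^2.
        utils::setTxtProgressBar(pb, i)
    }
    res
}

# Toy stand-ins for spectra and for a similarity function.
toy <- list(c(1, 2, 3), c(1, 2, 4), c(7, 8, 9))
shared <- function(a, b) length(intersect(a, b))
res <- pairwise_with_progress(toy, shared)
```

With 1603 spectra this would mean 1603 bar updates rather than about 2.6 million, which should keep the added cost small relative to the comparisons themselves.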

LiesaSalzer commented 2 years ago

max(lengths(ms2_spectra_comb)) was an excellent idea! I realized I still had a lot of noise in my MS2 spectra: the largest number of peaks was 41494. So it's not surprising that the GNPS calculations took forever...

After that, I applied a 10% intensity filter, which reduced the number of peaks to 247, and with that the similarity calculation was successful :)
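A filter of this kind can be sketched as a function that keeps only peaks with an intensity of at least 10% of the most intense peak in the same spectrum. The exact filter used above is not shown, so the relative-to-base-peak definition and the name keep_peaks are assumptions. The example runs on a plain numeric vector; with Spectra, such a function can, as far as I know, be passed to filterIntensity through its intensity argument (see the comment at the end).

```r
# Assumed definition of a "10 % intensity filter": keep peaks whose
# intensity is at least `prop` times the largest intensity of the spectrum.
keep_peaks <- function(x, prop = 0.1, ...) {
    x >= max(x, na.rm = TRUE) * prop
}

# Toy intensity vector standing in for one MS2 spectrum.
intensity <- c(5, 120, 1000, 99, 100, 3)
keep <- keep_peaks(intensity)
filtered <- intensity[keep]

# With a Spectra object `sps` (not created here), the call would look like:
# sps <- filterIntensity(sps, intensity = keep_peaks)
```

In the toy vector the threshold is 100, so only the peaks with intensities 120, 1000 and 100 are kept.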

So maybe it would make sense to include that information in the compareSpectra function, e.g. something like:

if (max(lengths(sps)) > 1000)
    warning("Spectra contain a lot of peaks/noise. Consider 'filterIntensity' to reduce calculation time")

jorainer commented 2 years ago

This would be a good idea. However, I am a little hesitant to add this additional check, because a lengths call would actually loop over all spectra (possibly requiring them to be loaded into memory from mzML files or retrieved from the database) to determine the number of peaks.