sgibb / MALDIquant

Quantitative Analysis of Mass Spectrometry Data
https://strimmerlab.github.io/software/maldiquant/

Massive memory usage while using mergeMassPeaks and filterPeaks #64

Open dsammour opened 4 years ago

dsammour commented 4 years ago

Hi Sebastian,

I am currently working with massive MassPeaks lists of MALDI-FTICR data. By massive I mean

> length(e$msDataPeaks)
[1] 41371 # number of spectra
> mean(lengths(e$msDataPeaks))
[1] 2027.565 # average number of peaks per spectrum

Everything works, but I noticed (i) huge memory usage (up to 120 GB!) when calling mergeMassPeaks and (ii) huge memory usage, and sometimes error messages, when calling filterPeaks. Both were called after binPeaks. The error message is as follows:

Error in which(is.na(m)) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138

I know that internally both functions construct intensity matrices, which blows up memory usage. Have you ever faced such issues? What would you recommend in such a situation?
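For a sense of scale, here is a rough back-of-the-envelope estimate. The number of spectra and the average peak count come from the figures above; the number of unique mass bins after binPeaks is not stated in this thread, so `nMassBins` below is a purely hypothetical value. The sparse figure assumes roughly 12 bytes per stored value (an 8-byte double plus a 4-byte row index) plus one 4-byte column pointer per column.

```r
# Rough memory estimate for a dense spectra-by-mass intensity matrix.
nSpectra     <- 41371      # from the issue
meanPeaks    <- 2027.565   # from the issue
nMassBins    <- 60000      # HYPOTHETICAL: not reported in the issue

# Use doubles for the product to avoid integer overflow.
nCells       <- as.numeric(nSpectra) * nMassBins
denseGiB     <- nCells * 8 / 1024^3            # 8 bytes per double

# A matrix with more than .Machine$integer.max cells is a "long vector",
# which is what the error message above complains about.
isLongVector <- nCells > .Machine$integer.max

# Column-compressed sparse storage: only the observed peaks are stored.
nNonZero     <- nSpectra * meanPeaks
sparseGiB    <- (nNonZero * 12 + 4 * (nMassBins + 1)) / 1024^3

round(c(dense = denseGiB, sparse = sparseGiB), 2)
```

Under these assumptions the dense matrix needs roughly 18.5 GiB per copy and crosses the long-vector limit, while the sparse representation stays below 1 GiB. Intermediate copies of the dense matrix would multiply the first figure.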

Suggestion: For the internal construction of the intensity matrices, do you think it would be better to construct sparse matrices? For example, instead of the current implementation of .as.matrix.MassObjectList, one could use something like this:

.mass = unlist(lapply(focusRegion, MALDIquant::mass))
.intensity = unlist(lapply(focusRegion, MALDIquant::intensity))
.uniqueMass = sort.int(unique(.mass))
n = lengths(focusRegion)
r = rep.int(seq_along(focusRegion), n)
i = findInterval(.mass, .uniqueMass)
sparmat = Matrix::sparseMatrix(i = r, j = i, x = .intensity,
                               dimnames = list(NULL, .uniqueMass),
                               dims = c(length(focusRegion), length(.uniqueMass)))

What do you think?
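To make the suggestion concrete, here is a minimal, self-contained sketch of the same idea on toy data. It uses plain lists instead of MassPeaks objects so that it only needs the Matrix package; the names `peaks`, `rowIdx`, `colIdx` and `meanIntensity` are made up for this illustration and are not part of MALDIquant. It assumes the peaks are already binned, so each mass occurs at most once per spectrum.

```r
library(Matrix)

# Toy stand-in for binned peak lists (HYPOTHETICAL data).
peaks <- list(
  list(mass = c(100, 200, 300), intensity = c(1, 2, 3)),
  list(mass = c(100, 300),      intensity = c(4, 5)),
  list(mass = c(200, 400),      intensity = c(6, 7))
)

mass       <- unlist(lapply(peaks, `[[`, "mass"))
intensity  <- unlist(lapply(peaks, `[[`, "intensity"))
uniqueMass <- sort(unique(mass))

# Row index: which spectrum each peak belongs to.
rowIdx <- rep.int(seq_along(peaks), lengths(lapply(peaks, `[[`, "mass")))
# Column index: position of each mass among the unique masses.
colIdx <- match(mass, uniqueMass)

m <- sparseMatrix(i = rowIdx, j = colIdx, x = intensity,
                  dims = c(length(peaks), length(uniqueMass)),
                  dimnames = list(NULL, as.character(uniqueMass)))

# Mean intensity per mass over the spectra in which the peak is present,
# computed without ever building the dense matrix. diff(m@p) is the number
# of stored entries per column of a column-compressed matrix.
meanIntensity <- Matrix::colSums(m) / diff(m@p)
```

Missing peaks are simply not stored here, whereas a dense matrix would have to hold an NA for each of them. Any code that relies on `is.na()` to find missing peaks would need to be adapted accordingly.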

sgibb commented 4 years ago

"Did you ever face such issues? What could you recommend in such situation?"

No. But I never had so much data.

Sparse matrices could be a solution especially with the on-disk vector feature.

YonghuiDong commented 3 years ago

@dsammour and @sgibb, sorry for posting an off-topic question here.

I also want to analyze my MALDI-FTICR data (MALDI profiling data, not MALDI imaging data) with MALDIquant, but I don't know how to convert it into a data format supported by MALDIquantForeign. How did you convert your data? Thanks.

Dong

dsammour commented 3 years ago

Hi @YonghuiDong, could you please open an issue in MALDIquantForeign and provide more details about the data structure, perhaps with an example? Thanks.

YonghuiDong commented 3 years ago

Hi @dsammour, thanks for your suggestion. I have opened an issue in MALDIquantForeign. Could you please have a look?

https://github.com/sgibb/MALDIquantForeign/issues/31

Thanks a lot.

Dong

paoloinglese commented 2 years ago

Please have a look at PR #71. I've optimized the speed and memory usage of filterPeaks. The bottleneck in that function is the binary matrix of peak occurrences.
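As an illustration of why the occurrence matrix is avoidable in principle, the sketch below counts how often each mass occurs directly from the concatenated mass vectors. This is not necessarily how PR #71 does it; it is a toy example in base R with made-up data and hypothetical names (`massList`, `keepMass`, `filtered`), assuming binned peaks where each mass occurs at most once per spectrum. The `minFrequency` variable only mimics the role of a relative frequency threshold.

```r
# Toy binned mass vectors, one per spectrum (HYPOTHETICAL data).
massList <- list(c(100, 200, 300), c(100, 300), c(200, 400))

mass       <- unlist(massList)
uniqueMass <- sort(unique(mass))

# Count in how many spectra each unique mass occurs. This needs only a
# vector of length(uniqueMass), not a spectra-by-mass binary matrix.
occurrence <- tabulate(match(mass, uniqueMass), nbins = length(uniqueMass))

# Keep masses that occur in at least half of the spectra.
minFrequency <- 0.5
keepMass     <- uniqueMass[occurrence >= minFrequency * length(massList)]

# Drop the infrequent masses from every spectrum.
filtered <- lapply(massList, function(m) m[m %in% keepMass])
```

Memory use here grows with the total number of peaks rather than with the number of spectra times the number of unique masses.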

sgibb commented 2 years ago

Thanks to @paoloinglese this is solved for filterPeaks in #71 and #72, respectively (and just merged into master, not on CRAN yet).