Open dsammour opened 5 years ago
Dear Denis,
thanks for you feature request. MALDIquant
was written because we had to analyse MALDI data from different MS devices with different resolution. That is the reason why we choose to store the m/z for each spectrum separately.
I see at least two solutions to the problem.
MassSpectrum
/MassPeaks
object that store their data on the disk instead of in memory.I would prefer the first method because it greatly reduce the memory usage. The second method roughly halfs the memory usage.
You could also try to remove useless metaData
(unfortunately they are redundant too).
Unfortunately I am not involved in any MS/MSI project anymore. While I am still maintaining MALDIquant
(fixing bugs/writing little new features if needed) I don't have enough time for such a large feature.
If you like to develop one of the above solutions I am happy to assist/review/accept pull request. If you don't want/can't develop something like this you may want to have a look at Cardinal which was written for large IMS data sets and keeps everything on disk.
Best wishes,
Sebastian
Hi Sebastian,
Thanks for your reply. I also find the first option more compelling, in particular, there is a package called matter
which is used internally in Cardinal
. Theoretically, it can interface with the imzML binary file directly, sparing the need for file conversion and saving, but I am not sure whether writing back to imzML is a good idea. I'll try and give it a shot sometime later to create on-disk objects storing only intensity data that can interface with MALDIquant
using createMassSpectrum
so that the rest of the functions can be directly applied. Meanwhile, it would be nice if you just store this issue as a feature request for the future.
Best, Denis
I don't think we have to rewrite the data to imzML (btw: this is how MSnbase::OnDiskMSnExp
works for mzML). I am not sure whether it is a good idea but the first time I thought about an on-disk object I was playing with a filesystem-based structure:
spectrum1/ # a folder
spectrum1/mz # binary file containing the mz values
spectrum1/intensity # binary file containing the intensity values
spectrum1/metaData # text INI/JSON/... file with metadata
spectrum2/
spectrum2/mz
spectrum2/intensity
spectrum2/metaData
...
If the m/z are identical for all spectra we could just create symlinks or keep them in memory. While I thought it is a simple way it may be better to use a more standardized format (imzML, mzDB, ...).
As I said above regardless of the underlying solution I would be more than happy to accept pull requests that implementing an on-disk feature.
Hi Sebastian,
Thanks again for your comments.
I have been reading the documentation of matter
package and I found it it to be very simple and versatile. In section 7 of this document they describe the possibility to extend classes with matter
objects. I don't have much experience in dealing with classes in R, so I could use some help in the following:
To replicate the current class structure of MassSpectrum
, I will need to have one object per spectrum which will later be stored in a list, say, a list of MassSpectrumOnDisk
objects with only the metaData
slot is being stored in memory while mass
and intensity
slots will be saved on disk. Ideally, to make all methods of MassSpectrum
available for this new class, it should inherit from MassSpectrum
class. However, for the on-disk functionality matter
requires mass
and intensity
slots to be of type matter_df
(or any other chosen matter
object). Is it generally possible to update the type of slots for inheriting classes? Do you see another alternative?
Thanks, Denis
In R
it is not possible to overwrite the type of slots in inherited classes. An alternative approach (that requires the change of the parent class AbstractMassObject
and will break backward compatibility but this feature is worth to break it) are class unions as members, see setClassUnion
, e.g.:
## create class unions (number or character/ mz&intensity or filename)
setClassUnion("NumericOrCharacter", c("numeric", "character"))
setClass("AbstractMassObject",
slots=list(mass="NumericOrCharacter", intensity="NumericOrCharacter",
metaData="list"),
contains="VIRTUAL")
setClass("MassSpectrum",
slots=list(mass="numeric", intensity="numeric"),
prototype=list(mass=numeric(), intensity=numeric(),
metaData=list()),
contains="AbstractMassObject")
setClass("OnDiskMassSpectrum",
slots=list(mass="character", intensity="character"),
prototype=list(mass=character(), intensity=character(), metaData=list()),
contains="AbstractMassObject")
ms <- new("MassSpectrum", mass=1:10, intensity=1:10)
oms <- new("OnDiskMassSpectrum", mass="filename", intensity="filename")
A possible class hierarchy would be:
AbstractMassObject (VIRTUAL, mass/intensity = "numeric"/"matter_vec", metaData="list/matter")
├── AbstractMassSpectrum (VIRTUAL, just for method implementation)
│ ├── MassSpectrum (mass/intensity = "numeric"/"double", metaData="list")
│ ├── MassSpectrumOnDisk (mass/intensity = "matter_vec", metaData="matter")
└── MassPeaks (mass/intensity = "numeric"/"double", metaData="list")
While I find matter
is a really great, impressive package the decision to depend on matter
is a tough one. Currently MALDIquant
is on CRAN and matter
on bioconductor. If we import matter
the user can't install MALDIquant
via install.packages("MALDIquant")
but has to use install.packages("BiocManager"); BiocManager::install("MALDIquant")
. IMHO there are at least two solutions:
MALDIquant
to bioconductor.matter
and use something else.I don't know which of the solutions is harder. I don't have any experience on moving from CRAN to bioconductor (IMHO it was the wrong decision in 2011 to send the package to CRAN and not to bioconductor ...).
EDIT: Maybe we could put matter
into Suggests:
. Then everybody could use MALDIquant
in memory and if you want to create and use an on-disk object the constructor could show a message how to install matter
.
Here is an short example implementation of an on-disk class using the binary filesystem structure I mentioned earlier: https://gist.github.com/sgibb/b46a543d4bd1476a2abd412d23bf3780
I don't know whether there are any performance benefits using this approach or matter
.
Unfortunately methods like plot
, transformIntensity
etc won't work because I used to access the slots directly via @mass
/@intensity
instead of their accessor methods mass()
/intensity()
to avoid method dispatching and save some time (it was a bad decision but it should be solvable with a sed
command). Also most of the methods have to be carefully analyzed and partly rewritten to not access mass()
/intensity()
multiple times and use internal/local variables instead (wasn't needed by accessing the slot directly).
Thanks Sebastian for investing your time in the clarifications above.
The class hierarchy mentioned above makes a lot of sense. Regarding the whole Bioconductor
thing, you may find this thread very interesting. It seems that one can set the repositories which install.packages
uses (see ?setRepositories
). One answer also mentions that one need only to add a line containing biocViews:
in the DESCRIPTION
file. Alternatively, as you mentioned, adding matter
to Suggests:
should be more than enough.
The following suggestion would probably belong to MALDIquantForeign
instead. IMHO the most elegant solution for importing large imzML data, is actually not to import them altogether! Ideally, and having the implementation of the class hierarchy mentioned above by you, the raw data could be kept on disk (within the ibd file) and "attached" to the session via matter
. In this situation, the loaded dataset will consist of a list of MassSpectrumOnDisk
objects whose mass
and intensity
slots are matter_vec
objects pointing to different offsets within the same ibd
file. The metaData
slot can be still loaded into memory because of its small footprint. To be safe, the ibd
will be copied beforehand with the copy being the temporary file that is attached. Do you think this is possible?
I find the solution above intriguing, however, at the moment I can't invest time to try it out (not in this year at least), but I will definitely inform you if I get something going regarding this.
Thanks, Denis
Thank you for the thread.
Your suggestion to use the ibd file instead of inventing a new file format is great! I would definitively do that.
My approach has the disadvantage that it reads all mass/intensity data of a spectrum into memory. That's fine for most methods because they work on a single spectrum (and a single spectrum should fit in memory). matter
offers the possibility to work on chunks (if a spectrum doesn't fit into memory) or you want to run methods across spectra.
While I really appreciate your help and effort and I am very interested in improving MALDIquant
but I am curious why you want to use MALDIquant
instead of Cardinal
for IMS. Looking over matter
and Cardinal
it seems to be more sophisticated and better suited for IMS-analysis.
It's a valid point, but honestly I find the processing methods implemented in MALDIquant
superior and clearer to the ones in Cardinal
, and being stored in lists, it is easier (and somewhat faster) to parallelize. Using lists to store peak information is also more convenient as opposed to sparse matrices in Cardinal
. In general, I prefer to use MALDIquant
for preprocessing and Cardinal
for visualization and manipulation.
Nice, to hear! Thanks.
Hi Sebastian,
I implemented what we were discussing above! I created a new class MassSpectrumOnDisk
whose mass
and intensity
slots are of class matter_vec
. I used a class union of numeric
and matter_vec
for the AbstractMassObject
i.e. setClassUnion("NumericOrOnDisk", c("numeric", "matter_vec"))
. For the most part, I avoided editing MassSpectrum
and MassPeaks
methods editing only some aspects of AbstractMassObject
methods and adding methods for MassSpectrumOnDisk
type. I tested most of the functions routinely used for preprocessing and everything seems to work flowlessly. I had some trouble with the trimming and subsetting methods since it does not make sense to trim on-disk objects, which I solved by creating new MassSpectrum
objects whenever trimming is needed (for example for the PlotMsiSlice
). calling detectPeaks
on MassSpectrumOnDisk
objects will create in-memory MassPeaks
objects. I also updated importImzML
in MALDIquantForeign
to be able to create instances of MassSpectrumOnDisk
objects successfully attaching the imzML dataset without loading it. Once can also use createMassSpectrumOnDisk
:
> createMassSpectrumOnDisk(seq(1,100), seq(1,100))
S4 class type : MassSpectrumOnDisk
Number of m/z values : 100
Range of m/z values : 1 - 100
Range of intensity values: 1e+00 - 1e+02
Memory usage : 13.516 KiB
Path to on-disk spectra : C:\Users\DENIS_~1\AppData\Local\Temp\RtmpUHGrxv\spectrum3b430a83046.mass
: C:\Users\DENIS_~1\AppData\Local\Temp\RtmpUHGrxv\spectrum3b41621428f.intensity
As an example, I have an example imaging dataset which consists of 10924 spectra with 10000 bins each:
disk = importImzMl(path = "F:\\R projects\\gist_big_experiment\\sample_test\\patAB.imzML", verbose = F, attachOnly = T, duplicateFile = T)
disk[[1]]
S4 class type : MassSpectrumOnDisk
Number of m/z values : 10000
Range of m/z values : 299.95 - 1998.982
Range of intensity values: 5.463e-03 - 7.607e-01
Memory usage : 15.625 KiB
File : F:\R projects\gist_big_experiment\sample_test\patAB.imzML
Path to on-disk spectra : C:\Users\DENIS_~1\AppData\Local\Temp\RtmpUHGrxv\file3b43fafa7_patAB.ibd
MALDIquant:::.memoryUsageStr(disk)
[1] "166.77 MiB"
diskLoadTime ## time it took to load the spectra
Time difference of 39.32421 secs
> a = Sys.time();disk = removeBaseline(disk);processingTime = Sys.time() - a
> processingTime
Time difference of 27.21858 secs
In comparison, loading the spectra into memory:
mem = importImzMl(path = "F:\\R projects\\gist_big_experiment\\sample_test\\patAB.imzML", verbose = F, attachOnly = F)
mem[[1]]
S4 class type : MassSpectrum
Number of m/z values : 10000
Range of m/z values : 299.95 - 1998.982
Range of intensity values: 5.463e-03 - 7.607e-01
Memory usage : 159.805 KiB
File : F:\R projects\gist_big_experiment\sample_test\patAB.imzML
> MALDIquant:::.memoryUsageStr(mem)
[1] "1.665 GiB"
> diskLoadTime
Time difference of 24.93157 secs
> a = Sys.time();mem = removeBaseline(mem);processingTime = Sys.time() - a
> processingTime
Time difference of 11.86915 secs
I made a lot of changes and updated the documentation files for both MALDIquant
and MALDIquantForeign
, it would be very nice if you can review my changes and perhaps test them. How should we manage that? should I push
the project directly from rstudio to your github repository?
Awesome! Looks promising!
Could you please fork the MALDIquant
(and the MALDIquantForeign
) repository and create a pull request? Then we could discuss and latter merge the code easily.
Done.
For MassSpectrumOnDisk
I have noted that some functions ex. calibrateIntensity
and averageMassSpectra
are still evaluated sequentially even with mc.cores
is set to >1
. When packing these two inside an mclapply
function, parallel evaluation proceeds without problems. Is it possible that they have a simple lapply
instead of the .lapply
function inside? cannot get my head around the code.
For matter
package, I specified a version constraint (>= 1.8.0)
. This is the one I am using on my system but the classes used where already defined available in (>= 1.4.1)
. Maybe it is more convenient to use the latter.
I fixed calibrateIntensity
, thanks for finding this! I have to look into averageMassSpectra
a little bit deeper.
@dsammour sorry for the delay. I partly incorporated your PR and pushed a working (at least for me) version of the MassSpectrumOnDisk
support to github (documentation and unit tests are still missing). Could you please try if this satisfy your needs?:
devtools::install("sgibb/MALDIquant@OnDiskVector")
devtools::install("sgibb/MALDIquantForeign@OnDiskVector")
library("MALDIquant")
library("MALDIquantForeign")
# example file from
# https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip
f <- "S042_Continuous.imzML"
mem <- import(f, verbose=FALSE)
ods <- import(f, attachOnly=TRUE, duplicateFiles=TRUE, verbose=FALSE)
odp <- import(f, attachOnly=TRUE, duplicateFiles=FALSE, verbose=FALSE)
mem[1:2]
# [[1]]
# S4 class type : MassSpectrum
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 8.61e+00
# Memory usage : 8.625 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
#
# [[2]]
# S4 class type : MassSpectrum
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 1.827e+01
# Memory usage : 8.625 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
#
ods[1:2]
# [[1]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 8.61e+00
# Memory usage : 7.234 KiB
# File : /tmp/Rtmpzl41n8/S042_Continuous6fb97b37cbbb.ibd
#
# [[2]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 1.827e+01
# Memory usage : 7.234 KiB
# File : /tmp/Rtmpzl41n8/S042_Continuous6fb97b37cbbb.ibd
#
odp[1:2]
# [[1]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 8.61e+00
# Memory usage : 7.281 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
#
# [[2]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 1.827e+01
# Memory usage : 7.281 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
#
@sgibb sorry for the huge delay as well. Just wanted to let you know that I have been working with the version I pushed for some time now ("onDiskVector" and "onDisk_support_for_imzML" branches of MALDIquant and MALDIquantForeign, respectively), everything seems to work smoothly. Note that I had to change the functionality of some methods to accommodate the MassSpectrumOnDisk objects, for example the "trim" method, does the trimming but returns a MassSpectrum object afterwards (thought that would be the easiest solution - trimming the file on disk would not make much sense!). However, under the "OnDiskVector" branch under your MALDIquant repository, I could not see these changes, did you have a different solution for the problem?
Hi @sgibb and @dsammour ,
thank you very much for working on this. I hope you can continue this good work and make the changes available through CRAN soon.
I am looking forward to the support for on disk imzML :)
Thank you, Melanie
Sorry for this huge delay. Unfortunately the write support in the OnDiskVector
class is not working for binary files with an offset (as nearly every .ibd
file use it). If we disable write support these doesn't really make sense at all because you have to import all spectra at once into memory (instead one at the time its used) as soon as you call the first function.
Please see this stackoverflow question that describes the problem: https://stackoverflow.com/questions/60158312/how-to-replace-a-value-in-a-binary-file-in-r
@dsammour did it really work for you?
Ok, I found a solution for the above problem. IMHO it is working now as expected. Could you please try the latest versions of my branches?:
devtools::install_github("sgibb/MALDIquant@OnDiskVector")
devtools::install_github("sgibb/MALDIquantForeign@OnDiskVector")
library("MALDIquant")
library("MALDIquantForeign")
# example file from
# https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip
f <- "S042_Continuous.imzML"
mem <- import(f)
ods <- import(f, attachOnly=TRUE)
Thank you very much for following up on the on disk support.
I have tried your code in Rstudio cloud with R3.6.0.
Simple commands such as length(ods) and summary(ods) work.
But I get the same error for other commands such as:
avgSpectra = averageMassSpectra(ods,method="mean")
plot(ods[[1]])
any(sapply(ods, isEmpty))
transformIntensity(ods, method="sqrt")
calibrateIntensity(ods, method="TIC")
@foellmelanie thanks for testing. Which kind of error do you get?
For me it works like expected:
#devtools::install_github("sgibb/MALDIquant@OnDiskVector")
#devtools::install_github("sgibb/MALDIquantForeign@OnDiskVector")
library("MALDIquant")
#>
#> This is MALDIquant version 1.99.1
#> Quantitative Analysis of Mass Spectrometry Data
#> See '?MALDIquant' for more information about this package.
library("MALDIquantForeign")
# example file from
# https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip
download.file(
"https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip",
"S042_Continuous_imzML1.1.1.zip"
)
unzip("S042_Continuous_imzML1.1.1.zip")
f <- "S042_Continuous.imzML"
mem <- import(f)
ods <- import(f, attachOnly=TRUE)
avgSpectra <- averageMassSpectra(ods, method="mean")
plot(ods[[1]])
any(sapply(ods, isEmpty))
#> [1] FALSE
ods <- transformIntensity(ods, method="sqrt")
ods <- calibrateIntensity(ods, method="TIC")
plot(ods[[1]])
@sgibb sorry forgot to paste the error. Then its probably just a problem of the path because I used Rstudio cloud:
Error in file(con, "rb") : cannot open the connection In addition: Warning message: In file(con, "rb") : cannot open file '/tmp/RtmpfjI3Vf/S042_Continuous1482df6d1d5.ibd.intensity.1216.mod': No such file or directory
Mh, I can't reproduce this error in https://rstudio.cloud/project/950000
Thanks @sgibb, all good now, I think something was wrong with my file. I downloaded it again and now it works. Works also for other files.
Sorry for the late reply, I had a fare share of deadlines to meet.
I did two tests; one for the branch I suggested and one for the 'onDiskVector' branch. Both are working properly.
Note: on a linux-based system.
> d1 = MALDIquantForeign::importImzMl("/mass-store/export/process/imzML data clinical samples/raw imzml/[MODIFIED] 20160526_SET4_SL2_patAb.imzML", attachOnly = TRUE, verbose = FALSE)
NOTE: imzML dataset was loaded via attacheOnly option and a duplicate file was generate. Any changes made to the spectra are directly written to the duplicate file.
> d1[[1]]
S4 class type : MassSpectrumOnDisk
Number of m/z values : 120000
Range of m/z values : 300 - 2000
Range of intensity values : 0e+00 - 1.799e+00
Memory usage : 6.742 KiB
File : /mass-store/disk3/process/imzML data clinical samples/raw imzml/[MODIFIED] 20160526_SET4_SL2_patAb.imzML
Path to on-disk mass values: /mass-store/ceph/tmp/RtmpfeOswi/file1d96ca2006d9_[MODIFIED] 20160526_SET4_SL2_patAb.ibd
Path to on-disk intensities: /mass-store/ceph/tmp/RtmpfeOswi/file1d96ca2006d9_[MODIFIED] 20160526_SET4_SL2_patAb.ibd
> plot(d1[[1]])
> d1 = calibrateIntensity(d1, "TIC")
Note: On Windows.
> d1 = MALDIquantForeign::importImzMl("F:\\R projects\\GIST\\raw_data\\[MODIFIED] 20160526_SET4_SL2_patAb.imzML", attachOnly = TRUE, verbose = FALSE)
> d1[[1]]
S4 class type : MassSpectrumOnDisk
Number of m/z values : 120000
Range of m/z values : 300 - 2000
Range of intensity values: 0e+00 - 1.594e+00
Memory usage : 7.211 KiB
File : C:\Users\DENIS_~1\AppData\Local\Temp\RtmpigKF7b\[MODIFIED] 20160526_SET4_SL2_patAb142418bd719f.ibd
> plot(d1[[1]])
> d1 = calibrateIntensity(d1, "TIC")
> plot(d1[[1]])
The same plots as above were generated.
I think it would really make sense to generate a note when loading attached files as the one I added within ìmportImzML` for example:
if(attachOnly)
{
if(duplicateFile)
message("\nNOTE: imzML dataset was loaded via attacheOnly option and a duplicate file was generate. ",
"Any changes made to the spectra are directly written to the duplicate file.\n ")
else
message("\nNOTE: imzML dataset was loaded via attacheOnly option to the ORIGINAL FILE. ",
"Any changes made to the spectra are directly written to the imzML file.\n ")
}
I would also suggest to keep the following part which guards against the possibility that a processed
imzML file is imported:
if(isProcessed && attachOnly){
message("The imzML file is of type 'processed'. The 'attachOnly' option is only available ",
"for 'continuous' type and therefore will be overridden. In-memory MassPeaks objects will be created.")
attachOnly <- FALSE
}
So it will override the àttachOnlyvariable and will create
MassPeaks` objects in memory in this situation.
I also didn't understand how the .isModified.OnDiskVector(x)
works, could you @sgibb please clarify that.
On another note, the above code seems not to work with imzML files generated by MALDIquantForeign::exportImzML
:
> d = MALDIquantForeign::importImzMl("F:\\R projects\\GIST\\sample_test\\patAB.imzML", attachOnly = TRUE, verbose = FALSE)
Invalid argumentCouldn't find end of Start Tag referenceableParamGr line 21338
Premature end of data in tag spectrum line 21337
Premature end of data in tag spectrumList line 94
Premature end of data in tag run line 93
Premature end of data in tag mzML line 2
Error: 1: Invalid argument2: Couldn't find end of Start Tag referenceableParamGr line 21338
3: Premature end of data in tag spectrum line 21337
4: Premature end of data in tag spectrumList line 94
5: Premature end of data in tag run line 93
6: Premature end of data in tag mzML line 2
where patAB.imzML
was generated by èxportImzML` method. But I guess this is a secondary issue.
Hi,
I am using
MALDIquant
extensively to process MSI datasets. These datasets tend to be extremely large (<50GB when loaded into memory into lists ofMassSpectrum
objects). The issue here is that eachMassSpectrum
object stores intensities and corresponding m/z values, which means huge redundancy and unnecessary memory usage since the mass axis is exactly identical for every spectrum. I was wondering if it would be possible to exclude the mass axis fromMassSpectrum
objects and possibly create a new type of object, sayMassAxis
, which can be passed as an argument to the processing functions. Obviously, this cannot be applied toMassPeak
objects.Thanks, Denis