mzdb / rmzdb

R bindings for the libmzdb C library
Other
0 stars 1 forks source link

Possible to extract a m/z - rt slice of data? #3

Open jorainer opened 6 years ago

jorainer commented 6 years ago

Assuming I have an retention time range and an m/z range, what would be the fastest/easiest way to extract the intensity values within these ranges from an mzDB file?

Looking through the exported methods I didn't spot a function that would enable such a call.

Maybe a stupid question, but I assume mzDB files index the spectrum data by retention time/scan index (same as in mzML). Are intensity values also indexed by m/z (to support faster extraction in the above case)?

david-bouyssie commented 6 years ago

Assuming I have an retention time range and an m/z range, what would be the fastest/easiest way to >extract the intensity values within these ranges from an mzDB file? Looking through the exported methods I didn't spot a function that would enable such a call.

This is still in the list of TODOs. For now we only expose single spectrum or spectra iterator interfaces. The XICs functions need to be ported from the Java implementation.

Maybe a stupid question, but I assume mzDB files index the spectrum data by retention time/scan index (same as in mzML). Are intensity values also indexed by m/z (to support faster extraction in the above case)?

Yes spectra are indexed by their primary key (database ID) and two other indexes (https://github.com/mzdb/mzdb-specs/blob/master/version_0.7.0/mzDB_0.7.0_schema_script.sql):

CREATE UNIQUE INDEX spectrum_initial_id_idx ON spectrum (initial_id ASC,run_id ASC);
CREATE INDEX spectrum_ms_level_idx ON spectrum (ms_level ASC,run_id ASC);

Peaks ([m/z - intensity] pairs) are also indexed by m/z thanks to the Bounding Box representation. The default size of each MS1 level BB is [5 Da ; 15s].

jorainer commented 6 years ago

thanks for the clarification

david-bouyssie commented 6 years ago

You are welcome.

I reopen to have it on the list of TODOs.

heejongkim commented 5 years ago

Any progress on this issue? It would be awesome to have XIC across RT function for DIA (given precursor m/z bin).

Thanks!

david-bouyssie commented 5 years ago

No progress for now. The student who was working on this has finished his internship. I have to update the current code base to enable this functionality. The easiest possibility for me would be to create a RCpp C++ Class using the "BBOX ITERATOR" functions defined in libmzdb. It should not be very complicated. Let see if I can work on this next week.

heejongkim commented 5 years ago

That would be tremendously helpful for my little script. I used MSnbase to extract several thousands of XIC across entire RT (both ms1 and ms2). Even though I did nested parallelization, I can't go below 30min processing time per file. Directly utilizing the vendor's binary access is sweet but I'd like to stick to the standard format to support cross-platform. I can't wait your next update. I'm so excited!!

Thanks!

jorainer commented 5 years ago

@heejongkim: to extract multiple XIC from the same file I suggest that you pass the definition for the mz and rt range as matrix to the chromatogram function (one row for e.g. lower and upper rt, or mz). The chromatogram for OnDiskMSnExp has been optimized to load the data required for each XIC only once. Alternatively, you could try to load the data into memory (i.e. as an MSnExp) - could also be a little faster.

david-bouyssie commented 5 years ago

Here are some good news. I have an experimental version which works on my laptop (SSD drive).

Here are some information regarding the dataset:

Total XICs duration: 10 seconds

I need to perform other benchmarks and to test on DIA files, but these are interesting preliminary results I wanted to share.

heejongkim commented 5 years ago

@jotsetung Thanks for the suggestion. By doing a minor change to accommodate the matrix-based extraction, I was able to bring it down to 17 min.

@david-bouyssie Awesome, I can't wait to give it a shot on my DIA datasets. Thanks for your great work.

david-bouyssie commented 5 years ago

@heejongkim great. how many XICs do you perform?

heejongkim commented 5 years ago

@david-bouyssie Roughly 20,000 ~ 30,000 for now

heejongkim commented 5 years ago

@david-bouyssie That number range is the current rough estimation from my peptide table without counting M+1, M+2 or SILAC pairs. Also, this needs to be done on multiple files (from 10 to 100 raw files). So, I wonder if mzDB interface will behave well under foreach or mclapply parallelization. Many Thanks!!

david-bouyssie commented 5 years ago

@heejongkim: ok, thank you for the update. I made some other tests on the same data file. Here are the results:

Thus, in this context it appears that R is 2 to 4 times slower than bare metal C++.

If I goes down to 27451 XICs the total duration is about 8.11 seconds when calling from R.

Note that I have not checked to quality of the retrieved data. I hope everything went fine and that I made no mistake that could explain the so fast observed results.

The next step is to test reading DIA files with rmzdb. Stay tuned ;-)

david-bouyssie commented 5 years ago

About the parallelization, I would suggest to not perform the XICs in parallel inside a given mzDB for multiple reasons that I could discuss here if you want.

You could effectively open multiple mzDB files in parallel and you might observe some slight improvements but I guess only for a limited number of concurrent file operations.

However, my advice would be to avoid parallelization for IO operations and to apply it to your post-processing pipeline where CPU usage might be important.

heejongkim commented 5 years ago

@david-bouyssie So, it's still awesomely fast so I don't think I really need to parallelize the XIC. I can't wait to try it out on my DIA datasets. Thank you so much for the update.

Out of curiosity, "XICs duration when performed from C++ and returning R data structure" <- this will be some sort of compiled Rcpp interface? I'm very much interested in performance improvement when appropriate and applicable.

Thanks!

david-bouyssie commented 5 years ago

@heejongkim: effectively when I don't send the results back to R then the observed timings are better. It happens when everything is done in C++ only, without constructing the R data matrix and without returning it to R via Rcpp. However, IMO the observed will probably be negligible compared to the time required to post-process the XICs.

And if you really want high performance data processing, it might be worth to implement some parts of your post-processing pipeline in C++ using Rcpp.

heejongkim commented 5 years ago

@david-bouyssie I totally agree. Eventually, I need to learn Rcpp for the post-processing for the alignment, co-elution evaluation, and scoring functions. Whenever the test version is available, let me know. Again, thank you so much!

david-bouyssie commented 5 years ago

@heejongkim: good news. I just released a preview version of rmzdb including support for DIA XICs. The code is not committed yet because I need to clean it a little bit and this will take time, and as you know this the most wanted resource... The main issue I had is relative the XML parsing library I was using. I had some weird bugs when using it from RCpp. I spent some time trying to fix it, but in the end I decided to switch to a different library which faster and more stable. I thus need to update the libmzdb project to finish this dependency switch.

In the meantime you can try this pre-release: https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip

I hadn't the time to make QC against the retrieved XICs. So please double check that you obtain reliable results compared to what you observe with other file formats. If you want more details we can plan a call.

Side note: I saw that you work in a core facility. You might be interested to check the tools we recently developed for DDA datasets: http://www.profiproteomics.fr/proline/ It is of course based on the mzDB file format and thus the quantification phase is very fast. We will soon submit a manuscript to present some benchmarks regarding the quality of the obtained results.

heejongkim commented 5 years ago

@david-bouyssie Thank you so much for the alpha version. I'm so excited but embarrassingly having a little trouble to install the alpha version here. Here's what I used to try it:

install.packages("https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip", repos = NULL, type="source", destdir = 'C:/temp/source', lib = 'C:/temp/libs')

trying URL 'https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip' Content type 'application/octet-stream' length 1893409 bytes (1.8 MB) downloaded 1.8 MB

Warning in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) : cannot open compressed file 'rmzdb_0.1_dia_preview/DESCRIPTION', probable reason 'No such file or directory' Error in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) : cannot open the connection In R CMD INSTALL Warning in install.packages : installation of package ‘C:/temp/source/rmzdb_0.1_dia_preview.zip’ had non-zero exit status

I thought that it's some sort of a permission problem so I started the Rstudio with admin permission and add destdir and lib locations but none of them was helpful.

Have you ever encountered something like this before?

Thanks!

david-bouyssie commented 5 years ago

It might be my mistake because I created the zip file manually. The zip file contains an rmzdb folder. Maybe the R installation system assumes that the package fIles are located at the root level of the archive. What I could suggest is to unzip the archive and then call "R CMD build rmzdb" from the unzipped folder (the one containing the rmzdb folder). Then from Rstudio change the workspace location to the same directory anf execute: install.packages("rmzdb", repos = NULL, type="source", INSTALL_opts=c("--no-multiarch"))

david-bouyssie commented 5 years ago

@heejongkim: I found a problem in the zip file. The DESCRIPTION file contained a blank line that was not respecting the R DESCRIPTION file writing convention. After removing this blank line, install.packages is effectively working. Sorry for this issue, I edited the file just before uploading...

I just updated the released zip file, so if you download the fixed archive it should also work for you. Note that to have a proper auto-completion of available methods (on each specific class/object) you need a recent version of R Studio. I discovered that old versions were not supporting this feature quite well. See: https://github.com/rstudio/rstudio/issues/1654

heejongkim commented 5 years ago

@david-bouyssie Unfortunately, I still get the same error message.

install.packages("https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip", type="source", repo=NULL) trying URL 'https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.zip' Content type 'application/octet-stream' length 1898519 bytes (1.8 MB) downloaded 1.8 MB

Warning in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) : cannot open compressed file 'rmzdb_0.1_dia_preview/DESCRIPTION', probable reason 'No such file or directory' Error in read.dcf(file.path(pkgname, "DESCRIPTION"), c("Package", "Type")) : cannot open the connection In R CMD INSTALL Warning in install.packages : installation of package ‘C:/Users/ADMINI~1/AppData/Local/Temp/2/Rtmpa6Fv08/downloaded_packages/rmzdb_0.1_dia_preview.zip’ had non-zero exit status

Also, I updated the RStudio as of now. I believe the issue is still coming from DESCRIPTION.

Will try R CMD INSTALL ...

david-bouyssie commented 5 years ago

@heejongkim: as I said previously, the provided zip file need to be first unzipped because it's a manually created archive. However I see that it's simpler to perform the install.packages directly on the archive so I replaced the manually created zip file by a tar.gz one created using R CMD BUILD.

I tested the following command on a clean computer and it works: install.packages("https://github.com/mzdb/rmzdb/releases/download/v0.1-alpha/rmzdb_0.1_dia_preview.tar.gz", type="source", repo=NULL, INSTALL_opts=c("--no-multiarch"))

I hope it will also work for you!

heejongkim commented 5 years ago

@heejongkim Ooops. It got installed beautifully. Let me give a shot.

david-bouyssie commented 5 years ago

@heejongkim: great :D If you want some code examples, have look at the script named "test_dia.R" (should be inside the tar.gz archive). If something is not clear just let me know. I hope it will perform well with your datasets. Side note: if you want to increase performance, it might be possible to change the settings you use to create the mzDB file.s If you define smaller m/z windows (let say 1 Da instead of 5 Da) you may observe a performance increase. However the file size should be a little bit higher. I never made this comparison on DIA data so I may be wrong...

david-bouyssie commented 5 years ago

@heejongkim So, are you satisfied with the current release?

heejongkim commented 5 years ago

@david-bouyssie Sorry. I had a grant due. I'm back and will check it out tonight (hopefully). Thanks!

heejongkim commented 5 years ago

@david-bouyssie

It seems it doesn't work on my hands and probably it's due to my dumb setting or conversion. If you can enlighten me, it would be great.

Here's what I did. First, I made the mzDB by following command: raw2mzDB_0.9.10_build20170802\raw2mzDB.exe -c 1-2 -a dia -s -i 1.raw -o 1.mzDB ---- last few lines of outputs ---- W1027 13:15:45.087618 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode W1027 13:15:45.090617 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode W1027 13:15:45.093627 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode W1027 13:15:45.097627 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode W1027 13:15:45.100641 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode W1027 13:15:45.103631 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode W1027 13:15:45.106627 47052 swath_producer.hpp:178] Low resolution spectrum is not normal in DIA mode [======> ] 12% I1027 13:15:47.431748 42164 mzdb_writer.hpp:536] Creating permanent spectrum table I1027 13:15:47.519750 42164 mzdb_writer.cpp:368] Creates indexes... I1027 13:15:47.892771 42164 spectrum_inserter.h:599]

**** Summary of the conversion ****

Created file: C:\Users\Administrator\Desktop\jcl\HPV_BPV_runs\20181013_jcl_McBride_293TT_HPVctrls_histones_1.mzDB MS Input Output Nb spectra MS1 PROFILE CENTROID 11541

I1027 13:15:47.896770 42164 raw2mzDB.cpp:635] Checking run slices numbers I1027 13:15:48.995817 42164 raw2mzDB.cpp:652] Elapsed Time: 292.547 sec ---------------- End of the output from raw2mzDB

after that, I followed your test script and, from dia window extraction, I get nothing:

mzdb_obj <- new(MzDb, "1.mzDB") Opening mzDB file... bpc_mat <- mzdb_obj$get_bpc_matrix(1) print(nrow(bpc_mat)) [1] 11541 dia_windows <- mzdb_obj$get_dia_isolation_windows() print(nrow(dia_windows)) [1] 0 head(dia_windows) <0 x 0 matrix>

Thank you so much!

best, hee jong

p.s. I opened the mzDB directly from SQLite Browser but I realized that I need to study your paper more. I'm heading back to mzDB paper now.

david-bouyssie commented 5 years ago

It seems that your DIA windows are not well defined. It may be an issue relative to the raw file conversion. Could you try to convert with the option -f 1-2 instead of -c 1-2 ? It should not have a direct link with this issue bit I would to check if it changes something.