Update retention times to alkane retention time indices for GC-MS

CLUES-Emory commented 8 months ago

Hello! We are running large studies (>1000 samples) on GC-HRMS. Peak detection has been working beautifully. Over the course of the run, we have drifts in retention times due to column aging and periodic column changes. To help account for this effect, we run alkane mixes within each batch (96 samples). These are used to both monitor retention time changes, and calculate retention time indices for comparison to spectral databases. To improve alignment of our data, I would like to loop each batch and replace the raw retention times with retention time indices calculated from the alkane mix in that batch. Once updated, we will extract and align peaks using the traditional XCMS workflow.

My question is what is the most appropriate way to do this? I could think of a few ways, with varying efficiency.

Read in files for each batch using readMSData(). Replace raw retention time with rtime(raw_data)<- Calc_RI. Write to new mzML file using writeMSData(). Obviously this would be very slow and inefficient.
Read in files for all samples readMSData(). Replace raw retention time OR adjusted retention time with rtime(raw_data)<- Calc_RI seperately for each batch. Continue with data extraction. Adjusted retention time would be preferred so the original retention time could be retained
Read in files for all samples readMSData(). Detect peaks using findChromPeaks. Replace peak retention times with Calc_RI. Continue with data extraction. However, I could not get this to work, since it would not let me assign adjusted retention times manually.

Any ideas on how to best do this? One concern I had is how functions that call back to the mzML file (such as fillChromPeaks()) would handle the very different retention time indices. Another issue I'm running into is it won't let me subset XCMSnExp by samples using the [] command. Based on the documentation, I thought this was possible.

Thanks!

SessionInfo() is below.

R version 4.3.3 (2024-02-29) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Monterey 12.5

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York tzcode source: internal

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] Spectra_1.12.0 microbenchmark_1.4.10 doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2 xcms_4.0.2
[7] MSnbase_2.28.1 ProtGenerics_1.34.0 S4Vectors_0.40.2 mzR_2.36.0 Rcpp_1.0.12 Biobase_2.62.0
[13] BiocGenerics_0.48.1 BiocParallel_1.36.0

loaded via a namespace (and not attached): [1] bitops_1.0-7 rlang_1.1.3 magrittr_2.0.3 clue_0.3-65 MassSpecWavelet_1.68.0
[6] matrixStats_1.2.0 compiler_4.3.3 vctrs_0.6.5 pkgconfig_2.0.3 MetaboCoreUtils_1.10.0
[11] crayon_1.5.2 XVector_0.42.0 utf8_1.2.4 preprocessCore_1.64.0 MultiAssayExperiment_1.28.0 [16] zlibbioc_1.48.0 GenomeInfoDb_1.38.7 progress_1.2.3 DelayedArray_0.28.0 prettyunits_1.2.0
[21] cluster_2.1.6 R6_2.5.1 RColorBrewer_1.1-3 limma_3.58.1 GenomicRanges_1.54.1
[26] SummarizedExperiment_1.32.0 snow_0.4-4 IRanges_2.36.0 Matrix_1.6-5 splines_4.3.3
[31] igraph_2.0.2 tidyselect_1.2.0 abind_1.4-5 codetools_0.2-19 affy_1.80.0
[36] lattice_0.22-5 tibble_3.2.1 plyr_1.8.9 survival_3.5-8 pillar_1.9.0
[41] affyio_1.72.0 BiocManager_1.30.22 MatrixGenerics_1.14.0 MALDIquant_1.22.2 ncdf4_1.22
[46] generics_0.1.3 RCurl_1.98-1.14 hms_1.1.3 ggplot2_3.5.0 munsell_0.5.0
[51] scales_1.3.0 MsExperiment_1.4.0 glue_1.7.0 MsFeatures_1.10.0 lazyeval_0.2.2
[56] tools_4.3.3 mzID_1.40.0 robustbase_0.99-2 QFeatures_1.12.0 vsn_3.70.0
[61] RANN_2.6.1 fs_1.6.3 XML_3.99-0.16.1 grid_4.3.3 impute_1.76.0
[66] MsCoreUtils_1.14.1 colorspace_2.1-0 GenomeInfoDbData_1.2.11 cli_3.6.2 fansi_1.0.6
[71] S4Arrays_1.2.1 dplyr_1.1.4 AnnotationFilter_1.26.0 pcaMethods_1.94.0 gtable_0.3.4
[76] DEoptimR_1.1-3 digest_0.6.34 SparseArray_1.2.4 multtest_2.58.0 lifecycle_1.0.4
[81] statmod_1.5.0 MASS_7.3-60.0.1

CLUES-Emory commented 8 months ago

I was able to implement the second option; however, am now receiving the following error when I try to do my peak picking:

Error: BiocParallel errors 4 remote errors, element index: 1, 6, 11, 16 16 unevaluated and other errors first remote error: Error in findChromPeaks_Spectrum_list(x = spectra(object, BPPARAM = SerialParam()), : Spectra are not ordered by retention time!

sneumann commented 8 months ago

Hi, the function expects strictly ordered RT for spectra. Could it be that you output integers as RT, and that two consecutive spectra have the same value ? Can you crank up the decimals, even if they were useless for a human observer ? Yours, Steffen

sneumann commented 8 months ago

Hi, I just realised that possibly the PR #732 by @philouail might solve a similar use case to yours ? Yours, Steffen

CLUES-Emory commented 8 months ago

Thanks Steffen! PR #732 is a potential approach that could work, although I'm not sure if our alkane mixes were run frequently enough to be used for alignment. I've actually gotten my code to work after increasing the number of decimals, and it seems to be greatly improving our alignment (less missing values) so far. Thanks for all your help.

jorainer commented 8 months ago

Note: we are currently adding the function to force retention times to be sorted to MsCoreUtils: https://github.com/rformassspectrometry/MsCoreUtils/pull/123 - so, in future you could use that official force_sorted() function directly.

sneumann / xcms

Update retention times to alkane retention time indices for GC-MS #731