sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

XCMS resets retention time correction when merging #548

Open · JohanLassen opened this issue 3 years ago

JohanLassen commented 3 years ago

Hi,

I use a cluster for XCMS processing and often split my workflow into smaller steps that are run whenever resources are available. Currently, I load individual files and identify peaks on them in one R script and merge the results in another. This saves a lot of time, and if some jobs crash, the remaining ones need no recalculation.
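For reference, the per-file step looks roughly like this (a minimal sketch; how the file name is passed to the job and the CentWaveParam settings are placeholders, not our actual setup):

library(xcms)

# One raw file per job: read it and run peak detection on it alone
raw_file <- commandArgs(trailingOnly = TRUE)[1]   # e.g. "sample_001.mzML"
raw_data <- readMSData(raw_file, mode = "onDisk")

# Placeholder centWave settings, not our real parameters
peaks_file <- findChromPeaks(raw_data, param = CentWaveParam(ppm = 25))

# Save the per-file XCMSnExp object for the later merging step
save(peaks_file, file = sub("\\.mzML$", ".Rdata", basename(raw_file)))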

I'm wondering if Obiwarp can be parallelized over different jobs as well. As far as I know, Obiwarp only performs pairwise alignments, correcting the retention times of the non-center sample. My thought is to align one sample at a time against the center sample, save the aligned sample, and merge all samples once they are computed.
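Per job, I imagine something roughly like this (a sketch only; the file names are placeholders, the center sample's peaks get re-detected in every job, and I haven't checked how best to apply the saved retention times back to the per-file objects):

library(xcms)

# Read the center sample plus one other sample and detect peaks in both
pair <- readMSData(c("center.mzML", "sample_i.mzML"), mode = "onDisk")
pair <- findChromPeaks(pair, param = CentWaveParam())

# Obiwarp alignment of sample_i against the center sample (index 1)
pair <- adjustRtime(pair, param = ObiwarpParam(centerSample = 1))

# Keep only the adjusted retention times of the non-center sample
adj_rt <- adjustedRtime(pair, bySample = TRUE)[[2]]
saveRDS(adj_rt, "sample_i_adjusted_rt.rds")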

So far I've found that XCMS resets the retention time corrections when using c() to merge files with aligned retention times:

Reverting retention times of identified peaks to original values ... OK
Warnings:
1: In FUN(X[[i]], ...) : Adjusted retention times found, had to drop them.
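A minimal way to trigger this (x1 and x2 stand for two XCMSnExp objects that were aligned separately; the exact wording of the warning may differ):

# x1, x2: XCMSnExp objects with adjusted retention times from separate runs
merged <- c(x1, x2)
hasAdjustedRtime(merged)   # FALSE - the adjusted retention times were dropped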

I hope there might be a solution to this, and that my idea is reasonable. The peak filling step (fillChromPeaks) could maybe also benefit from this.

Best, Johan

jorainer commented 3 years ago

Yes, Obiwarp uses one sample as reference, so in theory this should be possible. I am, however, a little reluctant to implement something like this here, as it sounds simple at first but can easily become tricky and complicated. In general, the alignment and correspondence steps should not be so computationally intensive that they really require parallel processing.

I am very well aware of the issue with parallel processing on a cluster. I have to share our cluster with computational genomics people that permanently run GWAS analyses and other stuff. We also did the peak detection first on a per-file basis, to be able to run it whenever a CPU was free. But what I learned from that is that combining the (in our case over 1,000) result objects into a single object for the data set is super-inefficient: it takes a lot of time and requires tons of memory. Thus I switched back to running the whole processing in one run, using 5-8 CPUs and a large amount of memory. In the end that was faster than doing the processing separately and then merging the results.

JohanLassen commented 3 years ago

Thanks for the reply - I have tried running it as you recommended, and it works fine.

We solved the memory problem (of merging) by saving the XCMSnExp objects as .Rdata files, so we only parse each raw file once.

jorainer commented 3 years ago

I did the same - saving each XCMSnExp object (from a single mzML file) as an RData object and then, when everything was done, combining all of them into a single XCMSnExp with c. The problem is that this turned out to be inefficient, because each c call triggers a validObject call which, for an XCMSnExp object, also checks whether the files are available and performs some other checks. Also, the more objects were already merged, the longer it took to combine the chromPeaks matrices from the files, and in the end that used a lot of memory. Thus, when I did this on 5,000 files, it took almost as long as running the whole peak detection on these files in one single call.
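In other words, the pattern was essentially this (a sketch; rdata_files is a placeholder for the vector of saved files), and every pass through the loop re-validates the ever-growing result:

# Incremental merging: one c() call per file, so validObject() and the
# chromPeaks() bookkeeping run on the growing object every iteration
result <- NULL
for (f in rdata_files) {
    load(f)   # loads an XCMSnExp object called peaks_file
    result <- if (is.null(result)) peaks_file else c(result, peaks_file)
}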

JohanLassen commented 3 years ago

I just did the peak detection file-wise for ~1,400 samples. As you said, the merging of the RData objects slowed down after a few hundred files. I had done the merging one file at a time in a for-loop, meaning that validObject effectively checked on the order of n^2 objects. To reduce the complexity, I instead loaded each object, assigned it to a distinct variable and built a string, "c(obj1, obj2...)", that was evaluated once all files were loaded. That way only n validations are run:

library(xcms)

# XCMSnExp objects saved as .Rdata files, one per raw file
files <- list.files(directory, pattern = "\\.Rdata$", full.names = TRUE)
concatenate_string <- "c("

# Iterate through files: assign each XCMSnExp object to a distinct variable
# and build the call string "c(obj1, obj2, ...)"
for (i in seq_along(files)) {
    name <- paste0("obj", i)
    load(files[i])   # loads the XCMSnExp object called peaks_file
    assign(name, peaks_file)
    separator <- if (i < length(files)) "," else ")"
    concatenate_string <- paste0(concatenate_string, name, separator)
}

# Evaluate the string as code: a single c() call, so validation runs only once per object
xset_peaksIdentified <- eval(parse(text = concatenate_string))
save(xset_peaksIdentified, file = "./xset_peaksIdentified.Rdata")

This code merged the 1,400 files in 7 minutes, in contrast to the former approach, which had already taken 1 hour for 800 samples when I stopped it.
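For completeness, the same single c() call can probably also be issued without eval(parse()) and assign(), by loading the objects into a list and using do.call (a sketch I have not benchmarked against the version above):

load_one <- function(f) {
    e <- new.env()
    load(f, envir = e)   # loads the XCMSnExp object called peaks_file
    e$peaks_file
}

obj_list <- lapply(files, load_one)
xset_peaksIdentified <- do.call(c, obj_list)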

I still haven't given up on parallelizing some of the steps. We are currently building a database that grows by a few hundred samples each month, and avoiding recalculation of every step for every file would be highly beneficial in the future.