sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
177 stars 81 forks source link

Retention time alignment using PeakGroupsParam's peakGroupsMatrix parameter renders error #715

Closed pablovgd closed 5 months ago

pablovgd commented 5 months ago

Hi all,

I've been writing a function which makes it easy to extract targeted m/z & RT regions from Spectra data and integrates peaks in that area. To make things a bit easier for myself, I wanted to apply retention time alignment. I've tried it with obiwarp already (subset based and not-subset based), but wanted to give PeakGroups a shot as well. PeakGroups normally requires to perform correspondence first, but since performing untargeted peak picking first would defeat the whole point of my function for fast targeted data analysis, that isn't really an option.

In the documentation I found the following: Since the function is based on features (i.e. chromatographic peaks grouped across samples) a initial correspondence analysis has to be performed before using the groupChromPeaks() function. Alternatively, it is also possible to manually define a numeric matrix with retention times of markers in each samples that should be used for alignment. Such a matrix can be passed to the alignment function using the peakGroupsMatrix parameter of the PeakGroupsParam parameter object.

Hence I gave it a shot on a test set of 16 samples and 4 QC's:

data_batch <- readMsExperiment(spectraFiles = files_batch,
                                     backend = MsBackendMzR(),
                                     BPPARAM = SnowParam(workers = 1)
      )

      sampleData(data_batch)$sample_type <- "study"
      sampleData(data_batch)$sample_type[grep(pattern = QC_pattern, files_batch)] <- "QC"

      test <- as.matrix(cbind(rtmed,rtmed,rtmed,rtmed))
      param <- PeakGroupsParam(minFraction = 0.9, peakGroupsMatrix = test, subset = which(sampleData(data_batch)$sample_type == "QC"))

      data_batch <- adjustRtime(data_batch, param = param)

"rtmed" is a vector with the retention times of the "marker" compounds I want to use. So the matrix has x rows (I've already experimented with different amount of markers) for all the markers and 4 columns for each QC sample defined as the subset for alignment.

However, whatever I seem to try, the following error keeps popping up;

Performing retention time correction using x (amount of makers) peak groups.
Error in if (any(rtdevsmorange/rtdevrange > 2)) warn.overcorrect <- TRUE : 
  missing value where TRUE/FALSE needed

Any ideas?

Thanks a lot!

Kind regards, Pablo

jorainer commented 5 months ago

Hi Pablo! Which version of xcms are you using? We finally implemented/fixed that functionality with this PR, so you should definitely use the developmental version of xcms.

In the version before that it was (if I remember correctly) not properly using this matrix.

jorainer commented 5 months ago

and the rtmed that you use are the retention times of the marker peaks that you have in the actual data I assume, right?

jorainer commented 5 months ago

but basically yes, what you're doing is correct. You need to first define a numeric matrix, each column being a sample, each row a anchor peak and the values being the retention times of these anchor peaks in the respective samples. Then you assign that matrix to the peakGroupsMatrix of the PeakGroupsParam and run adjustRtime with that.

Maybe also get directly in touch with @philouail , she used that approach for one of our data sets where we had terrible retention time shifts between samples, so we wanted to run the alignment on the internal standards we included. After chrom peak detection she therefore first identified in each sample the chrom peak for the internal standards we had (using functionality from the MetaboAnnotation package) and then used that matrix of retention times to align the data set. was working great.

pablovgd commented 5 months ago

Hi Pablo! Which version of xcms are you using? We finally implemented/fixed that functionality with this PR, so you should definitely use the developmental version of xcms.

In the version before that it was (if I remember correctly) not properly using this matrix.

Hi Johannes! I think I'm using the devel version: xcms 4.1.5

and the rtmed that you use are the retention times of the marker peaks that you have in the actual data I assume, right?

Yup, that's right.

but basically yes, what you're doing is correct. You need to first define a numeric matrix, each column being a sample, each row a anchor peak and the values being the retention times of these anchor peaks in the respective samples. Then you assign that matrix to the peakGroupsMatrix of the PeakGroupsParam and run adjustRtime with that.

Maybe also get directly in touch with @philouail , she used that approach for one of our data sets where we had terrible retention time shifts between samples, so we wanted to run the alignment on the internal standards we included. After chrom peak detection she therefore first identified in each sample the chrom peak for the internal standards we had (using functionality from the MetaboAnnotation package) and then used that matrix of retention times to align the data set. was working great.

Ok great, that's indeed exactly what I'm trying to achieve. I also got a challenging analysis that ran over multiple days and had a column switch, so RT shifts are pretty bad. Goal was indeed to use the internal standards as the anchor peaks. I'll get in touch with Philly, wonder what's going on. Thanks!

philouail commented 5 months ago

Hi Pablo ! so to me it's seems you are doing everything right. This error confuses me a bit because I have not seen it before when ran with a PeakGroupsMatrix . I'm gonna look through the internal code to see what it could be. I'll come back to you. I can also show you the code that we used to generate the input matrix with the MetaboAnnotation functionalities if you are interested ? But to me it does not seem like the format of your input matrix is wrong, so I'm not sure that would help so much.

I am guessing you did but have you tried different RT values between the QC samples ?

I'll come back to you quickly :)

pablovgd commented 5 months ago

Hi Pablo ! so to me it's seems you are doing everything right. This error confuses me a bit because I have not seen it before when ran with a PeakGroupsMatrix . I'm gonna look through the internal code to see what it could be. I'll come back to you. I can also show you the code that we used to generate the input matrix with the MetaboAnnotation functionalities if you are interested ? But to me it does not seem like the format of your input matrix is wrong, so I'm not sure that would help so much.

I am guessing you did but have you tried different RT values between the QC samples ?

I'll come back to you quickly :)

Hi Philly! Thanks a lot for having a look at this. I indeed already tried different RT values as anchor peaks.

jorainer commented 5 months ago

coming back to Phili's observation (I completely overlooked that) - you define the matrix with

 test <- as.matrix(cbind(rtmed,rtmed,rtmed,rtmed))

so, you're using 4 times the exact same retention times - which I believe is actually the problem (I guess) because there might be then some division by 0 that results in NaN values that then ends up in the obscure error message above.

What you need to provide is the retention times of the actual chromatographic peaks for your features (on which you want to perform the alignment) in the specific samples.

What we did was the following: we had a data.frame with the mz and (approximate) retention times for internal standards (and some other compounds that we're pretty sure are always in our data. The task was then to find for each such standard in each file the chrom peak for that standard.

The code phili used (a little simplified and generalized).

#' Get the chrom peaks from our data set
cpks <- as.data.frame(chromPeaks(xdata))
cpks$peak_id <- rownames(cpks)

#' `standard` is a `data.frame` with the m/z (column "mz") and approximate 
#' retention time values (column "RT") for the compounds on which we 
#' want to align the data 

#' get ID for peaks matching with internal standards
for (i in seq_along(xdata)) {
    tmp <- cpks[cpks$sample == i, ]
    match_intern_standard <- matchValues(
        query = standard,
        target = tmp,
        mzColname = c("mz", "mz"),
        rtColname = c("RT", "rt" ),
        param = MzRtParam(ppm = 0, tolerance = 0.01, toleranceRt = 10))
    #' Select the chrom peak with the largest apex signal
    match_intern_standard <- filterMatches(
        match_intern_standard, SingleMatchParam(duplicates = "top_ranked",
                                                decreasing = TRUE,
                                                column = "target_maxo"))
    ID_table[, i] <- match_intern_standard$target_peak_id
}

This ID_table is now a character matrix with the chrom peak IDs, nrow is number of standards, ncol number of samples. For the matching we used functions from the MetaboAnnotation package (actually, the current development version because that one provides the SingleMatchParam option). Since it can happen that for one standard you have multiple potentially matching chrom peaks in the same sample - especially if you use loose m/z and rt tolerances. You need to somehow tackle that. In our case we select the chrom peak with the highest signal - we manually checked all of them to be sure, so for our data that was fine - alternatively you could also skip the assignment of a chrom peak in such cases.

The next thing we did is to get the retention times for the selected chrom peaks in the respective samples:

#' Function to create rt dataframe;
#' avoiding subset with NA turns out to be much more efficient
rtdf <- function(xdata, ID_table) {
    index <- as.vector(ID_table)
    nna <- !is.na(index)
    x <- rep(NA, length(index))
    x[nna] <- chromPeaks(xdata)[index[nna], "rt"]
    dim(x) <- dim(ID_table)
    rownames(x) <- rownames(ID_table)
    colnames(x) <- colnames(ID_table)
    x
}

#' run for nafld
RT_raw <- rtdf(xdata, ID_table)

And we can then use this matrix with retention times for the alignment.

param <- PeakGroupsParam(span = 0.5,
                         peakGroupsMatrix = RT_raw)
xdata <- adjustRtime(xdata, param = param, chunkSize = 10L)

So, it's key that the peakGroupsMatrix represents the actual retention times of anchor peaks in your data. And retention times of the same anchor peak (row) need to differ between samples - otherwise the data would already be super-aligned ;)

Hope this clarifies the approach a bit?

We're also working on an option to align a data set against an external reference (would allow to align data sets against each other) - and with that we will then also try to improve the documentation and add examples...

pablovgd commented 5 months ago

Oh ok! I misinterpreted Phili's remark concerning the "different RT's", I indeed just used the same expected RT for all internal standards in the QC's. Makes sense. I'll give the approach described above a shot tomorrow and keep you updated so that this issue can be closed if that solves things. Thanks a lot again to both of you!

pablovgd commented 5 months ago

Alright, so the target retention time of each internal being exactly the same for each sample in the input was indeed the problem. I rewrote my function so that it now first finds the exact RT of the peak apex of the internal standards in the QC samples using a targeted peak extraction function, and than applies the rt correction.

 #Define parameters for retention time adjustment, based on the QC's and the internal standards
      param <- PeakGroupsParam(minFraction = 0.9, peakGroupsMatrix = int_std, subset = which(sampleData(data_batch)$sample_type == "QC"))

      #Adjust the RT
      data_batch <- adjustRtime(data_batch, param = param)
      #Apply the adjusted RT 
      data_batch <- applyAdjustedRtime(data_batch)

Thanks again for helping out.

jorainer commented 4 months ago

thanks for reporting - that tells us we need to do a better documentation :)