sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
177 stars 81 forks source link

Why does fill peaks with XCMSnExp object result in considerably less features than fill peaks with XCMSset? #677

Closed Pembs closed 1 year ago

Pembs commented 1 year ago

I have been running the below workflow, where after retention time alignment , I either continue on with using the XCMSnExp objects for grouping and filling, or I convert to xcmsSet objects before grouping and filling.

I can see that the number of peaks for each sample remains the same after the grouping stage of each workflow.

image

But after filling, there are ~4000 more peaks in each sample from the XCMSset processing vs the XCMSnExp.

image

The total number of features obtained from the XCMSset is 5649 vs 876 from the XCMSnExp. Please could somebody tell me why filling with the older workflow is resulting in more features? I'm not sure and I'm uncertain which feature list to proceed with.

Workflow ==>

path <- dir(path="C:/Users/XXXXXX/pos_mzXML", pattern=".mzXML", all.files=T, full.names=T, recursive = F)

PD<- data.frame(sample_name=sub(basename(path, pattern=".mzXML", replacement = "", fixed=TRUE), sample_group=c(rep("sample", times=300), rep("QC", times=70)) stringsAsFactors = FALSE)

x <- SnowParam(workers=6, tasks=2, progressbar=T, manager.hostname = Sys.info()[["nodename"]])

raw_data <- readMSData(files = path, pdata = new("NAnnotatedDataFrame", PD), mode = "onDisk", msLevel.=1, centroided. = T)

rawdata <- filterEmptySpectra(raw_data)

param = CentWaveParam( ppm = 5, peakwidth = c(3, 50), snthresh = 10, noise = 15000, prefilter = c(3, 100), mzdiff = 0.0045, mzCenterFun = "wMean", integrate = 1, fitgauss = FALSE)

bpstart(x)

peak picking

xcms1 <- findChromPeaks(rawdata,param =param, BPPARAM=x)

cpc

cpc <- filter_xcms_peaklist(xd = xcms1, return_type = "cpc", param = cpc::cpcProcParam(min_sn = 10,min_pts = 10, min_intensity = 2000)) xcms_cpc <- getFilteredXCMS(cpc)

merging

mer <- MergeNeighboringPeaksParam()

register(SerialParam())

xdata_mer <- refineChromPeaks(xcms_cpc,mer)

align

obiwarpy = ObiwarpParam(binSize = 1, distFun = "cor_opt", response = 1, gapInit = numeric(), gapExtend = numeric(), factorDiag = 2, factorGap = 1, localAlignment = FALSE)

xdata2 <- adjustRtime(xdata_mer, param = obiwarpy)

group

xdata_group3 <- groupChromPeaks(xdata2, param=PeakDensityParam(sampleGroups = xdata2$sample_group, minFraction = 0.5, bw = 30, minSamples = 1, binSize = 0.05))

fill

xdata_fill4 <- fillChromPeaks(xdata_group3, param = ChromPeakAreaParam())

OR XCMSset

group and fill

xdata_old <- as(xdata2, "xcmsSet") xdata_oldgrp <- group(xdata_old, minfrac = 0.5, bw = 30, minsamp = 1, mzwid = 0.05) xdata_oldfill <- fillPeaks(xdata_oldgrp)

I'm using R version 4.2.0 and xcms version 3.2.0.

stanstrup commented 1 year ago

I don't have an example in front of me to check how things are exactly so I could be wrong but from memory:

Pembs commented 1 year ago

Thank you for your comment :

I realised that I hadn't specifed class in my phenodata dataframe. Whilst there are tutorials for the new XCMSnExp workflow, I'm yet to come across one for xcmsSet. I thought sample_group column would be enough for xcmsSet grouping, but perhaps not! After converting the xdata2 to an xcmSet object ( xdata_old), sampclass(xdata_old) also gets 370 levels, so one for each sample, when there should only be two ( one for sample and one for QC). I'll specify a class column in my PD/phenodata and will see how this affects the feature numbers through filling.

-thank you for the link , I followed the advice specified here and filtered out empty spectra.

Pembs commented 1 year ago

What I still remain confused by is why after grouping , where the xcmsSet had no sample group/class information whilst XCMSnExp did, how they still had the same number of peaks per sample after this? Whereas after filling, there are marked differences.

And if in a workflow, all samples are in the same group, do I need to have a class column?

stanstrup commented 1 year ago

How are you counting peaks? Aren't you counting the peaks found for each sample? Not the number of features after grouping.

The difference after filling seems large but you are using two different methods for gap-filling. The old version is equivalent to using FillChromPeaksParam instead of ChromPeakAreaParam. I don't know if that could explain the difference but it is a difference between the approaches at least.

Pembs commented 1 year ago

I have counted the number of peaks through either as.data.frame(table(peaks(xdata_oldgrp)[, "sample"])) for xcmsSet or through as.data.frame(table(chromPeaks(xdata_group3)[, "sample"])).

I've found two things (and will probably close this and re-open another issue): -if I specified my xdata_old$pheno$class as the original sample_group and then grouped and filled, I had similar numbers of peaks and the same number of features for each workflow. So I can close this issue because they are the same now if I am ensuring that the sample_groups ( for XCMSnExp) or class (xcmsSet) have been specified.

I suppose the previous confusion of the same number of peaks after grouping with and without class information no longer matters.

My issue now is why does grouping and subsequent filling when specifying sample group information (samples and QCs) result in considerably less peaks than without specifying group/class information. This must not just be due to the retention of features which are present in at least 50% of only one group (when minFraction is 0.5), because without class information, there are 5129 features present in at least 50% of any one group, but with class information there are 999.

Pembs commented 1 year ago

Other issue https://github.com/sneumann/xcms/issues/678