Open Michaelncallahan opened 4 years ago
Hi, I have very little hands-on experience running Obiwarp in xcms myself. Any chance you try single-threaded first ? Does it work to align just two files ? Can you use msconvert or OpenMS to take just the first, say, 3600 spectra from each file ? Yours, Steffen
Thank you for the quick reply. I was certainly thinking about running them synchronously and I'll give that try. Is that a flag/param that I could pass to the ObiwarpParam
.
Something like:
data <- adjustRtime(raw_data, param = ObiwarpParam(binSize = .6, bpparam = ... )
I tried to subset
the files (to align two files) and still had the same issue.
I will filter the first 3600 spectra and see what happens. Then maybe try n-spectra where n is greater than the working number to find where it stops working - if it works that is.
Just to provide a quick update. I selected the first 3600 scans via msconvert, and encountered this error:
Sample number 2 used as center sample. Aligning LL2053_UPS2Price_Standard_rep3O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds. Found gaps in scan time of file LL2053_UPS2Price_Standard_rep3O1.mzML: cut scantime-vector at NA seconds. Error: BiocParallel errors element index: 1, 2 first error: argument is of length zero
**Perusing the two previous issues with that problem.
ok, next wild guess. You've used the cool and current xcms3 interface. Let's try plain old xcms1. If that works, we'll have something to debug and fix.
Instead of readMSData()
you need an old xcmsSet
Object.
help("retcor.obiwarp")
xs <- xcmsSet(c(filenames))
xs <- retcor(faahko, method="obiwarp")
The algorithm is the same, just a different interface. Old xcmsSet/Obiwarp might have some issues, so not anymore recommended, but ok for debugging.
Yours, Steffen
And if you want to run this without parallel processing simply call it with
adjustRtime(..., BPPARAM = SerialParam()
or alternatively (to disable parallel processing for all xcms
functions:
register(SerialParam())
@jorainer
I ran:
>register(serialParam())
>data <- adjustRtime(raw_data, param = ObiwarpParam())
and received:
Sample number 2 used as center sample.
Aligning LL2053_UPS2__Price_Standard_rep1O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds.
Found gaps in scan time of file LL2053_UPS2__Price_Standard_rep1O1.mzML: cut scantime-vector at NA seconds.
Error in if (rtmaxdiff > (5 * mstdiff)) { : argument is of length zero
@sneumann
I ran:
files <- c(
"/Users/michael/work/correspondence/JC_raw/centroided/LL2053_UPS2__Price_Standard_rep1O1.mzML",
"/Users/michael/work/correspondence/JC_raw/centroided/LL2053_UPS2__Price_Standard_rep2O1.mzML",
"/Users/michael/work/correspondence/JC_raw/centroided/LL2053_UPS2__Price_Standard_rep3O1.mzML"
)
xs <- xcmsSet(files)
xs <- retcor(xs, method="obiwarp")
and received:
center sample: LL2053_UPS2__Price_Standard_rep1O1
Processing: LL2053_UPS2__Price_Standard_rep2O1 Create profile matrix with method 'bin' and step 1 ... OK
Create profile matrix with method 'bin' and step 1 ... OK
Found gaps: cut scantime-vector at 0.5113442 seconds
Found gaps: cut scantime-vector at 0.513545 seconds
libc++abi.dylib: terminating with uncaught exception of type std::bad_alloc: std::bad_alloc
zsh: abort R
Could you also try to do a data <- filterEmptySpectra(data)
before running adjustRtime
? This should remove empty spectra from the data set - just in case.
Another possibility would be to filter by retention time. What puzzles me it that it tells you that it cuts at an rt of 0.5 seconds - so maybe do a data <- filterRt(data, rt = c(1, max(rtime(data))))
to remove spectra recorded before an rt of 1 second.
# Get files
files <- dir("/Users/michael/work/correspondence/JC_raw/centroided", full.names = TRUE, pattern = "mzML$", recursive = TRUE)
# Build phenodata
pd <- data.frame(sample_name = "test",
sample_group = c(rep("1",1), rep("2", 1), rep("3", 1)),
stringsAsFactors = FALSE)
# Load raw data
raw_data <- readMSData(files = files, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk")
# Filter MS1
raw_data <- filterMsLevel(raw_data, msLevel = 1L)
# Filter empty spectra
raw_data <- filterEmptySpectra(raw_data)
# Subset raw_data in RT dimension 1..max_rt
raw_data <- filterRt(raw_data, rt = c(1, max(rtime(raw_data))))
# Run Obiwarp
raw_data <- adjustRtime(raw_data, param = ObiwarpParam())
=>
Sample number 2 used as center sample.
Aligning LL2053_UPS2__Price_Standard_rep1O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds.
Found gaps in scan time of file LL2053_UPS2__Price_Standard_rep1O1.mzML: cut scantime-vector at NA seconds.
Error in if (rtmaxdiff > (5 * mstdiff)) { : argument is of length zero
Is this error basically stating that the difference in RT from spectra to spectra within a sample is greater than 5 * mean distance between spectra within a sample?
If I change the right-bound of the range in RT for filtering to 100 seconds, it doesn't error out.
raw_data <- filterRt(raw_data, rt = c(1, 100))
I've added a tentative fix that should at least avoid the error that is thrown. Can you please install that with devtools::install_github("sneumann/xcms", ref = "jomaster")
and try again?
Also, seems that there is something strange with the retention times in some of your files. Can you please check the rt range for your files?
rts <- split(rtime(data), fromFile(data))
lapply(rts, range)
I ran the install, and then commented out the line. The package "xcms" appears to have been updated in
/usr/local/lib/R/4.0/site-library/xcms
The dataset that I am using is the filtered set of spectra from 1..3600 (as recommended by @sneumann, just for your edification)
Also, forgive my ignorance with the package usages, I am rather new to R language and RStudio, but I do have a ruby/c++ development background in mass spec and thanks again for the amazing help!
Anyways.
# devtools::install_github("sneumann/xcms", ref = "jomaster"E)
# Get files
files <- dir("/Users/michael/work/correspondence/JC_raw/centroided", full.names = TRUE, pattern = "mzML$", recursive = TRUE)
# Build phenodata
pd <- data.frame(sample_name = "test",
sample_group = c(rep("1",1), rep("2", 1), rep("3", 1)),
stringsAsFactors = FALSE)
# Load raw data
raw_data <- readMSData(files = files, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk")
# # Filter MS1
# raw_data <- filterMsLevel(raw_data, msLevel = 1L)
#
# # Filter empty spectra
# raw_data <- filterEmptySpectra(raw_data)
#
# # Subset raw_data in RT dimension 1..max_rt
# raw_data <- filterRt(raw_data, rt = c(1, max(rtime(raw_data))))
# Run Obiwarp
raw_data <- adjustRtime(raw_data, param = ObiwarpParam())
=>
Sample number 2 used as center sample.
Aligning LL2053_UPS2__Price_Standard_rep3O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds.
Found gaps in scan time of file LL2053_UPS2__Price_Standard_rep3O1.mzML: cut scantime-vector at NA seconds.
Error: BiocParallel errors
element index: 1, 2
first error: argument is of length zero
...
rts <- split(rtime(data), fromFile(data))
print(lapply(rts, range))
=>
$`1`
[1] 0.5113442 869.2116754
$`2`
[1] 0.513545 874.731671
$`3`
[1] 0.5001967 890.4193777
Here is a snippet of the first 10 rts for each file:
F1:
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 F1.S0007 F1.S0008 F1.S0009 F1.S0010
0.5113442 0.5666625 2.5963471 2.6626127 3.3369812 3.4070735 4.4129045 5.0772575 5.7421003 5.8551376
F2:
F2.S0001 F2.S0002 F2.S0003 F2.S0004 F2.S0005 F2.S0006 F2.S0007 F2.S0008 F2.S0009 F2.S0010
0.5135450 0.5748717 1.1039869 1.5542145 2.3234385 2.5108162 2.8774145 2.9913400 3.0615366 3.6569360
F3:
F3.S0001 F3.S0002 F3.S0003 F3.S0004 F3.S0005 F3.S0006 F3.S0007 F3.S0008 F3.S0009 F3.S0010
0.5001967 0.5512692 0.9897637 1.1963838 2.2049644 2.2941260 2.3448350 3.6762752 3.7908262 3.8471546
Any chance of providing the slimmed mzMLs ? Ideally together with R script to reproduce. I can't promise timing, but I'd try to reproduce locally. Yours, Steffen
What surprises me is that you don't have continuous retention times (e.g. for the first file 0.51, 0.56
but then it jumps to 2.59, 2.66
and then again to 3.33
). Was your data recorded that way or was there some filtering going on that removed spectra in between?
@jorainer
It was recorded that way. In the above snippet MS2 scans are included.
Here is a snippet of MS1 only.
F1:
F1.S0001 F1.S0005 F1.S0010 F1.S0013 F1.S0015 F1.S0017 F1.S0020 F1.S0021 F1.S0023
0.5113442 3.3369812 5.8551376 8.0251921 9.1990242 9.7030199 11.9109058 12.3269175 13.4657453
@sneumann @jorainer
Here is a link to a google drive with the snippets (which are spectra 0..3600) in .mzML and the full .raw files. These are public domain datasets, so I am fine sharing them.
No worries about the timing, I am very excited just to be helped!
Hi, first I can confirm the bad_array
. So the good news: it's not not you.
Based on your R script in the google drive above, I poked around a bit.
## What RT do we have
firstFile <- fData(raw_data)[,"fileIdx"]==1
plot(rtime(raw_data)[firstFile])
## and what differences
deltaRt <- rtime(raw_data)[firstFile][2:1600]-rtime(raw_data)[firstFile][1:1599]
plot(deltaRt)
Looking at the mzML it was measured on "Orbitrap Fusion" and converted with pwiz. There are 3600 scans, and they cover ~14 minutes ("scan start time" given in minutes). Something is odd regarding the RT stuff. While I didn't debug where it breaks exactly, I see that inter-scan times vary from <0.5 to 3sec , confirming Jo's observation. Yours, Steffen
Interesting findings. I am not the original creator of the data, but I have the ability to reach out to them if it is necessary. These files are technical replicates that we're using to test a feature detection, alignment, and correspondence workflow. Let me know if I can do anything or assist in any manner.
I am curious though, even with a high deltaRt, shouldn't an alignment still be possible (although, the alignment quality might suffer)? Just speaking from my experience with Obiwarp and its matrix alignment.
I guess alignment should be possible. The problem is that there is this limitation in the xcms
code to cut the data if it finds gaps in retention times - could be that this has to do with the data it sends to the C++ code of obiwarp. That I don't know as I've not written or touched this part of the code.
Thank you @jorainer @sneumann, I appreciate your time discussing this matter with me.
It sounds like the characteristics of this file make it something that the pre-processing steps within xcms
and before obiwarp
weren't designed to handle (the gaps in RT).
This is outside the purview of the issue at hand, but for discussions sake. I am looking for some additional proteomics datasets (ideally technical or biological replicates) that exhibit high retention time deviation between runs i.e features that occur at significantly different retention times. These will be used to test a new open-source warping-free correspondence algorithm.
Once again, thank you.
Michael
Re data set: sorry, I don't know any such data sets.
Re pre-processing handling gaps: the peak groups alignment should be able to work also with such data sets - only the obiwarp
implementation in xcms
will fail on that.
Thank you, I'll look into that!
Hello, and thank you in advance for any nudges, help or discussion.
I am trying to run alignment on three large Proteomics datasets using Obiwarp. They're each about 800 Mb. I am fairly familiar with Obiwarp and have an implementation of it that works with these files but naturally there are difference in the implementation of Obiwarp in XCMS and my implementation i.e the building and passing of the matrix to Obiwarp, etc.
I tried running the workflow in XCMS using this R script:
files <- dir("~/JC", full.names = TRUE, pattern = ".mzML$", recursive = TRUE)
print(files)
pd <- data.frame(sample_name = "test", sample_group = c(rep("1",1), rep("2", 1), rep("3", 1)), stringsAsFactors = FALSE)
raw_data <- readMSData(files = files, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk")
raw_data <- filterMsLevel(raw_data, msLevel = 1L)
data <- adjustRtime(raw_data, param = ObiwarpParam(binSize = .6))
When I run it through RStudio, I get:
Sample number 2 used as center sample. LLVM ERROR: out of memory LLVM ERROR: out of memory Error in result[[njob]] <- value : attempt to select less than one element in OneIndex In addition: Warning message: In parallel::mccollect(wait = FALSE, timeout = 1) : 2 parallel jobs did not deliver results
When I run it through the CLI, I get:
Sample number 2 used as center sample. Error in x$.self$finalize() : attempt to apply non-function terminate called after throwing an instance of std::bad_array_new_length what(): std::bad_array_new_length Error: 'bplapply' receive data failed: error reading from connection terminate called after throwing an instance of std::bad_array_new_length what(): std::bad_array_new_length
I tried running it on my mac with 16 gb of memory and the rsessions (viewing through Top or Activity Monitor) quickly get into swap.
I then tried a beefy EC2 instance with 32 gb of RAM with the same results.
I have run large files through the same scripts and had success.
I converted these files from .raw to .mzML/.mzXML using msconvert and centroided them using OpenMS.
Do you think these is simply an out of memory issue, or is it a malformed file?
Any help or advice is much appreciated.