std::bad_array_length while running Obiwarp

Michaelncallahan commented 4 years ago

Hello, and thank you in advance for any nudges, help or discussion.

I am trying to run alignment on three large Proteomics datasets using Obiwarp. They're each about 800 Mb. I am fairly familiar with Obiwarp and have an implementation of it that works with these files but naturally there are difference in the implementation of Obiwarp in XCMS and my implementation i.e the building and passing of the matrix to Obiwarp, etc.

I tried running the workflow in XCMS using this R script:

files <- dir("~/JC", full.names = TRUE, pattern = ".mzML$", recursive = TRUE)

print(files)

pd <- data.frame(sample_name = "test", sample_group = c(rep("1",1), rep("2", 1), rep("3", 1)), stringsAsFactors = FALSE)

raw_data <- readMSData(files = files, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk")

raw_data <- filterMsLevel(raw_data, msLevel = 1L)

data <- adjustRtime(raw_data, param = ObiwarpParam(binSize = .6))

When I run it through RStudio, I get:

Sample number 2 used as center sample. LLVM ERROR: out of memory LLVM ERROR: out of memory Error in result[[njob]] <- value : attempt to select less than one element in OneIndex In addition: Warning message: In parallel::mccollect(wait = FALSE, timeout = 1) : 2 parallel jobs did not deliver results

When I run it through the CLI, I get:

Sample number 2 used as center sample. Error in x$.self$finalize() : attempt to apply non-function terminate called after throwing an instance of std::bad_array_new_length what(): std::bad_array_new_length Error: 'bplapply' receive data failed: error reading from connection terminate called after throwing an instance of std::bad_array_new_length what(): std::bad_array_new_length

I tried running it on my mac with 16 gb of memory and the rsessions (viewing through Top or Activity Monitor) quickly get into swap.

I then tried a beefy EC2 instance with 32 gb of RAM with the same results.

I have run large files through the same scripts and had success.

I converted these files from .raw to .mzML/.mzXML using msconvert and centroided them using OpenMS.

Do you think these is simply an out of memory issue, or is it a malformed file?

Any help or advice is much appreciated.

sneumann commented 4 years ago

Hi, I have very little hands-on experience running Obiwarp in xcms myself. Any chance you try single-threaded first ? Does it work to align just two files ? Can you use msconvert or OpenMS to take just the first, say, 3600 spectra from each file ? Yours, Steffen

Michaelncallahan commented 4 years ago

Thank you for the quick reply. I was certainly thinking about running them synchronously and I'll give that try. Is that a flag/param that I could pass to the ObiwarpParam.

Something like: data <- adjustRtime(raw_data, param = ObiwarpParam(binSize = .6, bpparam = ... )

I tried to subset the files (to align two files) and still had the same issue.

I will filter the first 3600 spectra and see what happens. Then maybe try n-spectra where n is greater than the working number to find where it stops working - if it works that is.

Michaelncallahan commented 4 years ago

Just to provide a quick update. I selected the first 3600 scans via msconvert, and encountered this error:

Sample number 2 used as center sample. Aligning LL2053_UPS2Price_Standard_rep3O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds. Found gaps in scan time of file LL2053_UPS2Price_Standard_rep3O1.mzML: cut scantime-vector at NA seconds. Error: BiocParallel errors element index: 1, 2 first error: argument is of length zero

**Perusing the two previous issues with that problem.

sneumann commented 4 years ago

ok, next wild guess. You've used the cool and current xcms3 interface. Let's try plain old xcms1. If that works, we'll have something to debug and fix.

Instead of readMSData() you need an old xcmsSet Object.

help("retcor.obiwarp")
xs <- xcmsSet(c(filenames))
xs <- retcor(faahko, method="obiwarp")

The algorithm is the same, just a different interface. Old xcmsSet/Obiwarp might have some issues, so not anymore recommended, but ok for debugging.

Yours, Steffen

jorainer commented 4 years ago

And if you want to run this without parallel processing simply call it with adjustRtime(..., BPPARAM = SerialParam() or alternatively (to disable parallel processing for all xcms functions:

register(SerialParam())

Michaelncallahan commented 4 years ago

@jorainer

I ran:

>register(serialParam()) >data <- adjustRtime(raw_data, param = ObiwarpParam())

and received:

Sample number 2 used as center sample.
Aligning LL2053_UPS2__Price_Standard_rep1O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds.
Found gaps in scan time of file LL2053_UPS2__Price_Standard_rep1O1.mzML: cut scantime-vector at NA seconds.
Error in if (rtmaxdiff > (5 * mstdiff)) { : argument is of length zero

@sneumann

I ran:

files <- c(
"/Users/michael/work/correspondence/JC_raw/centroided/LL2053_UPS2__Price_Standard_rep1O1.mzML",
"/Users/michael/work/correspondence/JC_raw/centroided/LL2053_UPS2__Price_Standard_rep2O1.mzML",
"/Users/michael/work/correspondence/JC_raw/centroided/LL2053_UPS2__Price_Standard_rep3O1.mzML"
)

xs <- xcmsSet(files)

xs <- retcor(xs, method="obiwarp")

and received:

center sample:  LL2053_UPS2__Price_Standard_rep1O1 
Processing: LL2053_UPS2__Price_Standard_rep2O1  Create profile matrix with method 'bin' and step 1 ... OK
Create profile matrix with method 'bin' and step 1 ... OK
Found gaps: cut scantime-vector at  0.5113442 seconds 
Found gaps: cut scantime-vector at  0.513545 seconds 
libc++abi.dylib: terminating with uncaught exception of type std::bad_alloc: std::bad_alloc
zsh: abort      R

jorainer commented 4 years ago

Could you also try to do a data <- filterEmptySpectra(data) before running adjustRtime? This should remove empty spectra from the data set - just in case.

Another possibility would be to filter by retention time. What puzzles me it that it tells you that it cuts at an rt of 0.5 seconds - so maybe do a data <- filterRt(data, rt = c(1, max(rtime(data)))) to remove spectra recorded before an rt of 1 second.

Michaelncallahan commented 4 years ago

# Get files
files <- dir("/Users/michael/work/correspondence/JC_raw/centroided", full.names = TRUE, pattern = "mzML$", recursive = TRUE)

# Build phenodata
pd <- data.frame(sample_name = "test",
                 sample_group = c(rep("1",1), rep("2", 1), rep("3", 1)),
                 stringsAsFactors = FALSE)

# Load raw data
raw_data <- readMSData(files = files, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk")

# Filter MS1
raw_data <- filterMsLevel(raw_data, msLevel = 1L)

# Filter empty spectra
raw_data <- filterEmptySpectra(raw_data)

# Subset raw_data in RT dimension 1..max_rt
raw_data <- filterRt(raw_data, rt = c(1, max(rtime(raw_data))))

# Run Obiwarp
raw_data <- adjustRtime(raw_data, param = ObiwarpParam())

=>

Sample number 2 used as center sample.
Aligning LL2053_UPS2__Price_Standard_rep1O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds.
Found gaps in scan time of file LL2053_UPS2__Price_Standard_rep1O1.mzML: cut scantime-vector at NA seconds.
Error in if (rtmaxdiff > (5 * mstdiff)) { : argument is of length zero

Is this error basically stating that the difference in RT from spectra to spectra within a sample is greater than 5 * mean distance between spectra within a sample?

Michaelncallahan commented 4 years ago

If I change the right-bound of the range in RT for filtering to 100 seconds, it doesn't error out.

raw_data <- filterRt(raw_data, rt = c(1, 100))

jorainer commented 4 years ago

I've added a tentative fix that should at least avoid the error that is thrown. Can you please install that with devtools::install_github("sneumann/xcms", ref = "jomaster") and try again?

Also, seems that there is something strange with the retention times in some of your files. Can you please check the rt range for your files?

rts <- split(rtime(data), fromFile(data))
lapply(rts, range)

Michaelncallahan commented 4 years ago

I ran the install, and then commented out the line. The package "xcms" appears to have been updated in

/usr/local/lib/R/4.0/site-library/xcms

The dataset that I am using is the filtered set of spectra from 1..3600 (as recommended by @sneumann, just for your edification)

Also, forgive my ignorance with the package usages, I am rather new to R language and RStudio, but I do have a ruby/c++ development background in mass spec and thanks again for the amazing help!

Anyways.

# devtools::install_github("sneumann/xcms", ref = "jomaster"E)

# Get files
files <- dir("/Users/michael/work/correspondence/JC_raw/centroided", full.names = TRUE, pattern = "mzML$", recursive = TRUE)

# Build phenodata
pd <- data.frame(sample_name = "test",
                 sample_group = c(rep("1",1), rep("2", 1), rep("3", 1)),
                 stringsAsFactors = FALSE)

# Load raw data
raw_data <- readMSData(files = files, pdata = new("NAnnotatedDataFrame", pd), mode = "onDisk")

# # Filter MS1
# raw_data <- filterMsLevel(raw_data, msLevel = 1L)
# 
# # Filter empty spectra
# raw_data <- filterEmptySpectra(raw_data)
# 
# # Subset raw_data in RT dimension 1..max_rt
# raw_data <- filterRt(raw_data, rt = c(1, max(rtime(raw_data))))

# Run Obiwarp
raw_data <- adjustRtime(raw_data, param = ObiwarpParam())

=>

Sample number 2 used as center sample.
Aligning LL2053_UPS2__Price_Standard_rep3O1.mzML against LL2053_UPS2__Price_Standard_rep2O1.mzML ... Found gaps in scan times of the center sample: cut scantime-vector at NA seconds.
Found gaps in scan time of file LL2053_UPS2__Price_Standard_rep3O1.mzML: cut scantime-vector at NA seconds.
Error: BiocParallel errors
  element index: 1, 2
  first error: argument is of length zero

...

rts <- split(rtime(data), fromFile(data))

print(lapply(rts, range))

=>

$`1`
[1]   0.5113442 869.2116754

$`2`
[1]   0.513545 874.731671

$`3`
[1]   0.5001967 890.4193777

Here is a snippet of the first 10 rts for each file:

F1:

   F1.S0001    F1.S0002    F1.S0003    F1.S0004    F1.S0005    F1.S0006    F1.S0007    F1.S0008    F1.S0009    F1.S0010 
  0.5113442   0.5666625   2.5963471   2.6626127   3.3369812   3.4070735   4.4129045   5.0772575   5.7421003   5.8551376

F2:

   F2.S0001    F2.S0002    F2.S0003    F2.S0004    F2.S0005    F2.S0006    F2.S0007    F2.S0008    F2.S0009    F2.S0010 
  0.5135450   0.5748717   1.1039869   1.5542145   2.3234385   2.5108162   2.8774145   2.9913400   3.0615366   3.6569360

F3:

   F3.S0001    F3.S0002    F3.S0003    F3.S0004    F3.S0005    F3.S0006    F3.S0007    F3.S0008    F3.S0009    F3.S0010 
  0.5001967   0.5512692   0.9897637   1.1963838   2.2049644   2.2941260   2.3448350   3.6762752   3.7908262   3.8471546

sneumann commented 4 years ago

Any chance of providing the slimmed mzMLs ? Ideally together with R script to reproduce. I can't promise timing, but I'd try to reproduce locally. Yours, Steffen

jorainer commented 4 years ago

What surprises me is that you don't have continuous retention times (e.g. for the first file 0.51, 0.56 but then it jumps to 2.59, 2.66 and then again to 3.33). Was your data recorded that way or was there some filtering going on that removed spectra in between?

Michaelncallahan commented 4 years ago

@jorainer

It was recorded that way. In the above snippet MS2 scans are included.

Here is a snippet of MS1 only.

F1:

   F1.S0001    F1.S0005    F1.S0010    F1.S0013    F1.S0015    F1.S0017    F1.S0020    F1.S0021    F1.S0023 
  0.5113442   3.3369812   5.8551376   8.0251921   9.1990242   9.7030199  11.9109058  12.3269175  13.4657453

Michaelncallahan commented 4 years ago

@sneumann @jorainer

Here is a link to a google drive with the snippets (which are spectra 0..3600) in .mzML and the full .raw files. These are public domain datasets, so I am fine sharing them.

No worries about the timing, I am very excited just to be helped!

ms files

sneumann commented 4 years ago

Hi, first I can confirm the bad_array. So the good news: it's not not you. Based on your R script in the google drive above, I poked around a bit.

## What RT do we have
firstFile <- fData(raw_data)[,"fileIdx"]==1
plot(rtime(raw_data)[firstFile])

## and what differences
deltaRt <- rtime(raw_data)[firstFile][2:1600]-rtime(raw_data)[firstFile][1:1599]
plot(deltaRt)

Looking at the mzML it was measured on "Orbitrap Fusion" and converted with pwiz. There are 3600 scans, and they cover ~14 minutes ("scan start time" given in minutes). Something is odd regarding the RT stuff. While I didn't debug where it breaks exactly, I see that inter-scan times vary from <0.5 to 3sec , confirming Jo's observation. Yours, Steffen Screenshot from 2020-06-10 09-16-23

Michaelncallahan commented 4 years ago

Interesting findings. I am not the original creator of the data, but I have the ability to reach out to them if it is necessary. These files are technical replicates that we're using to test a feature detection, alignment, and correspondence workflow. Let me know if I can do anything or assist in any manner.

I am curious though, even with a high deltaRt, shouldn't an alignment still be possible (although, the alignment quality might suffer)? Just speaking from my experience with Obiwarp and its matrix alignment.

jorainer commented 4 years ago

I guess alignment should be possible. The problem is that there is this limitation in the xcms code to cut the data if it finds gaps in retention times - could be that this has to do with the data it sends to the C++ code of obiwarp. That I don't know as I've not written or touched this part of the code.

Michaelncallahan commented 4 years ago

Thank you @jorainer @sneumann, I appreciate your time discussing this matter with me.

It sounds like the characteristics of this file make it something that the pre-processing steps within xcms and before obiwarp weren't designed to handle (the gaps in RT).

This is outside the purview of the issue at hand, but for discussions sake. I am looking for some additional proteomics datasets (ideally technical or biological replicates) that exhibit high retention time deviation between runs i.e features that occur at significantly different retention times. These will be used to test a new open-source warping-free correspondence algorithm.

Once again, thank you.

Michael

jorainer commented 4 years ago

Re data set: sorry, I don't know any such data sets.

Re pre-processing handling gaps: the peak groups alignment should be able to work also with such data sets - only the obiwarp implementation in xcms will fail on that.

Michaelncallahan commented 4 years ago

Thank you, I'll look into that!

sneumann / xcms

std::bad_array_length while running Obiwarp #481