lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

Error in reading Agilent .d files converted to .mzML via msconvert #551

Closed jamesrgraham closed 9 months ago

jamesrgraham commented 3 years ago

Hello,

I have some Agilent .d files that I converted to .mzML on a linux server with msconvert installed via docker and wine.

All other file types convert to .mzML with no issues.

When I try to read in the Agilent .mzML files via readMSData, I get this error:

Error: Can not open file 0714_48mix_50uM_02.mzML! Original error was: Error in pwizModule$open(filename): [IO::HandlerBinaryDataArray] Unknown binary data type.

I've seen references to this error, but no solutions.

This error occurs both on the linux server as well as on my Mac.

I tried removing the <binaryData* tags, but then that gave me an istream error.

> packageVersion("MSnbase")
[1] ‘2.18.0’

I tried to update MSnbase, but it keeps installing this version.

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base

other attached packages:
[1] RColorBrewer_1.1-2 magrittr_2.0.1 MSnbase_2.18.0 ProtGenerics_1.24.0
[5] S4Vectors_0.30.0 Biobase_2.52.0 BiocGenerics_0.38.0 mzR_2.26.1
[9] Rcpp_1.0.7 MASS_7.3-54

loaded via a namespace (and not attached):
[1] plyr_1.8.6 compiler_4.1.1 pillar_1.6.2
[4] BiocManager_1.30.16 iterators_1.0.13 zlibbioc_1.38.0
[7] tools_4.1.1 digest_0.6.27 ncdf4_1.17
[10] MALDIquant_1.20 preprocessCore_1.54.0 lifecycle_1.0.0
[13] tibble_3.1.4 gtable_0.3.0 lattice_0.20-44
[16] clue_0.3-59 pkgconfig_2.0.3 rlang_0.4.11
[19] foreach_1.5.1 cluster_2.1.2 IRanges_2.26.0
[22] vctrs_0.3.8 MsCoreUtils_1.4.0 grid_4.1.1
[25] glue_1.4.2 impute_1.66.0 R6_2.5.1
[28] fansi_0.5.0 XML_3.99-0.7 BiocParallel_1.26.2
[31] limma_3.48.3 ggplot2_3.3.5 scales_1.1.1
[34] pcaMethods_1.84.0 codetools_0.2-18 ellipsis_0.3.2
[37] mzID_1.30.0 colorspace_2.0-2 utf8_1.2.2
[40] affy_1.70.0 doParallel_1.0.16 munsell_0.5.0
[43] vsn_3.60.0 crayon_1.4.1 affyio_1.62.0

lgatto commented 3 years ago

When you say all other files, do you mean that one or some files of the same series (same acquisitions/conversions) fail while the others work?

jamesrgraham commented 3 years ago

Sorry I was unclear.

“All other files” means files from other instruments (Waters, specifically).

lgatto commented 3 years ago

Ok, thank you.

I am not sure that there's anything that can be done on the MSnbase side here, and I do not have any experience with Agilent data. Maybe @jorainer has?

jorainer commented 3 years ago

I guess it's most likely the second point above. I've already seen a case where an additional binary array was added to the TIC in the mzML - could you maybe have a look into one of your mzML and see if you have a "non-standard data array" in there?

If that's the case you could try to avoid exporting the TIC to the mzML (e.g. with --chromatogramFilter "index[1-]" in the msconvert call).

Ultimately, it would be good if someone could have a go at updating mzR - we definitely need the new proteowizard libraries in there - but my C++ knowledge is too limited to do that - maybe you @lgatto ?
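One way to do the check suggested above without opening the file in an editor is a plain text search. This is a minimal base-R sketch; `find_nonstandard_arrays` is a hypothetical helper (it is not part of MSnbase or mzR), and it assumes the array is declared with the literal name "non-standard data array" in the mzML text.

```r
# Hypothetical helper (not part of MSnbase or mzR): return the line numbers
# of an mzML file that mention a "non-standard data array".
find_nonstandard_arrays <- function(path) {
  # Reads the whole file into memory, which can be slow for large mzML files
  lines <- readLines(path, warn = FALSE)
  # fixed = TRUE matches the literal text, with no regex interpretation
  which(grepl("non-standard data array", lines, fixed = TRUE))
}

# Usage (the file name is a placeholder):
# hits <- find_nonstandard_arrays("0714_48mix_50uM_02.mzML")
# length(hits) > 0   # TRUE if at least one such array is declared
```

An empty result would suggest the error has a different cause; a non-empty one gives the line numbers to inspect.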

lgatto commented 3 years ago

I have little C++ skill and even less time, but hopefully, with a bit of collective knowledge, we will get there.

jamesrgraham commented 3 years ago

Thank you both for your replies.

These were the first Agilent files I tried to convert (there were no warnings or anything from msconvert). Also, the converted mzML files were readable in Skyline. Not sure if that offers you any clues.

I’m not sure if it was a standard run or not, but I’ll ask. The mzML files definitely had the “binaryData” that was mentioned in previous issues.

I will try the export without TIC and see if that results in a readable file…though, I don’t know what that would do to the rest of my pipeline.

Thanks again, I appreciate your help!

james

jamesrgraham commented 3 years ago

The Agilent runs were QQQ MRM.

jorainer commented 3 years ago

For MRM data, msconvert --chromatogramFilter "index[1-]" should fix the problem, as it will export all chromatograms except the TIC. Alternatively, you could simply delete the one problematic entry (the non-standard data array) by hand:

Open the converted mzML file with an editor and delete everything from

<binaryDataArray arrayLength="...

up to and including the next </binaryDataArray>

Also, you should change the number of arrays for the TIC from 3 to 2 then, i.e. search for "total ion current" (that should be way before the lines that you deleted above) and change the

<binaryDataArrayList count="3"> to <binaryDataArrayList count="2">

jamesrgraham commented 3 years ago

Thank you for the suggestions!

I'm waiting for my IT folks to get docker up and running again, so I can test the conversion filter.

I did remove the tags, but that yielded a different error when reading the file in. I will try your method as well.

jamesrgraham commented 3 years ago

Removing the <binaryDataArray arrayLength="... (there was only one section in the mzML file that had this) and changing the <binaryDataArrayList count="3"> to <binaryDataArrayList count="2"> (there was also only one) yielded the stream error:

Error: Can not open file [...] 0714_48mix_50uM_02.mzML! Original error was: Error in pwizModule$open(filename): [SpectrumList_mzML::create()] Bad istream.

But this is still on the mzML file that was converted WITH the TIC.

So, I'll wait until I can get the files converted without the TIC and try again.

Thanks so much for your help! james

jorainer commented 3 years ago

If I got you correctly, the data is from an MRM experiment, so the mzML file should only have chromatograms, but no spectra in it. If that's the case, you should read the files with readSRMData and not with readMSData.

jamesrgraham commented 3 years ago

noTIC is the full path to the file.

noticdata <- readSRMData(noTIC, pdata = NULL)
Error: Can not open file /Users/graham/Documents/LCMS/KATIE/noTIC/0714_48mix_50uM_02.mzML! Original error was: Error in pwizModule$open(filename): [IO::HandlerBinaryDataArray] Unknown binary data type.

I converted the .d file to mzML via:

docker run -it --rm -e WINEDEBUG=-all -v /path/to/data:/data chambm/pwiz-skyline-i-agree-to-the-vendor-licenses wine msconvert 0714_48mix_50uM_02.d --mzML --chromatogramFilter "index[1-]" -o output3

format: mzML
    m/z: Compression-None, 64-bit
    intensity: Compression-None, 32-bit
    rt: Compression-None, 64-bit
ByteOrder_LittleEndian
 indexed="true"
outputPath: output3
extension: .mzML
contactFilename:
runIndexSet:

spectrum list filters:

chromatogram list filters:
  index[1-]

filenames:
  0714_48mix_50uM_02.d

processing file: 0714_48mix_50uM_02.d
calculating source file checksums
[ChromatogramListFactory] Ignoring wrapper: index[1-]
writing output file: output3\0714_48mix_50uM_02.mzML

This yields the same binary data error.

jamesrgraham commented 3 years ago

0714_48mix_50uM_02.mzML.zip

Attached is the converted mzML file using the --chromatogramFilter "index[1-]" filter.

jorainer commented 3 years ago

Hm, what puzzles me is that the file above still contains the TIC with the "non-standard data array" binary data type. Could you maybe use --chromatogramFilter "index[2-]" instead, just to see if we get rid of the TIC that way?

jamesrgraham commented 3 years ago

docker run -it --rm -e WINEDEBUG=-all -v /mnt/m176906/KATIE:/data chambm/pwiz-skyline-i-agree-to-the-vendor-licenses wine msconvert 0714_48mix_50uM_02.d --mzML --chromatogramFilter "index[2-]" -o output4


    m/z: Compression-None, 64-bit
    intensity: Compression-None, 32-bit
    rt: Compression-None, 64-bit
ByteOrder_LittleEndian
 indexed="true"
outputPath: output4
extension: .mzML
contactFilename:
runIndexSet:

spectrum list filters:

chromatogram list filters:
  index[2-]

filenames:
  0714_48mix_50uM_02.d

processing file: 0714_48mix_50uM_02.d
calculating source file checksums
[ChromatogramListFactory] Ignoring wrapper: index[2-]
writing output file: output4\0714_48mix_50uM_02.mzML```

[0714_48mix_50uM_02.mzML.zip](https://github.com/lgatto/MSnbase/files/7136916/0714_48mix_50uM_02.mzML.zip)

jamesrgraham commented 3 years ago

Not sure what happened to the mzML file I attached...

jorainer commented 3 years ago

Hm, seems the link to the file is within ``` (i.e. formatted as code) - can you please add it again?

jamesrgraham commented 3 years ago

0714_48mix_50uM_02newname.mzML.zip

jamesrgraham commented 3 years ago

There you go.

jorainer commented 3 years ago

The TIC and the problematic data array are still in this file. Are you sure this is the correct file? It seems that msconvert is not applying the filter (although it lists it).

jamesrgraham commented 3 years ago

Yeah, it was the “converted” file.

Do you know what the “ignoring wrapper” part in the output means?

jorainer commented 3 years ago

Ah, I overlooked that before. It seems that msconvert is ignoring this filter. Maybe try --chromatogramFilter "1-" instead? The problem is that the chromatogram filters are not documented (or at least I did not find any documentation for them).
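To spot this without reading the whole log, the msconvert console output can be scanned for the "Ignoring wrapper" line. This is a minimal base-R sketch; `ignored_filters` is a hypothetical helper, and reading "Ignoring wrapper" as "this filter was not applied" is the working guess from this thread, not documented msconvert behaviour.

```r
# Hypothetical helper: extract the filter strings that msconvert reported as
# ignored from its console output (character vector, one element per line).
ignored_filters <- function(log_lines) {
  # Keep only the lines that carry the marker text
  hits <- grep("Ignoring wrapper: ", log_lines, fixed = TRUE, value = TRUE)
  # Drop everything up to and including the marker, leaving the filter string
  sub(".*Ignoring wrapper: ", "", hits)
}

# Example input, copied from the msconvert output posted in this thread
log_lines <- c(
  "calculating source file checksums",
  "[ChromatogramListFactory] Ignoring wrapper: index[1-]",
  "writing output file: output3\\0714_48mix_50uM_02.mzML"
)
ignored_filters(log_lines)
```

A non-empty result is a hint that the filter string should be rechecked before looking at the output file.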

jamesrgraham commented 3 years ago

I tried with 1- and 2- and both yielded:

[ChromatogramListFactory] Ignoring wrapper: 1-

[ChromatogramListFactory] Ignoring wrapper: 2-

I also tried it with the following syntax:

--chromatogramFilter "index[2-]"

Which also yielded the "ignoring wrapper" warning.

Attached is a tarball of the two files, but they are the same size, so the filters are likely not working.

filter1_2.tar.gz
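File size alone is only a rough signal, so the two outputs can also be compared by content. This is a minimal base-R sketch; `same_content` is a hypothetical helper, and treating identical content as "the filter had no effect" is an assumption.

```r
# Hypothetical helper: TRUE if two files have identical content.
same_content <- function(a, b) {
  # Files of different size can never be identical, so skip hashing
  if (file.size(a) != file.size(b)) return(FALSE)
  # tools ships with R; md5sum() returns a named character vector
  unname(tools::md5sum(a) == tools::md5sum(b))
}

# Usage (the paths are placeholders for the two converted files):
# same_content("filter1/0714_48mix_50uM_02.mzML", "filter2/0714_48mix_50uM_02.mzML")
```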

jorainer commented 3 years ago

Hm, but then it seems that there is a problem with msconvert.

jorainer commented 3 years ago

After trying myself I think the problem is a missing whitespace in the filter definition. It should be --chromatogramFilter "index [1,]". Sorry for that.

jamesrgraham commented 3 years ago

That worked! Thank you.

Two mzML files below: index1 and index2.

I'll try reading them in myself in a bit...

index12.tar.gz

jamesrgraham commented 3 years ago

I know this isn't an msconvert forum, but:

Is the --chromatogramFilter "index [1,]" flag only an option in the msconvert command line version?

I tried with the GUI version and did not see any of the chromatogram filters (just wanted to control for something wrong with the command line version installed).

jorainer commented 3 years ago

Honestly, I've no idea. I'm only using the command line version from the docker image. The problem also is that the documentation on the chromatogram filters is pretty scarce.

jamesrgraham commented 3 years ago

Yeah, I've found the same.

I do very much appreciate your efforts, thank you.

jorainer commented 3 years ago

The developmental mzR version with an updated proteowizard code is available. With this version it should be possible to read the mzML files. It might take some time until this version becomes "stable" because we had to remove the ramp backend and hence mzData support. To install:

BiocManager::install("sneumann/mzR", ref = "feature/updatePwiz_3_0_21263")

jamesrgraham commented 3 years ago

Thank you!

I have installed it. Is there a special library call to invoke that version? I installed it, reloaded the mzR library, and got the same error, so I am likely doing something wrong.

lgatto commented 3 years ago

I think you'll need to restart R.

jamesrgraham commented 2 years ago

Sorry for the delay in getting back to you. This is what I'm getting now:

raw_data12 <- readSRMData(files = FILES[1])
Error in readSRMData(files = FILES[1]) : file(s) '/Users/graham/Documents/LCMS/KATIE/mzML/0714_48mix_50uM_02.mzML' do not contain SRM chromatogram data

raw_data12 <- readMSData(files = FILES[1])
Warning message: In readMSData(files = FILES[1]) : Dropping 1 file(s) without any spectra: 0714_48mix_50uM_02.mzML. They/it contain(s) chromatograms and can be read with `readSRMData()`.

jorainer commented 2 years ago

I just tried with the mzML file you shared previously and was able to read it. I guess there was some problem installing the updated mzR version? This is the version I have installed locally:

> packageVersion("mzR")
[1] ‘2.27.2’

If you're on Windows it can happen that installation from github fails due to some warnings - and mzR usually throws warnings during installation (e.g. because of different Rcpp version etc). Please try installing using the following commands:

Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")
BiocManager::install("sneumann/mzR", ref = "feature/updatePwiz_3_0_21263")

and then check for the installed package version.
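The version check can be scripted so that it also covers the case where the package did not install at all. This is a minimal base-R sketch; `has_min_version` is a hypothetical helper, and using 2.27.2 as the threshold is an assumption based only on the version reported above.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
has_min_version <- function(pkg, min_version) {
  # A package that cannot be loaded counts as not meeting the requirement
  if (!requireNamespace(pkg, quietly = TRUE)) return(FALSE)
  # package_version objects compare correctly against version strings
  utils::packageVersion(pkg) >= min_version
}

# Usage, in a fresh R session after installing the development branch:
# has_min_version("mzR", "2.27.2")
```

Running it in a freshly started session matters, because an already loaded namespace keeps the old code until R is restarted.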

jamesrgraham commented 2 years ago

OK. Got it to read in the file!

Thank you!

Now, to examine the data...

jamesrgraham commented 2 years ago

Still haven't examined the data as other priorities have come up.

Would this new library have any issues reading in other (non-Agilent) files? I tried to read in an mzML file that I can read in on my server at work, but on my local Mac (and windows), I get:

> file.path <- "/Users/graham/Documents/LCMS/GUY/OK1/OK1_BAs_Mix05.mzML"
> raw_data12 <- readMSData(files = file.path, mode = "onDisk", centroided = FALSE)
Error: Can not open file /Users/graham/Documents/LCMS/GUY/OK1/OK1_BAs_Mix05.mzML! Original error was: Error in pwizModule$open(filename): [SpectrumList_mzML::create()] Bad istream.

jorainer commented 2 years ago

This new version should not have any problems with older mzML files. The problems you get are actually very strange, because I did not get any errors reading old (or newly) converted mzML files on linux or macOS. Can you try to read these files also with the original mzR version to see if you would get the same error there?

jamesrgraham commented 2 years ago

I did a fresh install on a new Windows machine that never had R installed before and DID get the error. But the very same file can be read in without issue on my server.

Perhaps some more basic libraries need to be installed?

jamesrgraham commented 2 years ago

It must be the file itself. I tried another file and was able to read it no problem. I'm transferring more files to my local machine for testing.

Another issue I noted with a couple of the files (they are blanks) is that one compound crashes my pipeline, but only in those files (a divide by zero in some mz manipulation, so probably not an mzR issue). Maybe this is related, though. The same files read in fine on the server, which is weird.

jorainer commented 2 years ago

I'm also experiencing some random strange errors on macOS from time to time (segmentation faults or reading mzML files fail) - but they never happen on linux. I have no explanation or solution for these, unfortunately...

jamesrgraham commented 2 years ago

I've gotten readSRMData() to work on the files.

I know I'm now straying from the initial problem, but how does one extract the mz/rt/areas?

I used findChromPeaks() and wrote out the result, which looks like this:

new("XChromatogram", chromPeaks = c(7.64548333333333, 29.9903833333333, 0.00971666666666667, 25.6858, 20.6782, 29.9903833333333, 56054.1221246123, 391.131991402045, 2183539.35853297, 138058.815387333, 358614.59375, 150.280014038086, 152685.203974495, 65704.7064166921, 80.3030163568214, 34.5566300908969), chromPeakData = new("DFrame", rownames = NULL, nrows = 2, listData = list(ms_level = c(1, 1), is_filled = c(FALSE, FALSE)), elementType = "ANY", elementMetadata = NULL, metadata = list()), rtime = c(0.00971666666666667, 
0.0205333333333333, 0.0313333333333333, 0.04215, 0.0529666666666667, 0.0637833333333333, 0.0746, 0.0854166666666667, 0.0962333333333333, 0.10705, 0.117866666666667, 0.128683333333333, 0.1395, 0.150316666666667, 0.161133333333333, 0.17195, 0.18275, 0.193566666666667, 0.204383333333333, 0.2152, 0.226016666666667, 0.236833333333333, 0.24765, 0.258466666666667, 0.269283333333333, 0.2801, 0.290916666666667, 0.301733333333333, 0.31255, 0.323366666666667, 0.334183333333333, 0.344983333333333, 0.3558, 0.366616666666667, 

...and the list goes on.

Is there a way to extract such information as if I used readMSData()?

The goal is a list of what's in the file with mz, RT, and peak area.

Thank you.

jamesrgraham commented 2 years ago

I think I figured it out by using filterIntensity.