sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)

copyWriteMSdata() dropping CV terms #159

Open trljcl opened 6 years ago

trljcl commented 6 years ago

Hello, I am having problems using copyWriteMSData() (to manipulate peak data) on mzML files that originated from an Orbitrap Fusion. I am using mzR_2.12.0. Specifically, certain CV fields that are required for downstream processes are being dropped:

1) drops MS:1000512, which prevents "activation" from being a usable filter when using msconvert to generate activation-type-filtered output files

2) drops MS:1000827, which is used for parsing the precursor ion when using msconvert for .mzML to .ms2 conversion

Several other non-critical (at least for my needs) values are also dropped. Might this be due to ontology mismatches between the pwiz version built into mzR and the one in msconvert? If so, how can these be aligned? Or am I missing something obvious? Header portions of the original mzML file generated by msconvert, and then after using copyWriteMSData(), are pasted below to show the version differences.

1) from msconvert: <?xml version="1.0" encoding="utf-8"?>

2) After copyWriteMSdata() on the same file (extracting header and peaks and adding them back with no changes):

jorainer commented 6 years ago

copyWriteMSData only copies general information (source file, etc.); it does not copy all fields from the spectrum headers. At present, only the fields that are returned as columns by mzR::header are saved/exported to the mzML.

In order to save these values we would have to add additional columns to the header - which is not a big problem, but we have to know which. Apart from MS:1000512 and MS:1000827, are there any other terms you need?
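
To see exactly which CV terms go missing between the source file and the rewritten copy, the cvParam accessions in the two mzML files can be compared. The following is a minimal sketch, assuming only Python's standard library; `cv_accessions` and `dropped_terms` are hypothetical helper names and are not part of mzR or ProteoWizard.

```python
import xml.etree.ElementTree as ET


def cv_accessions(mzml_source):
    """Return the set of cvParam accession strings found in an mzML document.

    `mzml_source` is a file path or a binary file-like object. Elements are
    matched on their local tag name, so the mzML namespace is not hard-coded.
    """
    accessions = set()
    for _event, elem in ET.iterparse(mzml_source, events=("end",)):
        # Namespaced tags look like "{namespace}cvParam"; keep the local part.
        if elem.tag.rsplit("}", 1)[-1] == "cvParam":
            accession = elem.get("accession")
            if accession:
                accessions.add(accession)
        # Release the element once it has been inspected to limit memory use.
        elem.clear()
    return accessions


def dropped_terms(original, rewritten):
    """Return the sorted accessions present in `original` but not in `rewritten`."""
    return sorted(cv_accessions(original) - cv_accessions(rewritten))
```

Calling `dropped_terms("original.mzML", "copy.mzML")` (file names are placeholders) would list every accession that did not survive the round trip, which gives a concrete answer to the question of which terms are needed.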

trljcl commented 6 years ago

Hi Johannes, For now, adding those two CV terms is all I need. But would it not be better to dynamically scrape all CV term metadata from the source mzML file and return it as separate columns from the header() function? This would future-proof copyWriteMSData() against vendor or mzML updates that introduce new terms.

I'm slightly confused by the copyWriteMSData() functionality, especially when contrasted with writeMSData(). The help vignettes suggest that in the copy function, metadata in the source mzML file is copied verbatim to a new mzML file, and subsequently only those linkable metadata fields that are specified in header() are updated. In contrast, my understanding is that the writeMSData() function exclusively uses the supplied header information to create metadata from scratch. I specifically used copyWriteMSData() because I knew the filterline CV term was not read in by the mzR header() function, but wanted it passed untouched between the source and copy mzML files. Have I misunderstood these differences?

thanks Tony

-- Dr. Tony R. Larson Head of Metabolomics & Proteomics, Department of Biology, Area 15 University of York Wentworth Way Heslington York YO10 5DD UK

Tel: +44(0)1904 328 733 (office) Tel: +44(0)7833 471 685 (mobile)

tony.larson@york.ac.uk http://scholar.google.com/citations?user=9hLFka4AAAAJ www.york.ac.uk/biology/technology-facility/proteomics/ www.york.ac.uk/mass-spectrometry

jorainer commented 6 years ago

Dynamically scraping CV terms would be the ideal solution indeed - I'll have to check if and how that can be done. For now I will add the two CV parameters you suggested.

Regarding the difference between copyWriteMSData and writeMSData, probably the documentation was not clear on that. We just copy general metadata over. For spectra, only the data passed with the header parameter is saved.

jorainer commented 6 years ago

Note that more recent mzR versions do return the MS_filter_string in the "filterString" header column. Also, there have been some fixes to ensure that the exported mzML file is valid and can be opened with other software too. To use the newer mzR version you should update to R version 3.5 and related Bioconductor 3.7 (just released this Tuesday).

Regarding MS_isolation_window_target_m_z, that is currently returned for chromatographic data (i.e. by the chromatogramHeader function). Do you need it to be included in the data.frame returned by header as well?

trljcl commented 6 years ago

OK, thanks - just updated to latest R and mzR versions - will test and get back to you.

Also, in case I want to do any future installs from GitHub rather than Bioconductor, I note that install_github() fails because of:

In file included from rnetCDF.c:2:0: rnetCDF.h:1:20: fatal error: netcdf.h: No such file or directory

This is probably because I am building on Windows using Rtools (a necessary evil to chain conversions from Thermo .raw files onwards), and even the latest version of Rtools doesn't contain the netCDF libraries. Any idea how to solve this?

thanks Tony

jorainer commented 6 years ago

You'll need to install the NetCDF library (https://www.unidata.ucar.edu/software/netcdf/docs/winbin.html) on Windows. Then install_github should work.

trljcl commented 6 years ago

Any advice on how to do this? I don't think it's straightforward for Windows, looking at https://www.unidata.ucar.edu

trljcl commented 6 years ago

OK, just tested with R 3.5.0 and mzR 2.14.0. All works fine now; i.e. openMSfile() and header() now include the extra fields, and subsequently copyWriteMSdata() results in a new .mzML file where the precursor mass is correctly parsed into .ms2 files by msconvert, and downstream activation type scan filtering in msconvert now works (e.g. by CID or HCD). Nothing else needed for now!

thanks Tony

trljcl commented 6 years ago

Sorry, jumped the gun. In my previous post I had actually run the msconvert mzML to ms2 conversions using the mzML file generated from the original raw file by msconvert! There are now more problems with msconvert handling copyWriteMSData()-processed data. I've attached some toy files to demonstrate. Note that msconvert was version 3.0.18114.

The files are: toy.mzML = file from msconvert raw to mzML; scans 1, 2 are full ms, scans 3, 5, 7 are ddms2 HCD and scans 4, 6 are ddms2 CID. Running this toy.mzML through msconvert to generate ms2 files with no (toy.ms2), HCD (toy_HCD.ms2), or CID (toy_CID.ms2) activation filtering gives the expected results.

If I then use mzR functions to open toy.mzML, extract pks and hdrs, and copyWriteMSData() to toy_cw.mzML, this file is not readable in msconvert; I get an Invalid cvParam accession "1002869" error. There is no way that I can see on the msconvert command line to ignore this, so I just edited the mzR CV from "MS:1002869" to "MS:-1" in the mzML file and saved it as toy_cw_cvedit.mzML (xcms doesn't care and will read it in fine). When this is then processed through msconvert to the corresponding ms2 files (toy_cw_cvedit.ms2, toy_cw_cvedit_HCD.ms2, toy_cw_cvedit_CID.ms2), these files are all incomplete, in that precursor m/z is dropped from all output and activation filtering does not work.

I can only guess this is a ProteoWizard msconvert issue, requiring a prescribed mzML input with some order or hierarchy in addition to just containing the right CV terms? For now, I think I will generate my own ms2 files by parsing the info from the mzR hdr and pks, avoiding msconvert except for the first conversion from vendor files.
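
The workaround of writing ms2 records directly from header and peak values could look roughly like the sketch below. It assumes the commonly described MS2 text layout (an `S` line with first scan, last scan and precursor m/z; an optional `Z` line with charge and singly protonated mass; then one "m/z intensity" pair per line). The function name `format_ms2_scan` and the number formatting are assumptions for illustration; the inputs stand in for values such as precursorMZ and precursorCharge from the mzR header and the peak matrix of the same scan.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def format_ms2_scan(scan, precursor_mz, charge, peaks):
    """Format one MS2 scan as a text record.

    scan         -- scan number, used as both first and last scan on the S line
    precursor_mz -- precursor m/z
    charge       -- precursor charge; 0 or None means unknown, so no Z line
    peaks        -- iterable of (mz, intensity) pairs
    """
    lines = [f"S\t{scan}\t{scan}\t{precursor_mz:.4f}"]
    if charge and charge > 0:
        # Z line mass is the singly protonated mass (M+H)+ derived from m/z.
        mh = precursor_mz * charge - (charge - 1) * PROTON_MASS
        lines.append(f"Z\t{charge}\t{mh:.4f}")
    lines.extend(f"{mz:.4f} {intensity:.1f}" for mz, intensity in peaks)
    return "\n".join(lines) + "\n"
```

Filtering by activation type would then be a matter of selecting scans before formatting them, for example by matching on the filter string, rather than relying on msconvert.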

toy.zip

jorainer commented 6 years ago

Hi Tony,

thanks for the toy examples. I had a look at the mzML-produced output and to me it seems the only obvious difference is the lack of the CV terms MS:1000500 and MS:1000501 and that it does not contain the TIC chromatogram. I can add the two CV terms and you can check if that eventually fixes the problem you've seen.

Regarding MS:1002869, that's the new CV term for the mzR package. It has been added recently to https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo , so I guess proteowizard hasn't updated their list of CV terms yet.
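
Whether a given tool's copy of the controlled vocabulary already knows an accession can be checked against the OBO text itself. This is a minimal sketch, assuming the OBO file has already been read into a string; `obo_term_names` and `missing_accessions` are hypothetical helper names, and only `[Term]` stanzas with `id:` and `name:` tags are considered.

```python
def obo_term_names(obo_text):
    """Map term id -> name for every [Term] stanza in OBO-formatted text."""
    terms = {}
    in_term = False
    current_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):].strip()
            terms[current_id] = None
        elif in_term and current_id and line.startswith("name: "):
            terms[current_id] = line[len("name: "):].strip()
    return terms


def missing_accessions(accessions, obo_text):
    """Return the sorted accessions that have no [Term] stanza in the OBO text."""
    known = obo_term_names(obo_text)
    return sorted(a for a in accessions if a not in known)
```

Run against an older psi-ms.obo, `missing_accessions(["MS:1002869"], obo_text)` would be expected to return that accession, which would match the "Invalid cvParam accession" error reported above.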

jorainer commented 6 years ago

Can you please test if adding the scan windows fixes your problem? If so I will make a pull request against sneumann/mzR.

To install please use install_github("jotsetung/mzR", ref = "scanWindow").

trljcl commented 6 years ago

No, it doesn't, sorry!
