sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)

Merged ID in writeMSData #157

Open tsufz opened 6 years ago

tsufz commented 6 years ago

Hi Steffen, is there any reason why the controllerType, controllerNumber and the scan# are merged into one ID by writeMSData?

Yours Tobias

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] MSnbase_2.4.2 ProtGenerics_1.10.0 BiocParallel_1.12.0 mzR_2.12.0 Rcpp_0.12.16 Biobase_2.38.0 BiocGenerics_0.24.0

loaded via a namespace (and not attached):
[1] IRanges_2.12.0 zlibbioc_1.24.0 doParallel_1.0.11 munsell_0.4.3 colorspace_1.3-2 impute_1.52.0
[7] lattice_0.20-35 rlang_0.2.0 foreach_1.4.4 plyr_1.8.4 tools_3.4.4 mzID_1.16.0
[13] grid_3.4.4 gtable_0.2.0 affy_1.56.0 iterators_1.0.9 digest_0.6.15 lazyeval_0.2.1
[19] tibble_1.4.2 preprocessCore_1.40.0 affyio_1.48.0 ggplot2_2.2.1 S4Vectors_0.16.0 codetools_0.2-15
[25] MALDIquant_1.17 limma_3.34.9 BiocInstaller_1.28.0 compiler_3.4.4 pillar_1.2.1 pcaMethods_1.70.0
[31] scales_0.5.0 stats4_3.4.4 XML_3.98-1.10 vsn_3.46.0

jorainer commented 6 years ago

Dear Tobias,

I'm the culprit for problems in writeMSData. The spectrum ID that is saved is either what is provided in the "spectrumId" column of the data.frame passed with the header parameter or, if that is not defined, "scan=<acquisition number>". We rely on proteowizard for the mzML export, which means that we set this ID as the ID of a proteowizard Spectrum object. It could be that proteowizard translates the ID into something like what you described during the export; I've never specifically looked at that. What I do know is that proteowizard has routines to extract the acquisition number from the spectrum ID, which, depending on the vendor, can be something like id="controllerType=0 controllerNumber=1 scan=1" (Thermo MS).

To summarize, we don't paste controllerType, scan ID etc. together. Such IDs may be taken from the original file, if you are using writeMSData to e.g. subset MS data read from an mzML file, or proteowizard may create them during the mzML export.
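For illustration only, the fallback rule for the saved spectrum ID could be sketched in C++ roughly as below. `chooseSpectrumId` is a hypothetical helper, not a function from mzR, MSnbase or proteowizard, and it assumes that an empty string stands for a missing "spectrumId" entry.

```cpp
#include <string>

// Hypothetical helper (not part of mzR or proteowizard): picks the ID that
// would be written for one spectrum, following the rule described above.
std::string chooseSpectrumId(const std::string& spectrumId, int acquisitionNum)
{
    // An ID supplied through the header's "spectrumId" column is kept as-is,
    // e.g. "controllerType=0 controllerNumber=1 scan=7" from a Thermo file.
    if (!spectrumId.empty())
        return spectrumId;

    // Assumption: an empty string means no ID was provided, so fall back to
    // "scan=<acquisition number>".
    return "scan=" + std::to_string(acquisitionNum);
}
```

Under these assumptions a Thermo-style ID read from the input file passes through unchanged, while a spectrum without an ID ends up as e.g. "scan=7".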

tsufz commented 6 years ago

Hi, okay, maybe it is an MSnbase-related problem. I will check. TXS

lgatto commented 6 years ago

Hi, okay, maybe it is an MSnbase-related problem. I will check. TXS

No, as MSnbase and mzR do the same thing with the same underlying code.

sneumann commented 6 years ago

Hi, can you give the input, a code snippet and the output? Does it happen with any of the public data, e.g. from the msdata package? Yours, Steffen


tsufz commented 6 years ago

Hi, the input mzML after pwiz export seems to be OK. The fields are separated.

The code for file import is: MSFILE <- MSnbase::readMSData(file.path(dir_in,files[i]),msLevel. = NULL, mode = "onDisk").

The code for file export is: MSnbase::writeMSData(MSFILE,file.path(dir_out,files[i]),outformat="mzml")

There is no change whether I set msLevel. = 1 or msLevel. = 2 or msLevel. = NULL if the mode is set to onDisk, but in case of mode = "inMemory" the output is Id = scanID. However, in the latter case, the controllerType and controllerNumber are missing.

I need the onDisk option in order to compute DIA data.

In my opinion, the onDisk and inMemory functions should be reviewed. Obviously, different outputs are generated. This should not happen; I would expect the data to be written out as it was read in, without unwanted manipulation.

In addition, the field with the scanID is Scan = Integer. I don't understand why this integer field is translated to Id = scan=Integer (e.g. Id = scan=1)? For me, this is an unwanted manipulation.

lgatto commented 6 years ago

In my opinion, the onDisk and inMemory functions should be reviewed.

Thank you for your constructive feedback. If you have time to contribute, please open an issue in MSnbase so that we can discuss further.

As for the problem at hand, could you share one file so that we can inspect exactly what the input looks like, how the data is imported, and how it is eventually serialised to disk?

jorainer commented 6 years ago

if the mode is set to onDisk, but in case of mode = "inMemory" the output is Id = scanID. However, in the latter case, the controllerType and controllerNumber are missing.

inMem and onDisk data are represented differently. inMem stores the data in Spectrum objects, while onDisk contains only the spectrum header (i.e. what is returned by mzR::header). Spectrum objects don't have an ID field/attribute, which is why for inMem the ID of the spectrum has to be created artificially upon export.

The only solution I see here is that inMem (i.e. the MSnExp) also keeps the mzR::header in the featureData slot (same as the onDisk OnDiskMSnExp object) and that information not stored in the Spectrum objects (such as the spectrum ID) is extracted from this header data.frame upon export.

I don't understand why this integer field is translated to Id = scan=Integer (e.g. Id = scan=1)? For me, this is an unwanted manipulation.

Here we wanted to be compliant with proteowizard. pwiz creates scan IDs from acquisition numbers, and extracts acquisition numbers from scan IDs, using e.g.:

PWIZ_API_DECL string translateScanNumberToNativeID(CVID nativeIdFormat, const string& scanNumber)
{
    switch (nativeIdFormat)
    {
        case MS_Thermo_nativeID_format:
            return "controllerType=0 controllerNumber=1 scan=" + scanNumber;

        case MS_spectrum_identifier_nativeID_format:
            return "spectrum=" + scanNumber;

        case MS_multiple_peak_list_nativeID_format:
            return "index=" + scanNumber;

        case MS_Agilent_MassHunter_nativeID_format:
            return "scanId=" + scanNumber;

        case MS_Bruker_Agilent_YEP_nativeID_format:
        case MS_Bruker_BAF_nativeID_format:
        case MS_scan_number_only_nativeID_format:
            return "scan=" + scanNumber;

        default:
            return "";
    }
}
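Going the other way, i.e. getting the acquisition number back out of such an ID, amounts to reading one key=value pair from a space-separated string. A minimal sketch of that idea follows; `nativeIdValue` is a hypothetical helper written for this thread and not pwiz's own routine, and it assumes the ID consists of whitespace-separated key=value tokens as in the formats above.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not pwiz code): returns the value stored under `key`
// in a native ID such as "controllerType=0 controllerNumber=1 scan=15",
// or an empty string if the key is not present.
std::string nativeIdValue(const std::string& id, const std::string& key)
{
    const std::string prefix = key + "=";
    std::istringstream tokens(id);
    std::string token;

    // Walk over the whitespace-separated "key=value" tokens.
    while (tokens >> token)
    {
        // A token matches when it starts with "<key>=".
        if (token.compare(0, prefix.size(), prefix) == 0)
            return token.substr(prefix.size());
    }
    return "";
}
```

With this sketch, asking for "scan" in a Thermo-style ID gives back only the scan number, while controllerType and controllerNumber stay available under their own keys, which is the information that is lost once only scan=<number> is written.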