Open tsufz opened 6 years ago
Dear Tobias,
I'm the culprit for problems in `writeMSData`. The spectrum ID that is saved is either what is provided in the column "spectrumId" of the `data.frame` passed with the `header` parameter or, if that is not defined, it is set to "scan=<acquisition number>". We rely on proteowizard for the mzML export, which means that we set this provided ID as the ID of a proteowizard `Spectrum` object. It could be that proteowizard translates the ID into something like you described during the export. I've never specifically looked at that. What I know is that proteowizard has routines to extract the acquisition number from the spectrum ID that, depending on the vendor, can be something like `id="controllerType=0 controllerNumber=1 scan=1"` (Thermo MS).
Summarizing, we don't paste controllerType, scan ID etc. Such IDs could be taken from the original file, if you are using `writeMSData` to e.g. subset MS data read from an mzML file, or proteowizard could add them during mzML export.
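The fallback rule described above can be sketched as follows. This is only an illustration, not the code used by `writeMSData` or mzR: the function name `resolveSpectrumId` is hypothetical, and treating an empty string as "spectrumId not defined" is an assumption made for the sketch.

```cpp
#include <string>

// Hypothetical helper (not MSnbase, mzR or pwiz code): choose the spectrum ID
// that gets written. Assumption: an empty spectrumId means "not defined".
std::string resolveSpectrumId(const std::string& spectrumId, int acquisitionNum)
{
    // A provided ID (e.g. one read from the original mzML) is kept unchanged.
    if (!spectrumId.empty())
        return spectrumId;
    // Otherwise fall back to "scan=<acquisition number>".
    return "scan=" + std::to_string(acquisitionNum);
}
```

Under these assumptions, a Thermo-style ID passed in from the original file would survive the export as-is, while a spectrum without an ID would end up as e.g. `scan=7`.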
Hi, okay, maybe it is a MSnbase related problem. I will check. TXS
> Hi, okay, maybe it is a MSnbase related problem. I will check. TXS

No, as `MSnbase` and `mzR` do the same thing with the same underlying code.
Hi, can you provide the input, a code snippet and the output? Does it happen with any of the public data, e.g. from the msdata package? Yours, Steffen
I blame Android for the brevity and typos
Hi, the input mzML after pwiz export seems to be OK. The fields are separated.
The code for file import is:
`MSFILE <- MSnbase::readMSData(file.path(dir_in, files[i]), msLevel. = NULL, mode = "onDisk")`
The code for file export is:
`MSnbase::writeMSData(MSFILE, file.path(dir_out, files[i]), outformat = "mzml")`
There is no change whether I set `msLevel. = 1`, `msLevel. = 2` or `msLevel. = NULL` if the `mode` is set to `onDisk`, but in the case of `mode = "inMemory"` the output is `Id = scanID`. However, in the latter case, the `controllerType` and `controllerNumber` are missing.
I need the `onDisk` option in order to process DIA data.
In my opinion, the `onDisk` and `inMemory` functions should be reviewed. Obviously, different outputs are generated. This should not happen; I would expect a correct round trip without unwanted manipulation.
In addition, the field with the `scanID` is `Scan = Integer`. I don't understand why this `integer` field is translated to `Id = scan=Integer` (e.g. `Id = scan=1`). For me, this is an unwanted manipulation.
> In my opinion, the `onDisk` and `inMemory` functions should be reviewed.

Thank you for your constructive feedback. If you have time to contribute, please open an issue in `MSnbase` so that we can discuss further.
As for the problem at hand, could you share one file so that we can inspect exactly what the input looks like, how the data is imported, and how it is eventually serialised to disk?
> if the `mode` is set to `onDisk`, but in the case of `mode = "inMemory"` the output is `Id = scanID`. However, in the latter case, the `controllerType` and `controllerNumber` are missing.

`inMem` and `onDisk` data are represented differently. `inMem` stores the data in `Spectrum` objects, while `onDisk` contains only the spectrum header (i.e. what is returned by `mzR::header`). Now, `Spectrum` objects don't have an ID field/attribute, which is why for `inMem` the ID of the spectrum has to be created artificially upon export.
The only solution I see here is that `inMem` (i.e. the `MSnExp`) also keeps the `mzR::header` in the `featureData` slot (same as the `onDisk` `OnDiskMSnExp` object) and that missing information not stored in the `Spectrum` objects (such as the spectrum ID) is extracted from this header data.frame upon export.
> I don't understand why this `integer` field is translated to `Id = scan=Integer` (e.g. `Id = scan=1`). For me, this is an unwanted manipulation.

Here we wanted to be compliant with proteowizard. pwiz creates scan IDs and extracts acquisition numbers from scan IDs using e.g.:
```cpp
PWIZ_API_DECL string translateScanNumberToNativeID(CVID nativeIdFormat, const string& scanNumber)
{
    switch (nativeIdFormat)
    {
        case MS_Thermo_nativeID_format:
            return "controllerType=0 controllerNumber=1 scan=" + scanNumber;
        case MS_spectrum_identifier_nativeID_format:
            return "spectrum=" + scanNumber;
        case MS_multiple_peak_list_nativeID_format:
            return "index=" + scanNumber;
        case MS_Agilent_MassHunter_nativeID_format:
            return "scanId=" + scanNumber;
        case MS_Bruker_Agilent_YEP_nativeID_format:
        case MS_Bruker_BAF_nativeID_format:
        case MS_scan_number_only_nativeID_format:
            return "scan=" + scanNumber;
        default:
            return "";
    }
}
```
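For the opposite direction (getting the acquisition number back out of such an ID), a minimal sketch could look like the one below. It is not the pwiz implementation; the function name `nativeIdValue` is hypothetical, and it assumes the ID is a space-separated list of `key=value` pairs, as in the Thermo example above.

```cpp
#include <sstream>
#include <string>

// Simplified sketch, not pwiz code: return the value stored under `key` in a
// space-separated "key=value" native ID, or "" if the key is not present.
std::string nativeIdValue(const std::string& nativeId, const std::string& key)
{
    std::istringstream tokens(nativeId);
    const std::string prefix = key + "=";
    std::string token;
    while (tokens >> token) {
        // Match only tokens that start with exactly "<key>=".
        if (token.compare(0, prefix.size(), prefix) == 0)
            return token.substr(prefix.size());
    }
    return "";
}
```

With this sketch, both `controllerType=0 controllerNumber=1 scan=42` and a bare `scan=42` give `42` for the key `scan`, so the two ID styles discussed in this thread would carry the same acquisition number even though the ID strings differ.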
Hi Steffen, is there any reason why the controllerType, controllerNumber and the scan# are merged into one ID by `writeMSData`?
Yours, Tobias
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] MSnbase_2.4.2 ProtGenerics_1.10.0 BiocParallel_1.12.0 mzR_2.12.0 Rcpp_0.12.16 Biobase_2.38.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] IRanges_2.12.0 zlibbioc_1.24.0 doParallel_1.0.11 munsell_0.4.3 colorspace_1.3-2 impute_1.52.0
[7] lattice_0.20-35 rlang_0.2.0 foreach_1.4.4 plyr_1.8.4 tools_3.4.4 mzID_1.16.0
[13] grid_3.4.4 gtable_0.2.0 affy_1.56.0 iterators_1.0.9 digest_0.6.15 lazyeval_0.2.1
[19] tibble_1.4.2 preprocessCore_1.40.0 affyio_1.48.0 ggplot2_2.2.1 S4Vectors_0.16.0 codetools_0.2-15
[25] MALDIquant_1.17 limma_3.34.9 BiocInstaller_1.28.0 compiler_3.4.4 pillar_1.2.1 pcaMethods_1.70.0
[31] scales_0.5.0 stats4_3.4.4 XML_3.98-1.10 vsn_3.46.0