sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)

sampleInfo concatenation and userParam #200

Open stanstrup opened 5 years ago

stanstrup commented 5 years ago

Hello,

I was wondering if there is a way to extract metadata more directly, or to extract more detailed information.

As a test I added the following to a file:

    <sampleList count="1">
      <sample id="org_filename.raw" name="Important sample">
        <userParam name="Job Code" value="Some project"/>
      </sample>
    </sampleList>

sampleInfo(mz) returns "Important sampleorg_filename.raw", so it seems id and name were concatenated without a separator.

So 1) is there a way to access these fields individually? 2) Would it be possible to add a separator? 3) Wouldn't the more natural order also be id+name and not name+id? 4) Then, what about the userParam? Is there a way to access that? 5) Is it possible to inject metadata (e.g. sampleInfo(mz) <- list(sample="something else"), or with writeMSData)? I guess I could lose more than I gain, though, since not all metadata is transferred (https://github.com/sneumann/mzR/issues/159).

Related to actually writing the info I need with Proteowizard: https://github.com/ProteoWizard/pwiz/issues/568#issue-454695861

sneumann commented 5 years ago

Hi, let's start with a few pointers. Here is the header file and the C++ structure where the information is stored in pwiz: https://github.com/sneumann/mzR/blob/beb109476546d58a903eacd3d263e07f94d35a58/src/pwiz/data/msdata/MSData.hpp#L108 Here is the XML parsing of that element in pwiz: https://github.com/sneumann/mzR/blob/beb109476546d58a903eacd3d263e07f94d35a58/src/pwiz/data/msdata/IO.cpp#L568 And, most relevant in this context, here is where mzR parses this back out: https://github.com/sneumann/mzR/blob/753bf97e5eb20854156740ba5a06f3ac00754c96/src/RcppPwiz.cpp#L140

The Sample is a ParamType as declared here: https://github.com/sneumann/mzR/blob/beb109476546d58a903eacd3d263e07f94d35a58/src/pwiz/data/common/ParamTypes.hpp#L244 so I expect the <userParam> will get read by pwiz, and could be extracted via userParam("Job Code") and then getting ->value() in https://github.com/sneumann/mzR/blob/beb109476546d58a903eacd3d263e07f94d35a58/src/pwiz/data/common/ParamTypes.hpp#L279 Currently, there is no code to do that in mzR. It might look similar to https://github.com/sneumann/mzR/blob/753bf97e5eb20854156740ba5a06f3ac00754c96/src/RcppPwiz.cpp#L229 Yours, Steffen

lgatto commented 5 years ago

Alternatively, this could be done in R with XML or xml2. That's my quick and dirty hack when a CV param isn't returned by default in mzR.

stanstrup commented 5 years ago

Yeah. I was thinking the same. Do you have a trick for only reading the header of the file and still getting valid XML?

lgatto commented 5 years ago

No trick in my hat, I'm afraid.

sneumann commented 5 years ago

But even if inefficient, we could have a function that basically does some XPath retrieval. Plus manpage with examples. Yours, Steffen
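As a minimal sketch of what such an XPath-based retrieval function could look like with xml2: the helper name mzml_xpath_attr is hypothetical and not part of mzR, and the inline document is a stripped-down stand-in for a real mzML header. It assumes the file declares the mzML default namespace, which xml2 exposes under the prefix d1.

```r
library(xml2)

# Hypothetical helper (not part of mzR): return one attribute from every
# node matched by an XPath expression.
mzml_xpath_attr <- function(doc, xpath, attr) {
  # xml2 maps the document's default namespace to the prefix "d1"
  nodes <- xml_find_all(doc, xpath, ns = xml_ns(doc))
  xml_attr(nodes, attr)
}

# Stripped-down stand-in for an mzML header (assumption: real files carry
# the same default namespace on the <mzML> element)
example <- read_xml('<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <sampleList count="1">
    <sample id="org_filename.raw" name="Important sample">
      <userParam name="Job Code" value="Some project"/>
    </sample>
  </sampleList>
</mzML>')

sample_ids   <- mzml_xpath_attr(example, "//d1:sample", "id")
sample_names <- mzml_xpath_attr(example, "//d1:sample", "name")
job_codes    <- mzml_xpath_attr(
  example, "//d1:sample/d1:userParam[@name='Job Code']", "value"
)
```

Returning id and name as separate vectors would also sidestep the missing separator in the concatenated sampleInfo() string.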

stanstrup commented 5 years ago

I made a slightly more challenging example to make the solution more robust.

    <sampleList count="2">
      <sample id="org_filename.raw" name="Important sample">
        <userParam name="Job Code" value="Some project"/>
        <userParam name="Other thing" value="Other value"/>
      </sample>
      <sample id="org_filename2.raw" name="Important sample2">
        <userParam name="Job Code" value="Some project2"/>
        <userParam name="Other thing" value="Other value"/>
      </sample>
    </sampleList>

I can get what I want with:

    library(xml2)
    library(dplyr)
    library(purrr)

    # `file` is the path to the mzML file
    data <- read_xml(file)

    # name attribute of every <sample>
    data %>%
      xml_child("d1:mzML/d1:sampleList") %>%
      xml_find_all("d1:sample") %>%
      map(xml_attr, "name") %>%
      unlist()
    #> [1] "Important sample"  "Important sample2"

    # value of the "Job Code" userParam of every <sample>
    data %>%
      xml_child("d1:mzML/d1:sampleList") %>%
      xml_find_all("d1:sample") %>%
      map(xml_child, "d1:userParam[@name='Job Code']") %>%
      map(xml_attr, "value") %>%
      unlist()
    #> [1] "Some project"  "Some project2"

It is not as slow as I had thought. It takes just 0.2 sec. I guess it would still be nice to have some generic way to access the complete metadata, though.
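One way a more generic accessor might look, as a rough sketch only: collect every userParam of every sample into a long-format data frame. The function name sample_metadata is made up for illustration, the inline document stands in for a real file, and it again assumes the mzML default namespace is available under the xml2 prefix d1.

```r
library(xml2)

# Hypothetical helper: one row per <userParam>, tagged with its parent sample
sample_metadata <- function(doc) {
  ns <- xml_ns(doc)
  samples <- xml_find_all(doc, "//d1:sample", ns = ns)
  rows <- lapply(samples, function(s) {
    params <- xml_find_all(s, "./d1:userParam", ns = ns)
    data.frame(
      sample_id   = rep(xml_attr(s, "id"), length(params)),
      sample_name = rep(xml_attr(s, "name"), length(params)),
      param       = xml_attr(params, "name"),
      value       = xml_attr(params, "value"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Stand-in for the two-sample test file above
doc <- read_xml('<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <sampleList count="2">
    <sample id="org_filename.raw" name="Important sample">
      <userParam name="Job Code" value="Some project"/>
      <userParam name="Other thing" value="Other value"/>
    </sample>
    <sample id="org_filename2.raw" name="Important sample2">
      <userParam name="Job Code" value="Some project2"/>
      <userParam name="Other thing" value="Other value"/>
    </sample>
  </sampleList>
</mzML>')

meta <- sample_metadata(doc)
```

The long format keeps the helper independent of which userParam names happen to be present; filtering on meta$param then recovers a single field such as "Job Code".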