lgatto / msidmatching

Sandbox for matching ids between raw and identification data

How does acquisitionNum get determined in mzR #2

Open thomasp85 opened 10 years ago

thomasp85 commented 10 years ago

Laurent, do you know what governs the choice of number for the acquisitionNum in the mzR header()?

As some of the native ID formats contain multiple integer values, this is important to ensure correct indexing. Examples from the psi-ms.obo:

A complicated case:

id: MS:1000770
name: WIFF nativeID format
def: "sample=xsd:nonNegativeInteger period=xsd:nonNegativeInteger cycle=xsd:nonNegativeInteger experiment=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

A simple case:

id: MS:1000772
name: Bruker BAF nativeID format
def: "scan=xsd:nonNegativeInteger." [PSI:MS]
is_a: MS:1000767 ! native spectrum identifier format

If we can get the mzR mapping, I think a simple lookup table would do the trick, with the names of the ID formats as keys and, as values, regular expressions that extract the correct number from the spectrumID column in the mzIDpsm class.

Then we could have:

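library(stringr)
# `lookup` is assumed to be a named character vector: the names are
# nativeID format names, the values are regular expressions.
getConverter <- function(nativeID) {
 if (nativeID %in% names(lookup)) {
  # [[ ]] drops the name so that a plain pattern string is stored
  regexp <- lookup[[nativeID]]
 } else {
  # unknown format: treat the argument as a user-supplied regular expression
  regexp <- nativeID
 }
 # return a closure that extracts the number from a spectrumID
 function(spectrumID) {
  str_extract(spectrumID, regexp)
 }
}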

which would easily allow us to add to the known cases, and let users specify their own regular expressions if they are working with an esoteric MS data format.
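
A minimal sketch of how such a lookup table and converter might be used. The regular expressions and the entries in `lookup` are assumptions made for illustration, not a confirmed mapping, and base R's `regexpr()`/`regmatches()` stand in for `str_extract()` so that the example runs without extra packages.

```r
# Hypothetical lookup table: nativeID format name -> regular expression.
# The patterns are illustrative assumptions, not taken from mzR or mzID.
lookup <- c(
  "Thermo nativeID format"     = "(?<=scan=)[0-9]+",
  "Bruker BAF nativeID format" = "(?<=scan=)[0-9]+"
)

getConverter <- function(nativeID) {
  # Known format name: use the stored pattern; otherwise treat the
  # argument itself as a user-supplied regular expression.
  regexp <- if (nativeID %in% names(lookup)) lookup[[nativeID]] else nativeID
  function(spectrumID) {
    m <- regexpr(regexp, spectrumID, perl = TRUE)
    hit <- !is.na(m) & m > 0
    out <- rep(NA_character_, length(spectrumID))
    out[hit] <- regmatches(spectrumID, m)  # NA where nothing matched
    out
  }
}

bafConverter <- getConverter("Bruker BAF nativeID format")
bafConverter("scan=42")

# A user-supplied pattern for a format that is not in the table
lastNumber <- getConverter("[0-9]+$")
lastNumber("sample=1 period=1 cycle=5 experiment=2")
```

Returning a closure keeps the pattern lookup out of the per-spectrum call, so the converter can be built once per file and applied to the whole spectrumID column.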

lgatto commented 10 years ago

No, I don't know. I have never used a WIFF file, so I'm not even sure what the WIFF nativeIDs would look like once converted into mzML and read into R. Do you have a WIFF file at hand? I am happy to convert it and give it a go.

Yes, the lookup table approach seems a good way forward.

Laurent

thomasp85 commented 10 years ago

No, I don't have any of the files. The examples were just picked semi-randomly to display the difficulties. Furthermore, I don't think we should schedule an investigation of all possible MS data formats. For this to be viable we need to get the info from the source code of the parser. As far as I remember it still uses RAMP?

lgatto commented 10 years ago

Yes, indeed.

thomasp85 commented 10 years ago

Do you have contact with some of the SPC folks who might know the inner workings of RAMP, or do we need to dive into the TPP source code?

lgatto commented 10 years ago

See https://github.com/sneumann/mzR/tree/master/src

thomasp85 commented 10 years ago

Yeah I realized it must be in the mzR source right after I hit the send button… I’ll go diggin’ tomorrow

sgibb commented 10 years ago

Hello Thomas, hello Laurent,

I am not quite sure whether I have nailed it down completely, but it seems to be safe to ignore the vendor-specific nativeIDs and to match the acquisitionNum against the last number present in the mzID spectrumID.

RAMP ignores the rest of the nativeID (e.g. controllerType=0, controllerNumber=1, scan=xsd:positiveInteger for Thermo) and simply uses the scan number: mzR/src/pwiz/data/msdata/RAMPAdapter.cpp, ll. 141-211

void RAMPAdapter::Impl::getScanHeader(size_t index, ScanHeaderStruct& result, bool reservePeaks /*= true*/) const
{
    // ...
    result.acquisitionNum = getScanNumber(index); 
    // ...
}

mzR/src/pwiz/data/msdata/RAMPAdapter.cpp, ll. 126-138

int RAMPAdapter::Impl::getScanNumber(size_t index) const
{
    const SpectrumIdentity& si = msd_.run.spectrumListPtr->spectrumIdentity(index);
    string scanNumber = id::translateNativeIDToScanNumber(nativeIdFormat_, si.id);

    if (scanNumber.empty()) // unsupported nativeID type
    {
        // assume scanNumber is a 1-based index, consistent with this->index() method
        return static_cast<int>(index) + 1;
    } 
    else
        return lexical_cast<int>(scanNumber);
}

mzR/src/pwiz/data/msdata/MSData.cpp, ll. 552-580

PWIZ_API_DECL string translateNativeIDToScanNumber(CVID nativeIdFormat, const string& id)
{
    switch (nativeIdFormat)
    {
        case MS_spectrum_identifier_nativeID_format: // mzData
            return value(id, "spectrum");

        case MS_multiple_peak_list_nativeID_format: // MGF
            return value(id, "index");

        case MS_Agilent_MassHunter_nativeID_format:
            return value(id, "scanId");

        case MS_Thermo_nativeID_format:
            // conversion from Thermo nativeIDs assumes default controller information
            if (id.find("controllerType=0 controllerNumber=1") != 0)
                return "";

            // fall through to get scan

        case MS_Bruker_Agilent_YEP_nativeID_format:
        case MS_Bruker_BAF_nativeID_format:
        case MS_scan_number_only_nativeID_format:
            return value(id, "scan");

        default:
            return "";
    }
}

It seems that MS:1000770 (WIFF), MS:1000773 (Bruker FID) and MS:1000775 (single peak list) mentioned in section 5.1.3 Use of identifiers for input spectra to a search in the mzIdentML Specification Document are not supported yet (by PWIZ/RAMP).
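
To make the behaviour above easier to try out, here is a rough R sketch of the same logic. It is an illustrative re-implementation, not part of mzR or ProteoWizard: the function names and the short format labels (`"Thermo"`, `"MGF"`, ...) are made up for this example, and only the key names and the fallback to a 1-based index are taken from the C++ code quoted above.

```r
# Extract the value of `key=value` from nativeID strings ("" if the key is absent).
nativeIdValue <- function(id, key) {
  pattern <- paste0("(?:^|\\s)", key, "=(\\S+)")
  m <- regmatches(id, regexec(pattern, id, perl = TRUE))
  vapply(m, function(x) if (length(x) == 2L) x[2L] else "", character(1))
}

# Mirror of the switch statement; `format` is a hypothetical short label.
translateToScanNumber <- function(id, format) {
  switch(format,
    "mzData"             = nativeIdValue(id, "spectrum"),
    "MGF"                = nativeIdValue(id, "index"),
    "Agilent MassHunter" = nativeIdValue(id, "scanId"),
    # Thermo: only the default controller information is translated
    "Thermo"             = ifelse(startsWith(id, "controllerType=0 controllerNumber=1"),
                                  nativeIdValue(id, "scan"), ""),
    "Bruker/Agilent YEP" = ,
    "Bruker BAF"         = ,
    "scan number only"   = nativeIdValue(id, "scan"),
    rep("", length(id))  # unsupported format, e.g. WIFF
  )
}

# Unsupported or untranslatable IDs fall back to the 1-based spectrum index.
acquisitionNumFromNativeId <- function(id, format) {
  scan <- translateToScanNumber(id, format)
  ifelse(nzchar(scan), as.integer(scan), seq_along(id))
}

thermoIds <- c("controllerType=0 controllerNumber=1 scan=7",
               "controllerType=1 controllerNumber=1 scan=9")
acquisitionNumFromNativeId(thermoIds, "Thermo")
acquisitionNumFromNativeId("sample=1 period=1 cycle=5 experiment=2", "WIFF")
```

The second Thermo ID uses a non-default controller, so it is not translated and receives its position in the vector instead, which is the case where a naive "last number" match could disagree with acquisitionNum.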

IMHO something like this should be sufficient (at least for all ID formats supported by PWIZ/RAMP):

acquisitionNum <- header(mzML)$acquisitionNum
mzIdScanNum <- as.numeric(sub("^.*=([[:digit:]]+)$", "\\1",
                          flattenMzId$spectrumID))
m <- match(acquisitionNum, mzIdScanNum)
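
A small worked example of this matching on toy data. The vectors below are invented stand-ins for `header(mzML)$acquisitionNum` and `flattenMzId$spectrumID`; the extra `grepl()` guard is an addition for this sketch, since `sub()` returns its input unchanged when the pattern does not match.

```r
# Toy stand-in for header(mzML)$acquisitionNum
acquisitionNum <- c(1L, 2L, 3L, 4L)

# Toy stand-in for flattenMzId$spectrumID; the last entry has no trailing number
spectrumID <- c("controllerType=0 controllerNumber=1 scan=3",
                "controllerType=0 controllerNumber=1 scan=1",
                "index=7 junk")

# Only convert IDs that really end in "=<digits>"; leave the others as NA
hasNum <- grepl("=[[:digit:]]+$", spectrumID)
mzIdScanNum <- rep(NA_integer_, length(spectrumID))
mzIdScanNum[hasNum] <- as.integer(sub("^.*=([[:digit:]]+)$", "\\1",
                                      spectrumID[hasNum]))

# For each spectrum: the row of the identification table, or NA if unidentified
m <- match(acquisitionNum, mzIdScanNum)
m
```

Here `m` has one entry per spectrum, so it can be used directly to index columns of the identification table.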

I will try to implement a prototype in the next few days. Any comments?

Best wishes,

Sebastian

thomasp85 commented 10 years ago

Hi Sebastian

This is great news as it greatly reduces the complexity of the problem - furthermore I have implemented something similar in MSGFgui as a placeholder (albeit with a different regex), so it’s nice to know that it should be pretty stable.

I still believe this should be put in a new package that handles mzR/mzID interfacing and common operations, so that mzR and mzID stay focused on parsing raw data…

Thanks for looking into it!

best

Thomas

lgatto commented 10 years ago

Dear Thomas,

Could you clarify what your aims are with an mzID/mzR interface package? If it is just that one function, I think it might be a bit light. You probably have other plans.

Laurent

thomasp85 commented 10 years ago

Certainly : )

My idea is to expose a single high-level object that takes care of communicating between mzR and mzID objects and contains proteomics-related methods that are useful for evaluating proteomics experiments, or for extending when building new proteomics packages.

Relevant methods include several plots, such as annotated MS2 spectra, parent ion EIC etc., as well as summary functions, getters and filters that incorporate information from both raw data and identification data.

This would potentially be a rather lightweight package, but I don’t see this as a problem - I see a lot of benefits in the future for this kind of class, as the two data types often go hand in hand…

best

Thomas

lgatto commented 10 years ago

That's pretty much what MSnbase already does; it is just that the link with the identification data was not straightforward. But it will be now.

Laurent

thomasp85 commented 10 years ago

Well, then there's less work : ) I’ll begin contributing there then…

best

Thomas

thomasp85 commented 10 years ago

The reason I didn’t think of this is that when I first read about MSnbase it was described as a package for labelled proteomics, which I don’t do - has the scope of the package moved beyond that since its release?

sgibb commented 10 years ago

In my opinion it would be best to "translate" the nativeIDs into acquisitionNum in the mzID package. Matching spectra and identification information would then be very easy, e.g. m <- match(header(mzML)$acquisitionNum, flattenMzId$spectrumID) would be enough.

Maybe it would also be good to rename the spectrumID column to acquisitionNum (maybe only if the translation was done), so that users are not confused by different names/IDs.

Please see also my PR: https://github.com/thomasp85/mzID/pull/17
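
As an illustration of what that one-line match would enable, here is a sketch on toy data. The data frames and the `pepSeq` column name are hypothetical stand-ins for `header(mzML)` and a flattened mzID table whose spectrumID has already been translated to a number.

```r
# Toy stand-in for header(mzML); only two columns are shown
hdr <- data.frame(acquisitionNum = 1:4, msLevel = c(1L, 2L, 2L, 2L))

# Toy stand-in for a flattened mzID table with translated spectrumID
ids <- data.frame(spectrumID = c(3L, 2L),
                  pepSeq = c("PEPTIDEK", "SAMPLER"),
                  stringsAsFactors = FALSE)

# One row index per spectrum, NA where there is no identification
m <- match(hdr$acquisitionNum, ids$spectrumID)

# Attach the identification to the spectrum header
hdr$pepSeq <- ids$pepSeq[m]
hdr
```

Spectra without an identification simply get NA in the added column, so no rows of the header are lost.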

thomasp85 commented 10 years ago

That's also a possibility, though in that case I would just add an additional column and keep spectrumID as is…

sgibb commented 10 years ago

I think there is no need to add a new column. Nobody who uses mzR and mzID together is interested in the nativeIDs (and anyone who is can use translateNativeIDs=FALSE). IMHO keeping them is just a waste of memory (ok, it is only about 1 Mb, but size sometimes matters :wink:):

fid <- flatten(mzID("Thermo_Hela_PRTC_1_MS2cent.mzid", translateNativeIDs=FALSE, verbose=FALSE))
fid2 <- flatten(mzID("Thermo_Hela_PRTC_1_MS2cent.mzid", translateNativeIDs=TRUE, verbose=FALSE))
print(object.size(fid$spectrumid), units="Mb")
## 1.2 Mb
print(object.size(fid2$spectrumid), units="Mb")
## 0.1 Mb

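
The size difference can be reproduced without the original .mzid file. The sketch below uses synthetic Thermo-style IDs (an assumption made for illustration, not the data from the file above) and compares a character vector of nativeIDs with the integer scan numbers extracted from it; the exact sizes depend on the number of IDs and on the platform.

```r
# Synthetic nativeIDs standing in for an untranslated spectrumID column
nativeIds <- paste0("controllerType=0 controllerNumber=1 scan=", seq_len(10000))

# Translated version: keep only the trailing scan number as an integer
scanNums <- as.integer(sub("^.*=([[:digit:]]+)$", "\\1", nativeIds))

print(object.size(nativeIds), units = "Kb")
print(object.size(scanNums), units = "Kb")
```
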
thomasp85 commented 10 years ago

It is mainly from the viewpoint that a parser should not change or remove the data it parses. Per my reply to your PR, I think the number should be calculated at parsing time, so that people are not limited to using flatten() for this feature, and in that case removing spectrumID would change the parsed data… Does this make sense?
