Reproduce the example from MSnbase vignette

plantton commented 3 years ago

When I'm using Spectra::joinSpectraData to reproduce the example from MSnbase vignette, I find the result is not the same as the original one.

Example from MSnbase vignette:

## find path to a mzXML file
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "mzXML$")
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.name = TRUE, pattern = "dummyiTRAQ.mzid")
## create basic MSnExp
msexp <- readMSData(quantFile, verbose = FALSE)
head(fData(msexp), n = 2)

msexp <- addIdentificationData(msexp, id = identFile)
head(fData(msexp), n = 2)

To reproduce the example by Spectra:

iddf <- readMzIdData(identFile)
iddf <- DataFrame(iddf)
iddf[ , 1:2]

DataFrame with 4 rows and 2 columns
           sequence  spectrumID
        <character> <character>
1     IDGQWVTHQWLKK      scan=2
2 VESITARHGEVLQLRPK      scan=1
3 IKPQAVIETLHRLTEGK      scan=1
4           LVILLFR      scan=5

Then use mzR() backend to load quantFile:

sps_ms <- Spectra(quantFile, backend = MsBackendMzR())
sps_ms$spectrumId

[1] "controllerType=0 controllerNumber=1 scan=1" "controllerType=0 controllerNumber=1 scan=2" "controllerType=0 controllerNumber=1 scan=3"
[4] "controllerType=0 controllerNumber=1 scan=4" "controllerType=0 controllerNumber=1 scan=5"

We use spectrumID to match identification DFrame to sps_ms, hence I changed the spectrumID in iddf:

iddf$spectrumID
[1] "scan=2" "scan=1" "scan=1" "scan=5"
iddf$spectrumID <- c("controllerType=0 controllerNumber=1 scan=2", 
                     "controllerType=0 controllerNumber=1 scan=1",
                     "controllerType=0 controllerNumber=1 scan=1",
                     "controllerType=0 controllerNumber=1 scan=5")

Now use joinSpectraData to add identification data to raw data:

sps_ms_joined <- joinSpectraData(sps_ms, iddf, 
                                 by.x = "spectrumId", by.y = "spectrumID")
fData(msexp)[ , c("spectrum", "sequence")]
      spectrum          sequence
F1.S1        1 VESITARHGEVLQLRPK
F1.S2        2     IDGQWVTHQWLKK
F1.S3        3              <NA>
F1.S4        4              <NA>
F1.S5        5           LVILLFR

spectraData(sps_ms_joined)$spectrumId
[1] "controllerType=0 controllerNumber=1 scan=1" "controllerType=0 controllerNumber=1 scan=2" "controllerType=0 controllerNumber=1 scan=3"
[4] "controllerType=0 controllerNumber=1 scan=4" "controllerType=0 controllerNumber=1 scan=5"

spectraData(sps_ms_joined)$sequence
[1] "IKPQAVIETLHRLTEGK" "IDGQWVTHQWLKK"     NA                  NA                  "LVILLFR"

The matched sequences for spectrum 1/scan = 1, are different in the two results. Am I wrong by using joinSpectraData here?

lgatto commented 3 years ago

Here is what I have. Slightly different approach than yours, which looks correct:

Load the data

> suppressPackageStartupMessages({
+ library(Spectra)
+ library(PSM)
+ })
> quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
+                 full.name = TRUE, pattern = "mzXML$")
> identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
+                 full.name = TRUE, pattern = "dummyiTRAQ.mzid")
> sp <- Spectra(quantFile, backend = MsBackendDataFrame())
> id <- filterPSMs(readPSMs(identFile))

Make sure we can match id and scans

> sp$spectrumId
[1] "controllerType=0 controllerNumber=1 scan=1"
[2] "controllerType=0 controllerNumber=1 scan=2"
[3] "controllerType=0 controllerNumber=1 scan=3"
[4] "controllerType=0 controllerNumber=1 scan=4"
[5] "controllerType=0 controllerNumber=1 scan=5"
> sp$spectrumId <- sub("^.+=1 ", "", sp$spectrumId)
> sp$spectrumId
[1] "scan=1" "scan=2" "scan=3" "scan=4" "scan=5"

Join and check:

> sp2 <- joinSpectraData(sp, id, by.y = "spectrumID")
> spectraData(sp2)[, c("sequence", "spectrumId")]
DataFrame with 5 rows and 2 columns
           sequence  spectrumId
        <character> <character>
1 VESITARHGEVLQLRPK      scan=1
2     IDGQWVTHQWLKK      scan=2
3                NA      scan=3
4                NA      scan=4
5           LVILLFR      scan=5
> id[, c("sequence", "spectrumID", "rank")]
DataFrame with 3 rows and 3 columns
           sequence  spectrumID      rank
        <character> <character> <integer>
1     IDGQWVTHQWLKK      scan=2         1
2 VESITARHGEVLQLRPK      scan=1         1
3           LVILLFR      scan=5         1

plantton commented 3 years ago

I think the difference is that you filtered the mzid file by filterPSMs firstly, then use the DFrame without duplicates for joinSpectraData. But in the vignette of MSnbase, addIdentificationData accepts the indentification data with duplicates.

So in the new join method, we are assumed to filter identification data before join it to Spectra object.

lgatto commented 3 years ago

MSnbase is a different package. Please refer to he joinSpectraData from the Spectra package.

The joinSpectraData is a general function, and it's the user that chooses what to join.

rformassspectrometry / Spectra

Reproduce the example from MSnbase vignette #172