Merging additional spectra variables

lgatto commented 3 years ago

Adding new spectra variables is surprisingly difficult. Here's a detailed and reproducible example of why it is difficult and what we need:

The test data

> library("magrittr")
> library("Spectra")
> ## Load the raw data
> ms <- Spectra(msdata::proteomics(pattern = "2014", full.names = TRUE)) %>%
+     filterMsLevel(2L) %>%
+     dropNaSpectraVariables()
> ms
MSn data (Spectra) with 6103 spectra in a MsBackendMzR backend:
       msLevel     rtime scanIndex
     <integer> <numeric> <integer>
1            2   202.354       227
2            2   203.069       228
3            2   203.491       229
4            2   203.960       230
5            2   204.492       231
...        ...       ...       ...
6099         2   3600.47      7530
6100         2   3600.83      7531
6101         2   3601.18      7532
6102         2   3601.57      7533
6103         2   3601.98      7534
 ... 28 more variables/columns.

file(s):
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz
Processing:
 Filter: select MS level(s) 2 [Wed Oct 21 09:56:29 2020] 
> spectraVariables(ms)
 [1] "msLevel"                 "rtime"                  
 [3] "acquisitionNum"          "scanIndex"              
 [5] "dataStorage"             "dataOrigin"             
 [7] "centroided"              "smoothed"               
 [9] "polarity"                "precScanNum"            
[11] "precursorMz"             "precursorIntensity"     
[13] "precursorCharge"         "collisionEnergy"        
[15] "isolationWindowLowerMz"  "isolationWindowTargetMz"
[17] "isolationWindowUpperMz"  "peaksCount"             
[19] "totIonCurrent"           "basePeakMZ"             
[21] "basePeakIntensity"       "ionisationEnergy"       
[23] "lowMZ"                   "highMZ"                 
[25] "injectionTime"           "filterString"           
[27] "spectrumId"              "scanWindowLowerLimit"   
[29] "scanWindowUpperLimit"

Easy case

It is straightforward to add a single new column of the right length.

> ms$foo <- rnorm(length(ms))
> spectraData(ms)[, c("rtime", "foo")]
DataFrame with 6103 rows and 2 columns
         rtime        foo
     <numeric>  <numeric>
1      202.354  0.0686001
2      203.069 -0.8235955
3      203.491 -1.5216832
4      203.960 -1.5233662
5      204.492 -0.6615719
...        ...        ...
6099   3600.47  -0.247136
6100   3600.83  -1.196412
6101   3601.18  -1.478410
6102   3601.57   0.974929
6103   3601.98  -0.629803

Less easy

We now have data for some rows only. Here, I need to initialise the column with the correct type and default value.

> foo2 <- rnorm(100)
> k <- sample(length(ms), 100)
> ms$foo2 <- NA_real_
> ms$foo2[k] <- foo2
> spectraData(ms)[k, c("rtime", "foo", "foo2")]
DataFrame with 100 rows and 3 columns
        rtime       foo      foo2
    <numeric> <numeric> <numeric>
1     1414.15 -0.313360  0.659918
2     3213.46  0.535001 -0.059523
3     3050.34 -0.576581 -0.909525
4     3071.25  1.802010 -1.251031
5     2872.13 -2.266166  1.044272
...       ...       ...       ...
96    3193.03 -0.844090  1.001866
97    1749.22 -0.944440 -0.521779
98    3444.82 -2.657677  0.270493
99    3420.76 -0.784582  2.230197
100   1657.72 -0.259139  0.840491

Not easy

Here's some real data to be merged:

> id <- MSnbase::readMzIdData(msdata::ident(full.names = TRUE)) %>%
+     dplyr::rename(spectrumId = spectrumID) %>%
+     DataFrame()
> id <- QFeatures::reduceDataFrame(id, id$spectrumId)
> id <- id[, c(1, 2, 22)]
> id
DataFrame with 5343 rows and 3 columns
                                                     sequence    spectrumId
                                              <CharacterList>   <character>
controllerType=0 controllerNumber=1 scan=1025       RKEQRLMVR controller...
controllerType=0 controllerNumber=1 scan=1027       RRQILMWRR controller...
controllerType=0 controllerNumber=1 scan=1029       RAKQMDRYR controller...
controllerType=0 controllerNumber=1 scan=1044   ERWVDIVNAD... controller...
controllerType=0 controllerNumber=1 scan=1045   FNLIEISNLA... controller...
...                                                       ...           ...
controllerType=0 controllerNumber=1 scan=985    IDHRSIRVQY... controller...
controllerType=0 controllerNumber=1 scan=986    THSQEEMQHM... controller...
controllerType=0 controllerNumber=1 scan=988    RNYKTNNLYL... controller...
controllerType=0 controllerNumber=1 scan=989    DWWQGIEQWR... controller...
controllerType=0 controllerNumber=1 scan=993    EYTVGYFLSC... controller...
                                              MS.GF.RawScore
                                                   <numeric>
controllerType=0 controllerNumber=1 scan=1025             21
controllerType=0 controllerNumber=1 scan=1027             23
controllerType=0 controllerNumber=1 scan=1029             29
controllerType=0 controllerNumber=1 scan=1044             -5
controllerType=0 controllerNumber=1 scan=1045             -2
...                                                      ...
controllerType=0 controllerNumber=1 scan=985              12
controllerType=0 controllerNumber=1 scan=986             141
controllerType=0 controllerNumber=1 scan=988              22
controllerType=0 controllerNumber=1 scan=989               2
controllerType=0 controllerNumber=1 scan=993              -6

Adding the spectrumId or MS.GF.RawScore variable is like the example above.
To add the sequence, I want to preserve the CharacterList class (relevant for later):

> k <- match(id$spectrumId, ms$spectrumId)
> ms$sequence2 <- List(as.list(rep(NA_character_, length(ms))))
> ms$sequence2[k] <- id$sequence
> >  spectraData(ms)[k, c("rtime", "sequence2")]
DataFrame with 5343 rows and 2 columns
         rtime       sequence2
     <numeric> <CharacterList>
1      1125.57       RKEQRLMVR
2      1126.24       RRQILMWRR
3      1126.97       RAKQMDRYR
4      1133.65   ERWVDIVNAD...
5      1134.32   FNLIEISNLA...
...        ...             ...
5339   1097.68   IDHRSIRVQY...
5340   1098.48   THSQEEMQHM...
5341   1100.11   RNYKTNNLYL...
5342   1100.92   DWWQGIEQWR...
5343   1104.14   EYTVGYFLSC...

Looks easy but...

Adding one column at a time isn't really an option, so let's try to merge tables. I'm going to use the merge (dplyr::left_join() would also apply).

> x <- merge(spectraData(ms), id, by = "spectrumId", all.x = TRUE)
> dim(x) 
[1] 6103   32
> length(ms)
[1] 6103
> x[k, c("rtime", names(id))]
DataFrame with 5343 rows and 4 columns
         rtime              sequence    spectrumId MS.GF.RawScore
     <numeric>                <list>   <character>      <numeric>
1      1227.76                    NA controller...             NA
2      1228.37 MPQLTYGKNR,EAYRIATIDR controller...             -5
3      1229.01               RYGWQLR controller...              3
4      1235.35                    NA controller...             NA
5      1235.64               KVEYLLR controller...            -23
...        ...                   ...           ...            ...
5339   1215.43               ERMQGRR controller...             27
5340   1215.73         YYFVFLFGLD... controller...              7
5341   1216.34             ARDFAIRRR controller...             31
5342   1216.66               MVLQSTR controller...             10
5343   1218.73            TREESREFAR controller...            -10

Even though this looks good, the type of the sequence variable has been changed from CharacterList to list, which isn't desirable.

spectraData(ms), id and the resulting x are all of class DFrame, and even though there is an dedicated merge method, is doesn't do what we need because it converts the DFrames to data.frames:

> showMethods("merge")
Function: merge (package base)
x="ANY", y="ANY"
x="data.frame", y="data.frame"
    (inherited from: x="ANY", y="ANY")
x="data.frame", y="DataTable"
x="DataTable", y="data.frame"
x="DataTable", y="DataTable"
x="DFrame", y="DFrame"
    (inherited from: x="DataTable", y="DataTable")
x="IntegerRangesList", y="IntegerRangesList"
x="IntegerRangesList", y="missing"
x="missing", y="IntegerRangesList"
x="missing", y="Seqinfo"
x="NULL", y="Seqinfo"
x="Seqinfo", y="missing"
x="Seqinfo", y="NULL"
x="Seqinfo", y="Seqinfo"
x="Vector", y="Vector"

> getMethod("merge", c("DataTable", "DataTable"))
Method Definition:

function (x, y, ...) 
{
    .local <- function (x, y, by, ...) 
    {
        if (is(by, "Hits")) {
            return(.mergeByHits(x, y, by, ...))
        }
        as(merge(as(x, "data.frame"), as(y, "data.frame"), by, 
            ...), class(x))
    }
    .local(x, y, ...)
}
<bytecode: 0x556dd0032ca8>
<environment: namespace:S4Vectors>

Signatures:
        x           y          
target  "DataTable" "DataTable"
defined "DataTable" "DataTable"

Update: we can coerce the list columns back to what we want with

> x$sequence <- as(x$sequence, "CharacterList")
> x[k, c("rtime", "sequence2", names(id))]
DataFrame with 5343 rows and 5 columns
         rtime             sequence2              sequence    spectrumId
     <numeric>                <list>       <CharacterList>   <character>
1      1227.76                    NA                    NA controller...
2      1228.37 MPQLTYGKNR,EAYRIATIDR MPQLTYGKNR,EAYRIATIDR controller...
3      1229.01               RYGWQLR               RYGWQLR controller...
4      1235.35                    NA                    NA controller...
5      1235.64               KVEYLLR               KVEYLLR controller...
...        ...                   ...                   ...           ...
5339   1215.43               ERMQGRR               ERMQGRR controller...
5340   1215.73         YYFVFLFGLD...         YYFVFLFGLD... controller...
5341   1216.34             ARDFAIRRR             ARDFAIRRR controller...
5342   1216.66               MVLQSTR               MVLQSTR controller...
5343   1218.73            TREESREFAR            TREESREFAR controller...
     MS.GF.RawScore
          <numeric>
1                NA
2                -5
3                 3
4                NA
5               -23
...             ...
5339             27
5340              7
5341             31
5342             10
5343            -10

Conclusions

We need a merge that preserves the column types (see the question on the bioc-devel list and this Github issue).
May be we could consider a joinSpectraData() function that updates the spectraData() and handles specific backends, rather than replying on manually `spectraData(x) <- x.

For point 1, may be first thing would be to ask on the Bioconductor devel list or slack, as I may be missing something.

jorainer commented 3 years ago

I personally like the joinSpectraData,Spectra as it tells exactly what we want to do - we want to add (columns) to the spectraData and leave the mz and intensity values untouched.

jorainer commented 3 years ago

And if possible, I would also implement that only for Spectra, and not in the backend. Implementing a backend from scratch turns out to be anyway quite extensive and I would not like to require even more functions to be implemented. The joinSpectraData,Spectra could e.g. use the $<- from the backend to add the columns.

lgatto commented 3 years ago

I would also implement that only for Spectra

I am not sure. I think we would need a default method for all backends that store their spectraData in memory (currently all but MsBackendSqlDb, but we will need backend-specific implementations when the spectraData lives on disk.

lgatto commented 3 years ago

A temporary solution is available in the joinSpectraData package.

jorainer commented 3 years ago

What I learned from implementing the MsBackendMassbankSql was that we have already a very large number of methods to implement (if the backend is not inheriting from e.g. MsBackendDataFrame). This was my reasoning why I would add new methods to the backend only as a last possibility. I thought that having already the $<-,MsBackend would be sufficient. It would not be ideal because one would have to add columns one by one, but it should work.

lgatto commented 3 years ago

I totally agree that priority should to implement methods at the Spectra.
Adding columns by columns isn't practical in practice. However, if I understand your suggestion, would be to implement to wrap the 'add one column at at time' approach in a joinSpectraData,Spectra,DataFrame method. This might actually be a good solution, given that we don't have a merge,DataFrame,DataFrame yet. I'll have a go at this.

lgatto commented 3 years ago

Not sure if a column by column would be feasible, given that initialising a new column

spectraData(x)[["foo"]] <- 1

is very slow.

On the other hand, a quick and dirty x@backend@spectraData[["bar"]] <- 1 is instantaneous :-)

Update Ok, what I think I need here is not $<-, but [[<-, so that I can use a string to refer to the variable name:

x[["foo"]]
x[["foo"]] <- 1

jorainer commented 3 years ago

Adding columns by columns isn't practical in practice. However, if I understand your suggestion, would be to implement to wrap the 'add one column at at time' approach in a joinSpectraData,Spectra,DataFrame method. This might actually be a good solution, given that we don't have a merge,DataFrame,DataFrame yet. I'll have a go at this.

That's exactly what I meant.

Yes, spectraData(x)[["foo"]] <- 1 will be slow because it replaces the whole spectra data.

Regarding the [[<-, I guess it should be possible to use the same code as $<- internally (or vice versa?). We could then have a default implementation for MsBackend so that this will not have to be implemented for all backends.

jorainer commented 3 years ago

Do you want to implement the [[<-,MsBackend or should I give it a try @lgatto ?

lgatto commented 3 years ago

The problem is that I can't use $<- programmatically, I need [[<-. I won't have time today, so feel tree to do it. The implementation of the two can be the same.

jorainer commented 3 years ago

OK, I'm on it. PR will follow soon.

lgatto commented 3 years ago

Here's the new implementation using [[<-,Spectra (assuming it will be added).

rformassspectrometry / Spectra