Store sample annotation to the database

jorainer commented 11 months ago

Maybe also store sample annotations into the database, such that an MsExperiment could be directly loaded from the database.

The question, however, is (a) whether that should go in here or into the MsExperiment package, and (b) whether only sample annotations should be stored or also the linkage to the spectra.

Maybe implement an additional MsExperimentOfflineSql class that extends MsBackendSql but in addition provides the sample annotations? Or is that overengineering?

jorainer commented 11 months ago

The easiest solution would be a helper function that writes sample data to an existing MsBackendSql database, plus a function that loads an MsExperiment from such an (extended) database.

lgatto commented 11 months ago

Not sure if sample annotation requires an on-disk solution - that sounds over-kill, IMHO.
Sample annotations aren't supposed to come with the Spectra data, so I would prefer not to have an exception for the SQL back-end.

... I don't get why sample annotations should be part of MsBackendSql.

jorainer commented 11 months ago

totally agree - I'm just playing a bit with that idea.

Q: Why am I doing that? A: I now have some of our data sets as SQL databases, which is nice, but in addition I always need a separate file (e.g. an xls sheet) with the sample annotations. I then need to load that file in addition to connecting to the database to get my MsExperiment for the data analysis. I would find it much more convenient if I could get everything from the database directly (this would not be an on-disk sample annotation solution; it would just read a database table and put that into the @sampleData of the MsExperiment). The advantage would be to have all information from an experiment in a single place.

Let me maybe play a bit with that and then we can make a dev call to discuss.

jorainer commented 10 months ago

Here comes some description/use case that we could maybe discuss in a dev call:

Support for extraction of full experiment from database

Currently, readMsExperiment is designed to create a MsExperiment with MS data read/imported from mzML files, i.e. the use of a MsBackendMzR backend. Creating a MsExperiment with another backend is not easily possible.

What I would suggest is to make readMsExperiment a generic that dispatches on the parameter spectraBackend. The default implementation would use MsBackendMzR, but this would allow implementations for other backends too (such as MsBackendSql):

#' Generic should go to ProtGenerics
setGeneric("readMsExperiment", function(spectraBackend, ...)
    standardGeneric("readMsExperiment"))

#' These are the "default" implementations and should go to MsExperiment.
setMethod("readMsExperiment", "missing",
          function(spectraBackend, spectraFiles = character(),
                   sampleData = data.frame(), ...) {
              MsExperiment::readMsExperiment(spectraFiles = spectraFiles,
                                             sampleData = sampleData)
          })
setMethod("readMsExperiment", "character", function(spectraBackend, ...) {
    MsExperiment::readMsExperiment(spectraFiles = spectraBackend, ...)
})
setMethod("readMsExperiment", "MsBackend",
          function(spectraBackend, spectraFiles = character(),
                   sampleData = data.frame(), ...) {
              MsExperiment::readMsExperiment(spectraFiles = spectraFiles,
                                             sampleData = sampleData)
          })
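
To illustrate the dispatch idea in isolation, here is a minimal, self-contained sketch using only the methods package. The classes ToyBackend, ToyFileBackend and ToySqlBackend and the generic readToyExperiment are made up for this example; they stand in for the real backend classes and are not part of any package.

```r
library(methods)

## Toy classes standing in for real backends (hypothetical names).
setClass("ToyBackend", representation("VIRTUAL"))
setClass("ToyFileBackend", contains = "ToyBackend")
setClass("ToySqlBackend", contains = "ToyBackend")

## The generic dispatches on its first argument only.
setGeneric("readToyExperiment", function(spectraBackend, ...)
    standardGeneric("readToyExperiment"))

## No backend given: use the default import.
setMethod("readToyExperiment", "missing", function(spectraBackend, ...) {
    "default: file-based import"
})
## A character vector is interpreted as file names.
setMethod("readToyExperiment", "character", function(spectraBackend, ...) {
    paste("files:", length(spectraBackend))
})
## Fallback for any backend without a dedicated method.
setMethod("readToyExperiment", "ToyBackend", function(spectraBackend, ...) {
    "generic backend import"
})
## A more specific method takes precedence over the fallback.
setMethod("readToyExperiment", "ToySqlBackend", function(spectraBackend, ...) {
    "SQL-specific import"
})

res_missing <- readToyExperiment()
res_files <- readToyExperiment(c("a.mzML", "b.mzML"))
res_file_be <- readToyExperiment(new("ToyFileBackend"))
res_sql_be <- readToyExperiment(new("ToySqlBackend"))
```

A call that only uses named arguments other than spectraBackend is dispatched to the "missing" method, which is what would keep existing calls of the form readMsExperiment(spectraFiles = ..., sampleData = ...) working.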

There could then be specific implementations for certain MsBackend classes (defined in the respective R package) that would simplify reading MS experiment data. An implementation for MsBackendSql is shown below. Different use cases for that function are shown further below.

setMethod("readMsExperiment", "MsBackendSql",
          function(spectraBackend, spectraFiles = character(),
                   sampleData = data.frame(), ...) {
              ## Initialize the backend - should throw an error if not all
              ## required information is provided.
              be <- backendInitialize(spectraBackend, ...)
              map <- matrix(nrow = 0, ncol = 2)
              if (!(is.data.frame(sampleData) ||
                    inherits(sampleData, "DataFrame")))
                  stop("'sampleData' is expected to be a 'data.frame' ",
                       "or 'DataFrame'")
              if (length(spectraFiles) || nrow(sampleData)) {
                  ## Link samples to spectra using provided spectraFiles
                  ## and dataOrigin from the database.
                  if (length(spectraFiles) != nrow(sampleData))
                      stop("If provided, length of 'spectraFiles' needs to ",
                           "match the number of rows of 'sampleData'.")
                  map <- findMatches(basename(spectraFiles),
                                     basename(dataOrigin(be)))
                  map <- cbind(from(map), to(map))
              } else {
                  con <- dbconn(be)
                  if (inherits(be, "MsBackendOfflineSql"))
                      on.exit(dbDisconnect(con))
                  if (.db_contains_sample_data(con)) {
                      sampleData <- dbGetQuery(con, "select * from sample_data")
                      map <- unname(as.matrix(dbGetQuery(
                          con, "select * from sample_to_msms_spectrum")))
                  }
              }
              res <- MsExperiment::MsExperiment()
              res@spectra <- Spectra(be)
              res@sampleData <- as(sampleData, "DataFrame")
              if (nrow(map) > 0) {
                  res@sampleDataLinks[["spectra"]] <- map
                  mcols(res@sampleDataLinks)["spectra", "subsetBy"] <- 1L
              } else
                  warning("Could not derive mapping between samples and ",
                          "spectra. Please use 'linkSampleData' to establish ",
                          "that mapping.")
              res
          })
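
The sample-to-spectra mapping used above is a two-column index matrix: the first column is the row in sampleData, the second the index of the spectrum. As a rough base R sketch of the same idea (link_by_basename is a hypothetical helper and the file paths are invented; the method above uses findMatches instead):

```r
## Hypothetical helper: link samples to spectra by comparing file base names.
link_by_basename <- function(spectra_files, data_origin) {
    from <- basename(spectra_files)   # one entry per sample
    to <- basename(data_origin)       # one entry per spectrum
    hits <- lapply(seq_along(from), function(i) {
        idx <- which(to == from[i])
        cbind(rep(i, length(idx)), idx)
    })
    map <- do.call(rbind, hits)
    ## No samples at all: return an empty two-column matrix.
    if (is.null(map))
        matrix(integer(), nrow = 0, ncol = 2)
    else unname(map)
}

## Invented example data: 2 samples, 4 spectra (one from an unknown file).
spectra_files <- c("/data/raw/MM8.mzML", "/data/raw/MM14.mzML")
data_origin <- c("/db/src/MM14.mzML", "/db/src/MM8.mzML",
                 "/db/src/MM8.mzML", "/db/src/other.mzML")
map <- link_by_basename(spectra_files, data_origin)
map
```

Spectra whose origin file matches no sample (the fourth spectrum here) simply do not appear in the mapping.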

Preparing data for the use cases: defining sample annotations and creating a MsBackendSql database.

library(Spectra)
library(MsBackendSql)
library(RSQLite)

mm8_file <- system.file("microtofq", "MM8.mzML", package = "msdata")
mm14_file <- system.file("microtofq", "MM14.mzML", package = "msdata")
sd <- data.frame(file = basename(c(mm8_file, mm14_file)),
                 sample_name = c("MM8", "MM14"),
                 batch = c("2021-11-12", "2021-12-11"),
                 injection_index = c(2L, 5L),
                 sample_source = c("plasma", "serum"))

## Now, storing the data to a MsBackendSql database
mm_sqlite <- tempfile()
createMsBackendSqlDatabase(dbConnect(SQLite(), mm_sqlite),
                           c(mm8_file, mm14_file), blob = TRUE)

The standard way to import MS data from e.g. the mzML files would be:

## Import from raw data files
mse <- readMsExperiment(spectraBackend = MsBackendMzR(),
                        spectraFiles = c(mm8_file, mm14_file),
                        sampleData = sd)
mse
Object of class MsExperiment 
 Spectra: MS1 (310) 
 Experiment data: 2 sample(s)
 Sample data links:
  - spectra: 2 sample(s) to 310 element(s).
## The same
mse <- readMsExperiment(c(mm8_file, mm14_file), sampleData = sd)
mse
Object of class MsExperiment 
 Spectra: MS1 (310) 
 Experiment data: 2 sample(s)
 Sample data links:
  - spectra: 2 sample(s) to 310 element(s).

The implementation for MsBackendSql simplifies the use of this type of backend (at the very bottom is an example of how that needs to be done at present, i.e. without the proposed changes).

## "Read" an MsExperiment with data from that backend. `spectraFiles` is
## used to define the mapping between samples and spectra (using `dataOrigin`).
## Additional parameters are passed to the backendInitialize method of
## MsBackendOfflineSql
mse <- readMsExperiment(MsBackendOfflineSql(), sampleData = sd,
                        spectraFiles = c(mm8_file, mm14_file),
                        drv = SQLite(), dbname = mm_sqlite)
mse
Object of class MsExperiment 
 Spectra: MS1 (310) 
 Experiment data: 2 sample(s)
 Sample data links:
  - spectra: 2 sample(s) to 310 element(s).

In addition, this would enable storing sample annotations directly in the MsBackendSql database. Storing sample annotations together with the raw MS data has the advantage that all information for one experiment is bundled together (-> data integrity!). For self-contained storage modes (such as a SQLite database file or any other SQL database) this has the clear advantage that a whole experiment could be shared as a single file.

My proposal would be to store sample annotations in the same database (but in separate database tables). This would not interfere with the standard use of the backend.

Below we write the sample annotation to an existing MsBackendSql database. This needs to be done only once, ideally right after the database was created using the createMsBackendSqlDatabase function above.

be <- backendInitialize(MsBackendOfflineSql(), dbname = mm_sqlite,
                        drv = SQLite())
be$file <- basename(dataOrigin(be))
writeSampleData(be, sampleData = sd, colname = "file", spectraVariable = "file")
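
To make the idea of the two additional tables more concrete, here is a rough sketch using plain DBI calls on an in-memory SQLite database. The table names sample_data and sample_to_msms_spectrum are the ones used in this proposal; the helper names write_sample_tables and has_sample_tables, the column names of the link table and the toy data are assumptions for illustration and not the actual writeSampleData implementation.

```r
library(DBI)
library(RSQLite)

## Hypothetical helper: store sample annotations and the sample-to-spectrum
## links in two separate tables.
write_sample_tables <- function(con, sample_data, spectrum_file, by = "file") {
    dbWriteTable(con, "sample_data", sample_data)
    ## For each spectrum, find the row of its sample (NA if there is none).
    sample_idx <- match(spectrum_file, sample_data[[by]])
    keep <- !is.na(sample_idx)
    links <- data.frame(sample_index = sample_idx[keep],
                        spectrum_index = which(keep))
    dbWriteTable(con, "sample_to_msms_spectrum", links)
    invisible(TRUE)
}

## Hypothetical check whether a database contains both tables.
has_sample_tables <- function(con)
    all(c("sample_data", "sample_to_msms_spectrum") %in% dbListTables(con))

con <- dbConnect(SQLite(), ":memory:")
sd_toy <- data.frame(file = c("MM8.mzML", "MM14.mzML"),
                     sample_name = c("MM8", "MM14"))
## Toy stand-in for the origin file of each spectrum in the database.
spectrum_file <- c("MM8.mzML", "MM8.mzML", "MM14.mzML")

write_sample_tables(con, sd_toy, spectrum_file)
has_tables <- has_sample_tables(con)

## Reading back is a plain query on each table.
sample_data_db <- dbGetQuery(con, "select * from sample_data")
links_db <- dbGetQuery(con, "select * from sample_to_msms_spectrum")
dbDisconnect(con)
```

Keeping the links in their own table means that the spectra tables themselves stay untouched.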

As a side effect, such a database could also be used by other tools, since it is a simple, plain SQL database.

dbListTables(dbconn(be))
[1] "msms_spectrum"           "msms_spectrum_peak_blob" "sample_data"            
[4] "sample_to_msms_spectrum"

The implementation of the readMsExperiment method for MsBackendSql could then also retrieve the sample annotations from the database, if present.

mse <- readMsExperiment(MsBackendOfflineSql(), dbname = mm_sqlite,
                        drv = SQLite())
mse
Object of class MsExperiment 
 Spectra: MS1 (310) 
 Experiment data: 2 sample(s)
 Sample data links:
  - spectra: 2 sample(s) to 310 element(s).

As a comparison, creating a MsExperiment with a MsBackendSql backend at present is far less user friendly (and also carries a higher chance of errors):

library(MsExperiment)

mse <- MsExperiment()

## Add Spectra
be <- backendInitialize(MsBackendOfflineSql(), drv = SQLite(),
                        dbname = mm_sqlite)
sps <- Spectra(be)
sps$file <- basename(dataOrigin(sps))
spectra(mse) <- sps

## Add samples
sampleData(mse) <- as(sd, "DataFrame")

## Link samples to spectra
mse <- linkSampleData(mse, with = "sampleData.file = spectra.file")
mse

Happy to discuss :)

jorainer commented 9 months ago

Closing this issue as this was implemented in MsExperiment.

rformassspectrometry / MsBackendSql

Store sample annotation to the database #14
