rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/

Should we have a uniqueMsLevel() accessor #191

Closed lgatto closed 10 months ago

lgatto commented 3 years ago

For some (large) datasets, calculating the unique MS levels can be tricky. For example, in the signature of filterRt() we have unique(msLevel(object)) which would become inefficient for an on-disk backend with large data.

There's such an implementation for the SQLite backend here, and I was wondering if we shouldn't generalise this?
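
To illustrate, the default referred to above looks something like the sketch below (filterRt_sketch is a hypothetical stand-in, not the actual Spectra source):

## The default for msLevel. triggers a full pass over all spectra to
## collect their MS levels - cheap in memory, but expensive for an
## on-disk backend that must read every spectrum's header.
filterRt_sketch <- function(object, rt = numeric(),
                            msLevel. = unique(msLevel(object))) {
    ## ... filter spectra within rt, restricted to msLevel. ...
    object
}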

jorainer commented 3 years ago

Sounds good to me - no objections against a uniqueMsLevel method defined for MsBackend. The one for MsBackend could essentially just be unique(msLevel(object)), and each backend can implement its own if needed.
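
A minimal sketch of that default (based on this thread; the generic was later added to ProtGenerics, see below):

setGeneric("uniqueMsLevel", function(object, ...)
    standardGeneric("uniqueMsLevel"))

## Fallback for any backend: read all MS levels and deduplicate in R.
## Backends with cheaper access (e.g. SQL) can override this method.
setMethod("uniqueMsLevel", "MsBackend", function(object, ...)
    unique(msLevel(object)))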

What puzzled me in the SQLite backend implementation, however, was the need for a temporary table - wouldn't a simple select distinct... where _pkey = suffice there?

lgatto commented 3 years ago

Ping @plantton

plantton commented 3 years ago

Sounds good to me - no objections against a uniqueMsLevel method defined for MsBackend. The one for MsBackend could essentially just be unique(msLevel(object)), and each backend can implement its own if needed.

What puzzled me in the SQLite backend implementation, however, was the need for a temporary table - wouldn't a simple select distinct... where _pkey = suffice there?

The issue is that the DISTINCT keyword again doesn't work with dbBind() and prepared statements. The reason is still the same as this one: the query is actually passed to SQLite once per bound parameter, so DISTINCT is only applied within each individual single-row query.

library(RSQLite)
library(Spectra)  ## MsBackendSqlDb() comes from the SQLite backend branch discussed above

## Example mzML files shipped with msdata.
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
fls2 <- dir(system.file("proteomics", package = "msdata"), full.names = TRUE)[c(3, 5)]
fls <- c(fls, fls2)
conn1 <- dbConnect(SQLite(), "b1.db")
b1 <- backendInitialize(MsBackendSqlDb(), dbcon = conn1, files = fls)

## Prepared statement with dbBind(): the query runs once per bound _pkey,
## so DISTINCT applies within each single-row result and we get one row
## back per spectrum rather than one per MS level.
qry <- dbSendQuery(b1@dbcon, "SELECT DISTINCT msLevel FROM msdata WHERE _pkey = $pkey")
#> Warning: Closing open result set, pending rows
qry <- dbBind(qry, params = list(pkey = b1@rows))
rs1 <- dbFetch(qry)
dbClearResult(qry)
table(rs1)
#> rs1
#>    1    2    3 
#> 1950  933  482

## Same behaviour with dbGetQuery() and bound parameters.
rs2 <- dbGetQuery(b1@dbcon, "SELECT DISTINCT msLevel FROM msdata WHERE _pkey = $pkey",
                  params = list(pkey = b1@rows))
table(rs2)
#> rs2
#>    1    2    3 
#> 1950  933  482

## Without parameters, DISTINCT works as expected: one row per MS level.
rs3 <- dbGetQuery(b1@dbcon, "SELECT DISTINCT msLevel FROM msdata")
table(rs3)
#> rs3
#> 1 2 3 
#> 1 1 1

jorainer commented 3 years ago

OMG, now that's really a bummer. Good that you found this, @plantton. Would something like

dbGetQuery(b1@dbcon, paste0("select distinct msLevel from msdata where _pkey in (", paste(b1@rows, collapse = ","),")"))

also work? I am OK with the temporary table, I just find it a little complicated (and possibly slower?).

plantton commented 3 years ago

I ran a benchmark on a similar issue, comparing three functions (labelled in the plots below).

Method B is indeed faster than the other approaches.

[benchmark plot of elapsed times not preserved]

But the critical issue with Method B is its peak RAM usage.

[plot of peak RAM usage not preserved]

To estimate the RAM consumption on a huge dataset, I fitted a linear regression to predict the RAM usage of Method B on a dataset with 42,592,313 spectra. The predicted RAM consumption is 1460.965 MiB.

[regression plot not preserved]

There is a trade-off between speed and RAM consumption. As long as certain SQL keywords cannot be combined with dbBind(), the better solution is to use a TEMP TABLE.
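
For context, a minimal sketch of the TEMP TABLE approach (function and table names are mine, not the actual backend code): load the primary keys into a temporary table once, then let SQLite compute the DISTINCT in a single query.

library(DBI)
library(RSQLite)

uniqueMsLevelTemp <- function(con, rows) {
    ## Write the keys of interest into a session-local temporary table.
    dbWriteTable(con, "tmp_rows",
                 data.frame(`_pkey` = rows, check.names = FALSE),
                 temporary = TRUE, overwrite = TRUE)
    on.exit(dbExecute(con, "DROP TABLE tmp_rows"))
    ## A single DISTINCT over the join replaces thousands of bound queries.
    dbGetQuery(con, paste("SELECT DISTINCT m.msLevel FROM msdata m",
                          "JOIN tmp_rows r ON m._pkey = r._pkey"))$msLevel
}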

jorainer commented 3 years ago

Nice comparison, thanks @plantton! This huge memory requirement is somewhat strange. It must be related to the concatenation of the primary keys... Could you please check what the peakRAM of the paste0 call on the primary keys is? Just to understand where this memory requirement comes from...

plantton commented 3 years ago

I think the concatenation of the integers is the reason. I tested both the paste0 and paste approaches:

demoRows <- seq(5)  ## stand-in for the vector of primary keys
s0 <- paste0(demoRows, collapse = ", ")
s1 <- paste(demoRows, collapse = ", ", sep = " ")

Then just use peakRAM to capture the RAM consumption of the two methods.
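
The measurement itself would have been along these lines (a reconstruction; bigRows and its length are my assumption, the screenshots below show the actual figures):

library(peakRAM)
bigRows <- seq(9000000)
## peakRAM() reports elapsed time, total and peak RAM for each expression.
peakRAM(
    paste0(bigRows, collapse = ", "),
    paste(bigRows, collapse = ", ")
)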

[peakRAM output not preserved]

For a dataset with 42,592,313 spectra, the RAM consumption of paste0 is 1133 MiB.

[peakRAM output not preserved]

The RAM consumption of paste is slightly lower, at 970 MiB.

[peakRAM output not preserved]

The RAM usage can also be checked in another way, by measuring the size of the concatenated string itself:

library(pryr)  ## for object_size(); assumed - lobstr::obj_size() works too

ls1 <- list(seq(20000), seq(100000), seq(200000),
            seq(1000000), seq(4000000), seq(9000000))
object_size(paste(ls1[[6]], collapse = ", ", sep = " "))
## 79.9 MB
object_size(paste0(ls1[[6]], collapse = ", "))
## 79.9 MB

So it is not surprising that the big dataset would use over 1 GB of RAM in the filter function: the concatenated string alone accounts for several hundred MB before any intermediate copies are made.
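
A quick back-of-the-envelope check (my own arithmetic, not from the thread):

## Keys up to 42,592,313 are mostly 8 digits; with the ", " separator
## each contributes roughly 10 bytes to the concatenated string.
42592313 * 10 / 2^20  ## ~406 MiB for the final string alone
## as.character() on the key vector plus paste()'s intermediate results
## push the peak well past that, consistent with the ~1 GiB measured.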

lgatto commented 3 years ago

New generic added to ProtGenerics 1.23.9 (also pushed to Bioc).

plantton commented 3 years ago

ProtGenerics 1.23.9

Imported in a new commit.
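
For reference, re-using the new generic then amounts to a NAMESPACE import along these lines (a sketch; the exact exported name should be checked against the ProtGenerics release):

## NAMESPACE (sketch)
importFrom("ProtGenerics", "uniqueMsLevel")
export("uniqueMsLevel")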