Closed by lgatto 10 months ago
sounds good to me - no objections against a uniqueMsLevel method defined for MsBackend. The one for MsBackend could essentially just be unique(msLevel(object)) and each backend can implement its own if needed.
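Something along these lines would be enough for the default (just a sketch of the idea; the actual generic/method definitions may differ):

## sketch: generic plus a default method for any MsBackend
setGeneric("uniqueMsLevel", function(object, ...)
    standardGeneric("uniqueMsLevel"))
setMethod("uniqueMsLevel", "MsBackend",
          function(object, ...) unique(msLevel(object)))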
What puzzled me in the SQLite backend implementation, however, was the need for a temporary table - wouldn't a simple select distinct... where _pkey = suffice there?
Ping @plantton
The issue is that the keyword DISTINCT again doesn't work with dbBind + prepared statements. The reason is still the same as this one: the query is actually passed to SQLite many times, once per bound parameter, so DISTINCT is only applied within each individual query.
library(DBI)
library(RSQLite)
## example data files from the msdata package
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
fls2 <- dir(system.file("proteomics", package = "msdata"), full.names = TRUE)[c(3, 5)]
fls <- c(fls, fls2)
## initialize the SQLite-based backend on a local database file
conn1 <- dbConnect(SQLite(), "b1.db")
b1 <- backendInitialize(MsBackendSqlDb(), dbcon = conn1, files = fls)
## with dbBind/prepared statements the query is executed once per bound
## _pkey, so DISTINCT is applied per row rather than across all rows
qry <- dbSendQuery(b1@dbcon, "SELECT DISTINCT msLevel FROM msdata WHERE _pkey = $pkey")
#> Warning: Closing open result set, pending rows
qry <- dbBind(qry, params = list(pkey = b1@rows))
rs1 <- dbFetch(qry)
dbClearResult(qry)
table(rs1)
#> rs1
#> 1 2 3
#> 1950 933 482
## same behaviour when using dbGetQuery with a params list
rs2 <- dbGetQuery(b1@dbcon, "SELECT DISTINCT msLevel FROM msdata WHERE _pkey = $pkey",
params = list(pkey = b1@rows))
table(rs2)
#> rs2
#> 1 2 3
#> 1950 933 482
## without parameter binding the query is executed only once and DISTINCT
## works as expected: each msLevel is returned exactly once
rs3 <- dbGetQuery(b1@dbcon, "SELECT DISTINCT msLevel FROM msdata")
table(rs3)
#> rs3
#> 1 2 3
#> 1 1 1
OMG, now that's really a bummer. Good that you found this, @plantton. Would something like
dbGetQuery(b1@dbcon, paste0("select distinct msLevel from msdata where _pkey in (", paste(b1@rows, collapse = ","),")"))
also work? I am OK with the temporary table, I just find it a little complicated (and possibly slower?)
I ran a benchmark on a similar issue. Three functions are compared here: one passing the slot rows via dbBind as a prepared statement for SELECT _pkey, filterString FROM msdata; Method B, where the filterString values are pasted into the query as paste(str, collapse = "', '") and the rows as paste(rows, collapse = ", "); and one using a TEMPORARY TABLE plus a table join to finish the task. Method B is indeed faster than the other approaches.
But the critical issue with Method B is its peak RAM usage.
To estimate the RAM consumption on a huge dataset, I fitted a linear regression and extrapolated the RAM usage of Method B to a dataset with 42592313 spectra: the estimated RAM consumption is 1460.965 MiB.
There is a trade-off between speed and RAM consumption. When certain SQL keywords cannot be used together with dbBind, the better solution is to use a TEMP TABLE.
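For illustration, a minimal sketch of the TEMP TABLE approach applied to the unique MS levels (assuming RSQLite's dbWriteTable(..., temporary = TRUE); the temporary table name is only illustrative and not necessarily what the backend implementation uses):

## write the requested primary keys into a temporary table ...
pk <- data.frame(b1@rows)
names(pk) <- "_pkey"
dbWriteTable(b1@dbcon, "tmp_pkey", pk, temporary = TRUE, overwrite = TRUE)
## ... and let SQLite evaluate DISTINCT once over all selected rows via a join
dbGetQuery(b1@dbcon,
           paste0("SELECT DISTINCT m.msLevel FROM msdata m ",
                  "JOIN tmp_pkey t ON m._pkey = t._pkey"))
## clean up (the temporary table is also dropped automatically on disconnect)
dbExecute(b1@dbcon, "DROP TABLE tmp_pkey")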
Nice comparison! Thanks @plantton! This huge memory requirement is somewhat strange. It must be related to the concatenation of the primary keys... Could you please check what the peakRAM of the paste0 call on the primary keys is? Just to understand where this memory requirement comes from...
I think the concatenated integers are the reason. I tested both the paste0 and paste methods:
demoRows <- seq(5)
s0 <- paste0(demoRows, collapse = ", ")
s1 <- paste(demoRows, collapse = ", ", sep = " ")
Then just use peakRAM to capture the RAM consumption of the two methods.
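For example (a sketch using the peakRAM package; the actual test used the full vector of primary keys rather than demoRows):

library(peakRAM)
## peak RAM of the two string-concatenation calls, returned as a data.frame
peakRAM(
    paste0(demoRows, collapse = ", "),
    paste(demoRows, collapse = ", ", sep = " ")
)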
For a dataset with 42592313 spectra, the RAM consumption of paste0 is 1133 MiB, while the RAM consumption of paste is 970 MiB.
And the RAM usage can be tested in another way:
ls1 <- list(seq(20000), seq(100000), seq(200000),
            seq(1000000), seq(4000000), seq(9000000))
## object_size() (from the pryr package) reports the size of the pasted string
object_size(paste(ls1[[6]], collapse = ", ", sep = " "))
## 79.9 MB
object_size(paste0(ls1[[6]], collapse = ", "))
## 79.9 MB
So it's not surprising that the big dataset would use about 1 GB of RAM for the filter function.
new generic added to ProtGenerics 1.23.9 (also pushed to Bioc)
Imported in a new commit.
For some (large) datasets, calculating the unique MS levels can be tricky. For example, in the signature of filterRt() we have unique(msLevel(object)), which would become inefficient for an on-disk backend with large data. There's such an implementation for the SQLite backend here, and I was wondering if we might not need to generalise this?