rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
34 stars 24 forks source link

Filtering order #184

Open lgatto opened 3 years ago

lgatto commented 3 years ago

The filterDataOrigin and filterDataStorage methods indicate that they

should return the data ordered by the provided dataOrigin parameter, i.e. if dataOrigin = c("2", "1") was provided, the spectra in the resulting object should be ordered accordingly (first spectra from data origin "2" and then from "1").

@jorainer - what is the rationale for this, and why is this so important that it is stated explicitly? The reason I am asking is that these requirements add some serious time/RAM requirements when performed on the SQLite backend.

cc @plantton

plantton commented 3 years ago

Related issue: https://github.com/r-dbi/RSQLite/issues/344

jorainer commented 3 years ago

It is stated explicitely because filterFile (where these functions stem from) don't support this, i.e. if you call filterFile( ... , file = c(2, 1)) is the same than filterFile(..., file = c(1, 2)). I wanted to have it implemented in that way, because I think it is important to allow the user to filter and re-order.

Hm, I find it strange that this causes performance issues with a SQL backend - in theory, if the SQL backend contains the primary keys in memory it should be just a subsetting and re-ordering of these primary keys and that shouldn't be too slow or memory demanding.

plantton commented 3 years ago

It is stated explicitely because filterFile (where these functions stem from) don't support this, i.e. if you call filterFile( ... , file = c(2, 1)) is the same than filterFile(..., file = c(1, 2)). I wanted to have it implemented in that way, because I think it is important to allow the user to filter and re-order.

Hm, I find it strange that this causes performance issues with a SQL backend - in theory, if the SQL backend contains the primary keys in memory it should be just a subsetting and re-ordering of these primary keys and that shouldn't be too slow or memory demanding.

But sorting the _pkey, the ordering of file = c(2, 1) is also required. Normally, we can use SQL ORDER BY clause to sort the tables in SQLite; or fetch a data.frame with _pkey and The String Column, then order this data frame in R.

But the prepared statement methods in either DBI or RSQLite don't support ORDER BY clause (the related issue). That is why we have this problem now.

jorainer commented 3 years ago

OK, maybe I need to have some more insights into your current implementation. Is the data stored in a single database or one database per file? And I guess the only information that is kept in the backend is the _pkey and all data, also the dataOrigin (I guess the dataStorage is just the one database were you store the data).