rformassspectrometry / MsBackendMassbank

A prototype backend to parse MassBank records into Spectra
https://rformassspectrometry.github.io/MsBackendMassbank/

database-side filterPrecursorMz #28

Open meowcat opened 3 years ago

meowcat commented 3 years ago

Hi,

Since precursor values are only stored as text in the MassBank database, they cannot be used directly for queries. This makes filterPrecursorMz unacceptably slow with large databases in the current implementation.

My current workaround is client-side:

        # cache precursors once (between() is from dplyr)
        ms2_precursors <- precursorMz(ms2_db)
        # ...
        # per query: select candidates within +/- 0.01 around the precursor
        range <- precursor_mz + c(-0.01, 0.01)
        ms2_candidates_ids <- which(dplyr::between(ms2_precursors, range[[1]], range[[2]]))
        ms2_candidates <- ms2_db[ms2_candidates_ids]

Still, this carries the overhead of transferring all precursor m/z values when loading the DB (I could improve this by using SQL directly rather than spectraData, which pulls them individually by spectrum ID). Ideas:
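
One possible shape for that single-round-trip fetch (a sketch only: the table and column names follow the MS_FOCUSED_ION schema used elsewhere in this thread, and `con` is assumed to be an open DBI connection):

```r
# Sketch: fetch all precursor m/z values in one SQL round trip instead of
# per-spectrum via spectraData(). MS_FOCUSED_ION / SUBTAG = 'PRECURSOR_M/Z'
# follow the MassBank schema discussed in this thread; `con` is assumed.
library(DBI)
prec <- dbGetQuery(con,
    "SELECT RECORD AS spectrum_id, VALUE AS precursor_mz_text
     FROM MS_FOCUSED_ION
     WHERE SUBTAG = 'PRECURSOR_M/Z'")
# values that are not plain numbers (ranges, annotated text) become NA here
prec$precursor_mz <- suppressWarnings(as.numeric(prec$precursor_mz_text))
```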

What do you think, do you have other suggestions?

jorainer commented 3 years ago

Hm, I thought that I'm internally converting precursor_mz_text into the precursorMz variable with as.numeric. That way you may lose entries that are not encoded as a plain number but contain additional text. That shouldn't actually be that slow?

meowcat commented 3 years ago

On regular MassBank with <100k entries (the SQL query going from one docker to the other, on the same system)

> system.time(res <- filterPrecursorMz(ms2_db, c(299, 300)))
   user  system elapsed 
  2.672   5.474  48.983

On my LipidBlast MassBank with ~600k entries, this would take 3 minutes.

In my understanding, this uses the base MsBackend implementation of filterPrecursorMz, which 1) gets all precursorMz values (performed by .spectra_data_massbank_sql -> .fetch_spectra_data_sql in this case) and then 2) filters by m/z on the R side.

My workaround caches the precursor m/z client-side. This still takes the same amount of time to fetch the precursors (~45 sec for the 100k MassBank, 3 min for the 600k LipidBlast, see below), but I only have to do it once instead of 3000 times (once for every feature I search). The subsetting to select the appropriate candidates for each query is fast, of course.

INFO [2021-03-23 15:19:31] executing action: massbank
INFO [2021-03-23 15:19:31] massbank: MS2 search
INFO [2021-03-23 15:19:32] massbank: MS2 search - caching precursors
INFO [2021-03-23 15:19:55] massbank: MS2 search - search
SUCCESS [2021-03-23 15:21:32] massbank: MS2 search - 0 hits
[...]
INFO [2021-03-23 15:31:38] executing action: lipidblast
INFO [2021-03-23 15:31:38] lipidblast: MS2 search
INFO [2021-03-23 15:31:44] lipidblast: MS2 search - caching precursors
INFO [2021-03-23 15:34:29] lipidblast: MS2 search - search
SUCCESS [2021-03-23 15:37:35] lipidblast: MS2 search - 1206 hits

However, it would be much faster and IMO more elegant to just filter precursors database-side (and form the intersection of spectrum_ids with any previous subsetting done client-side). I'm willing to work on this, since I need it, but don't want to interfere with anything you are working on...

meowcat commented 3 years ago

However it would be much faster and IMO more elegant to just filter precursors database-side

The "faster" part is at least my assumption. I still have to test it with a mockup.

meowcat commented 3 years ago

https://gist.github.com/meowcat/dd74c643c6aafd2fc7a17dec909d9737

meowcat commented 3 years ago

Without any annoying temp-table magic on the R side, but with an extra view on the MassBank side that can be cached (because it has no subselect) and therefore runs fast:

CREATE VIEW msms_precursor
AS
SELECT RECORD AS spectrum_id,
       CAST(MS_FOCUSED_ION.VALUE AS DOUBLE) AS precursor_mz
FROM MS_FOCUSED_ION
WHERE
    MS_FOCUSED_ION.SUBTAG = 'PRECURSOR_M/Z' AND
    MS_FOCUSED_ION.VALUE LIKE '%.%' AND
    MS_FOCUSED_ION.VALUE NOT LIKE '% %' AND
    MS_FOCUSED_ION.VALUE NOT LIKE '%/%'
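
With such a view in place, the database-side filter reduces to a plain range query. A sketch (the 299-300 window mirrors the filterPrecursorMz benchmark earlier in this thread):

```sql
-- Sketch: database-side precursor filter against the cached view.
-- The m/z window matches the filterPrecursorMz(ms2_db, c(299, 300)) example.
SELECT spectrum_id
FROM msms_precursor
WHERE precursor_mz BETWEEN 299 AND 300;
```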
jorainer commented 3 years ago

Honestly, I don't have a solution for this at present. I agree that filtering on the database side is faster, but you (or at least I, as the developer of Spectra) have to consider the following:

Still, as I said, I understand you completely. I'll have a thought on moving the filter... functions to the backend.

Regarding the view, I would then prefer to change the original view that we added and do the cast directly there (reporting precursor_mz as the cast value and the "original" value as precursor_mz_text).

meowcat commented 3 years ago

Hi, now I am confused; the Spectra code clearly has filterPrecursorMz on the backend...

https://github.com/rformassspectrometry/Spectra/blob/1d047b263df2913b144c812335f0b8827b037951/R/MsBackend.R#L819


#' @exportMethod filterPrecursorMz
#'
#' @importMethodsFrom ProtGenerics filterPrecursorMz
#'
#' @rdname MsBackend
setMethod("filterPrecursorMz", "MsBackend",
          function(object, mz = numeric()) {
              if (length(mz)) {
                  mz <- range(mz)
                  keep <- which(precursorMz(object) >= mz[1] &
                                precursorMz(object) <= mz[2])
                  object[keep]
              } else object
          })
meowcat commented 3 years ago

Regarding the view, I would then prefer to change the original view that we added and do the cast directly there (reporting the precursor_mz as a casted value and the "original" value as precursor_mz_text)

This doesn't work as desired, unfortunately: the original view has a number of subselects and therefore isn't cached, so each query takes 0.4 sec if I add the cast there. The msms_precursor view I specifically designed without a subselect so that it can be cached; it takes 50 ms on my system.

jorainer commented 3 years ago

Oh how embarrassing :) Thanks for reminding me 👍

jorainer commented 3 years ago

Then feel free to implement that and make a PR once you're happy with it. And surely we can ask Rene to add this view.

meowcat commented 3 years ago

Working on this. The proposed method works well and is very fast on 100k records; however, for 1e6 records it takes almost 0.5 sec per query (mostly network time, since the SQL BETWEEN query itself is still only 80 ms), so my client-side caching workaround remains faster once roughly 300 or more queries are run.

meowcat commented 3 years ago

Implementation is done (https://github.com/meowcat/MsBackendMassbank), with a parameter for backendInitialize that specifies whether to cache precursors client-side (which makes the Spectra object larger, but faster). I will hold off on a PR until the MassBank-side SQL change is discussed. Do you want any unit tests for this?

jorainer commented 3 years ago

Thanks for the update. Looking at your code, I was wondering if it would not be easier to simply store the precursorMz in the localData DataFrame? Its purpose is anyway to keep all local data (e.g. newly added spectra variables), and you would also not have to take care of subsetting etc. because that is already implemented. And if I remember correctly, when accessing spectra variables the functions first check whether they are already in @localData and only fetch them from the database if they are not.

So, I'm OK with the additional option to cache precursor m/z, but I would suggest adding them to localData instead of having another slot. Also, since you're saying that this option is faster anyway, I'm wondering if we need the SQL solution at all?

meowcat commented 3 years ago

Hi,

jorainer commented 3 years ago

localData

localData is expected to have the same number of rows as there are primary keys in the backend, and its order is expected to match. Any subsetting operation will always correctly subset the backend (i.e. the primary keys and the localData). In the backendInitialize method I would simply extract all precursor m/z values from the database (for the cache option), call as.numeric on them (this drops special cases, replacing them with NA, but those are problematic anyway) and then put them into localData as the column precursorMz. It could well be that this already gives you a built-in cache, because, as I mentioned before, any call to a spectra variable should first check whether the variable is present in localData and only go to the database if it is not.
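
A rough sketch of what that could look like during initialization (slot and helper names here are assumptions based on this discussion, not the actual MsBackendMassbank code; `con` is an open DBI connection and the msms_precursor view is assumed to exist):

```r
# Sketch only: cache precursor m/z in the backend's localData at init time.
# `object` is the backend; @spectraIds and @localData are assumed slot names.
library(DBI)
prec <- dbGetQuery(con,
    "SELECT spectrum_id, precursor_mz FROM msms_precursor")
# align the fetched values to the backend's primary-key order
idx <- match(object@spectraIds, prec$spectrum_id)
object@localData$precursorMz <- prec$precursor_mz[idx]
# subsequent precursorMz() / filterPrecursorMz() calls can then be served
# from localData without touching the database
```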

uncached SQL vs cached precursors

I think memory will not be such an issue; at present I always change the backend from MsBackendMassbankSql to MsBackendDataFrame, which considerably speeds things up.

meowcat commented 3 years ago

Hi,

I have implemented the localData solution now. Since only the numeric precursor m/z is stored, and not the string spectrum_id, the extra memory use is in fact very small, so I removed the non-cache option; and as you said, I don't need the custom filterPrecursorMz anymore.

The current variant works with the pre-existing msms_spectrum view as long as the msms_precursor view is not in MassBank yet. I also combined the query with the spectraIds query.

For <100k records, msms_spectrum vs. msms_precursor makes no practical difference (1.3 sec setup time). For 1e6 records it makes a difference even with the combined query:

However in practical terms even the slow variants are still acceptable.

More optimization could be considered, e.g. fetching the spectrum IDs directly from the RECORD table:

> system.time(spectraIds_fast <- dbGetQuery(con, "SELECT ACCESSION as spectrum_id FROM RECORD"))
   user  system elapsed 
   0.31    0.00    0.33  
> system.time(spectraIds <- dbGetQuery(con, "SELECT spectrum_id FROM msms_spectrum"))
   user  system elapsed 
   0.46    0.00   11.35 
> all(spectraIds == spectraIds_fast)
[1] TRUE

I added a unit test to check that caching works. Note: if we convert to DOUBLE on the DB side, the test will fail, because R will convert 333/222 to 333 whereas MySQL will convert it to NA and SQLite will convert it to 0. So if consistency is desired, the cached view will have to be adapted to return text format (even though this is slightly slower). One might also argue that there is no "correct" way to handle these cases and NA should always be the answer.

jorainer commented 3 years ago

Great (and sorry for the late reply)! Could you eventually make a pull request with the caching of precursor m/z in @localData?

meowcat commented 3 years ago

Yes, will do as soon as #29 is fixed so I can actually verify that my unit tests pass.

jorainer commented 3 years ago

I have a fix in. Just waiting that the GH actions run through.

jorainer commented 3 years ago

It's merged into the master branch.