rformassspectrometry / MsBackendMassbank

A prototype backend to parse MassBank records into Spectra
https://rformassspectrometry.github.io/MsBackendMassbank/

database-side filterPrecursorMz #28

Open meowcat opened 3 years ago

meowcat commented 3 years ago

Hi,

Since precursor values are only stored as text in the MassBank database, they cannot be used directly for queries. This makes filterPrecursorMz unacceptably slow with large databases in the current implementation.

My current workaround is client-side:

        # cache precursors once (between() is from dplyr)
        ms2_precursors <- precursorMz(ms2_db)
        # ...
        # per query: select candidates within +/- 0.01 around the precursor
        range <- precursor_mz + c(-0.01, 0.01)
        ms2_candidates_ids <- which(dplyr::between(ms2_precursors, range[[1]], range[[2]]))
        ms2_candidates <- ms2_db[ms2_candidates_ids]

Still, this carries the overhead of transferring all precursor m/z values when loading the DB (I could improve this by using SQL directly rather than spectraData, which pulls them individually by spectrum ID). Ideas:
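
One possible shape for that single-round-trip fetch (a sketch only: the table and column names follow the MS_FOCUSED_ION schema used elsewhere in this thread, and `con` is assumed to be an open DBI connection):

```r
# Sketch: fetch all precursor m/z values in one SQL round trip instead of
# per-spectrum via spectraData(). MS_FOCUSED_ION / SUBTAG = 'PRECURSOR_M/Z'
# follow the MassBank schema discussed in this thread; `con` is assumed.
library(DBI)
prec <- dbGetQuery(con,
    "SELECT RECORD AS spectrum_id, VALUE AS precursor_mz_text
     FROM MS_FOCUSED_ION
     WHERE SUBTAG = 'PRECURSOR_M/Z'")
# values that are not plain numbers (ranges, annotated text) become NA here
prec$precursor_mz <- suppressWarnings(as.numeric(prec$precursor_mz_text))
```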

What do you think, do you have other suggestions?

jorainer commented 3 years ago

Hm, I thought that I'm internally converting precursor_mz_text into the precursorMz variable with as.numeric. That way you may lose entries that are not encoded as a plain number but contain additional text. That shouldn't actually be that slow?

meowcat commented 3 years ago

On regular MassBank with <100k entries (the SQL query going from one docker to the other, on the same system)

> system.time(res <- filterPrecursorMz(ms2_db, c(299, 300)))
   user  system elapsed 
  2.672   5.474  48.983

On my LipidBlast MassBank with ~600k entries, this would take 3 minutes.

In my understanding, this uses the base MsBackend implementation of filterPrecursorMz, which 1) gets all precursorMz values (performed by .spectra_data_massbank_sql -> .fetch_spectra_data_sql in this case) and then 2) filters by m/z on the R side.

My workaround caches the precursor m/z client-side. This still takes the same amount of time to fetch the precursors (~45 sec for the 100k MassBank, 3 min for the 600k LipidBlast, see below), but I only have to do it once instead of 3000 times (once for every feature I search). The subsetting to select the appropriate candidates for each query is fast, of course.

INFO [2021-03-23 15:19:31] executing action: massbank
INFO [2021-03-23 15:19:31] massbank: MS2 search
INFO [2021-03-23 15:19:32] massbank: MS2 search - caching precursors
INFO [2021-03-23 15:19:55] massbank: MS2 search - search
SUCCESS [2021-03-23 15:21:32] massbank: MS2 search - 0 hits
[...]
INFO [2021-03-23 15:31:38] executing action: lipidblast
INFO [2021-03-23 15:31:38] lipidblast: MS2 search
INFO [2021-03-23 15:31:44] lipidblast: MS2 search - caching precursors
INFO [2021-03-23 15:34:29] lipidblast: MS2 search - search
SUCCESS [2021-03-23 15:37:35] lipidblast: MS2 search - 1206 hits

However, it would be much faster and IMO more elegant to just filter precursors database-side (and form the intersection of spectrum_ids with any previous subsetting done client-side). I'm willing to work on this, since I need it, but don't want to interfere with anything you are working on...

meowcat commented 3 years ago

However it would be much faster and IMO more elegant to just filter precursors database-side

The "faster" part is at least my assumption. I still have to test it with a mockup.

meowcat commented 3 years ago

https://gist.github.com/meowcat/dd74c643c6aafd2fc7a17dec909d9737

meowcat commented 3 years ago

Without any annoying temp-table magic on the R side, but with an extra view on the MassBank side that can be cached (because it has no subselect) and therefore runs fast:

CREATE VIEW msms_precursor
AS
SELECT RECORD AS spectrum_id,
       CAST(MS_FOCUSED_ION.VALUE AS DOUBLE) AS precursor_mz
FROM MS_FOCUSED_ION
WHERE
    MS_FOCUSED_ION.SUBTAG = 'PRECURSOR_M/Z' AND
    MS_FOCUSED_ION.VALUE LIKE '%.%' AND
    MS_FOCUSED_ION.VALUE NOT LIKE '% %' AND
    MS_FOCUSED_ION.VALUE NOT LIKE '%/%'
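
With such a view in place, the database-side filter reduces to a plain range query. A sketch (the 299-300 window mirrors the filterPrecursorMz benchmark earlier in this thread):

```sql
-- Sketch: database-side precursor filter against the cached view.
-- The m/z window matches the filterPrecursorMz(ms2_db, c(299, 300)) example.
SELECT spectrum_id
FROM msms_precursor
WHERE precursor_mz BETWEEN 299 AND 300;
```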
jorainer commented 3 years ago

Honestly, I don't have a solution for this at present. I agree that filtering on the database side is faster, but you (or at least I, as the developer of Spectra) have to consider the following:

Still, as I said, I understand you completely. I'll have a thought on moving the filter... functions to the backend.

Regarding the view, I would then prefer to change the original view that we added and do the cast directly there (reporting precursor_mz as the cast value and the "original" value as precursor_mz_text).

meowcat commented 3 years ago

Hi, now I am confused; the Spectra code clearly has filterPrecursorMz on the backend...

https://github.com/rformassspectrometry/Spectra/blob/1d047b263df2913b144c812335f0b8827b037951/R/MsBackend.R#L819


#' @exportMethod filterPrecursorMz
#'
#' @importMethodsFrom ProtGenerics filterPrecursorMz
#'
#' @rdname MsBackend
setMethod("filterPrecursorMz", "MsBackend",
          function(object, mz = numeric()) {
              if (length(mz)) {
                  mz <- range(mz)
                  keep <- which(precursorMz(object) >= mz[1] &
                                precursorMz(object) <= mz[2])
                  object[keep]
              } else object
          })
meowcat commented 3 years ago

Regarding the view, I would then prefer to change the original view that we added and do the cast directly there (reporting the precursor_mz as a casted value and the "original" value as precursor_mz_text)

This doesn't work as desired, unfortunately: the original view has a number of subselects and therefore isn't cached, so each query takes 0.4 sec if I add the cast there. The msms_precursor view I specifically designed without a subselect so that it can be cached; it takes 50 ms on my system.

jorainer commented 3 years ago

Oh how embarrassing :) Thanks for reminding me 👍

jorainer commented 3 years ago

Then feel free to implement that and make a PR once you're happy with it. And surely we can ask Rene to add this view.

meowcat commented 3 years ago

Working on this. The proposed method works well and is very fast on 100k records; however, for 1e6 records it takes almost 0.5 sec per query (mostly network time, since the SQL BETWEEN query itself is still only 80 ms), so my client-side caching workaround remains faster once roughly 300 or more queries are run.

meowcat commented 3 years ago

Implementation is done (https://github.com/meowcat/MsBackendMassbank), with a parameter for backendInitialize that specifies whether to cache precursors client-side (which makes the Spectra object larger, but faster). I will hold off on a PR until the MassBank-side SQL change is discussed. Do you want any unit tests for this?

jorainer commented 3 years ago

Thanks for the update. Looking at your code, I was wondering if it would not be easier to simply store the precursorMz in the localData DataFrame? Its purpose is anyway to keep all local data (e.g. newly added spectra variables), and you would also not have to take care of subsetting etc. because that is already implemented. And if I remember correctly, when accessing spectra variables the functions first check whether they are already in @localData and only fetch them from the database if they are not.

So, I'm OK with the additional option to cache precursor m/z, but I would suggest adding them to localData instead of having another slot. Also, since you're saying that this option is faster anyway, I'm wondering if we need the SQL solution at all?

meowcat commented 3 years ago

Hi,

jorainer commented 3 years ago

localData

localData is expected to have the same number of rows as there are primary keys in the backend, and its order is expected to match. Any subsetting operation will always correctly subset the backend (i.e. the primary keys and the localData). In the backendInitialize method I would simply extract all precursor m/z values from the database (for the cache option), call as.numeric on them (this drops special cases, replacing them with NA, but those are problematic anyway) and then put them into localData as the column precursorMz. It could well be that this already gives you a built-in cache, because, as I mentioned before, any call to a spectra variable should first check whether the variable is present in localData and only go to the database if it is not.
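
A rough sketch of what that could look like during initialization (slot and helper names here are assumptions based on this discussion, not the actual MsBackendMassbank code; `con` is an open DBI connection and the msms_precursor view is assumed to exist):

```r
# Sketch only: cache precursor m/z in the backend's localData at init time.
# `object` is the backend; @spectraIds and @localData are assumed slot names.
library(DBI)
prec <- dbGetQuery(con,
    "SELECT spectrum_id, precursor_mz FROM msms_precursor")
# align the fetched values to the backend's primary-key order
idx <- match(object@spectraIds, prec$spectrum_id)
object@localData$precursorMz <- prec$precursor_mz[idx]
# subsequent precursorMz() / filterPrecursorMz() calls can then be served
# from localData without touching the database
```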

uncached SQL vs cached precursors

I think memory will not be such an issue; at present I always change the backend from MsBackendMassbankSql to MsBackendDataFrame, which considerably speeds things up.

meowcat commented 3 years ago

Hi,

I have implemented the localData solution now. Since only the numeric precursor m/z is stored, and not the string spectrum_id, the extra memory use is in fact very small, so I removed the non-cache option; and as you said, I don't need the custom filterPrecursorMz anymore.

The current variant works with the pre-existing msms_spectrum view as long as the msms_precursor view is not in MassBank yet. I also combined the query with the spectraIds query.

For <100k records, msms_spectrum vs. msms_precursor makes no practical difference (1.3 sec setup time). For 1e6 records it makes a difference even with the combined query:

However in practical terms even the slow variants are still acceptable.

More optimization could be considered, e.g. fetching the spectrum IDs directly from the RECORD table:

> system.time(spectraIds_fast <- dbGetQuery(con, "SELECT ACCESSION as spectrum_id FROM RECORD"))
   user  system elapsed 
   0.31    0.00    0.33  
> system.time(spectraIds <- dbGetQuery(con, "SELECT spectrum_id FROM msms_spectrum"))
   user  system elapsed 
   0.46    0.00   11.35 
> all(spectraIds == spectraIds_fast)
[1] TRUE

I added a unit test to check that caching works. Note: if we convert to DOUBLE on the DB side, the test will fail, because R will convert 333/222 to 333 whereas MySQL will convert it to NA and SQLite will convert it to 0. So if consistency is desired, the cached view will have to be adapted to return text format (even though this is slightly slower). One might also argue that there is no "correct" way to handle these cases and NA should always be the answer.

jorainer commented 3 years ago

Great (and sorry for the late reply)! Could you eventually make a pull request with the caching of precursor m/z in @localData?

meowcat commented 3 years ago

Yes, will do as soon as #29 is fixed so I can actually verify that my unit tests pass.

jorainer commented 3 years ago

I have a fix in. Just waiting that the GH actions run through.

jorainer commented 3 years ago

It's merged into the master branch.