meowcat opened this issue 3 years ago
Hm, I thought that I'm internally converting `precursor_mz_text` into the `precursorMz` variable with `as.numeric` - that way you may lose entries that are not encoded as a plain number but also contain some text. That shouldn't be that slow, actually?
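For illustration, a minimal sketch (in Python; the backend itself does this in R with `as.numeric`, and the example values are invented) of what such a coercion does to mixed text values:

```python
def to_numeric(text):
    """Mimic R's as.numeric(): return a float, or None (i.e. NA)
    when the string is anything other than a plain number."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

# hypothetical example values as they might appear in precursor_mz_text
values = ["299.1234", "300", "333/222", "523.3638 [M+H]+"]
print([to_numeric(v) for v in values])
# [299.1234, 300.0, None, None]
```

Anything with extra text is silently dropped to a missing value, which is exactly the "lose stuff" behaviour described above.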
On regular MassBank with <100k entries (the SQL query going from one Docker container to the other, on the same system):

```r
> system.time(res <- filterPrecursorMz(ms2_db, c(299, 300)))
   user  system elapsed
  2.672   5.474  48.983
```
On my LipidBlast MassBank with ~600k entries, this would take 3 minutes.
In my understanding, this uses the base `MsBackend` implementation of `filterPrecursorMz`, which 1) gets all precursor m/z values, performed by `.spectra_data_massbank_sql` -> `.fetch_spectra_data_sql` in this case, and then 2) filters by m/z on the R side.
My workaround caches the precursor m/z client-side. This still takes the same amount of time to fetch the precursors (~45 sec for the 100k MassBank, 3 min for the 600k LipidBlast, see below), but I have to do it once and not 3000 times (once for every feature I search). The subsetting to select the appropriate candidates for every query is fast, of course.
```
INFO    [2021-03-23 15:19:31] executing action: massbank
INFO    [2021-03-23 15:19:31] massbank: MS2 search
INFO    [2021-03-23 15:19:32] massbank: MS2 search - caching precursors
INFO    [2021-03-23 15:19:55] massbank: MS2 search - search
SUCCESS [2021-03-23 15:21:32] massbank: MS2 search - 0 hits
[...]
INFO    [2021-03-23 15:31:38] executing action: lipidblast
INFO    [2021-03-23 15:31:38] lipidblast: MS2 search
INFO    [2021-03-23 15:31:44] lipidblast: MS2 search - caching precursors
INFO    [2021-03-23 15:34:29] lipidblast: MS2 search - search
SUCCESS [2021-03-23 15:37:35] lipidblast: MS2 search - 1206 hits
```
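The caching workaround above can be sketched roughly like this (a Python illustration with an invented `PrecursorCache` class; the real workaround is R code in the backend): fetch all precursor m/z once, then answer every feature query from memory.

```python
import bisect

class PrecursorCache:
    """Illustrative sketch of the client-side workaround: fetch all
    (spectrum_id, precursor_mz) pairs once, then answer each query
    with a fast in-memory range lookup instead of a DB round trip."""
    def __init__(self, rows):
        # rows: iterable of (spectrum_id, precursor_mz); in the real
        # backend this would come from one (slow) bulk fetch.
        pairs = sorted((mz, sid) for sid, mz in rows if mz is not None)
        self._mz = [mz for mz, _ in pairs]
        self._ids = [sid for _, sid in pairs]

    def candidates(self, mz_min, mz_max):
        # binary search for the m/z window; O(log n) per query
        lo = bisect.bisect_left(self._mz, mz_min)
        hi = bisect.bisect_right(self._mz, mz_max)
        return self._ids[lo:hi]

cache = PrecursorCache([("AC0001", 299.5), ("AC0002", 310.2), ("AC0003", 299.9)])
print(cache.candidates(299, 300))  # ['AC0001', 'AC0003']
```

The one-time cache build is the ~23 s / ~3 min step in the logs above; each subsequent lookup is then essentially free.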
However, it would be much faster and IMO more elegant to just filter precursors database-side (and form the intersection of `spectrum_id`s with any previous subsetting done client-side). I'm willing to work on this, since I need it, but I don't want to interfere with anything you are working on...
> However it would be much faster and IMO more elegant to just filter precursors database-side
The "faster" part is at least my assumption. I still have to test it with a mockup.
Without any annoying temp-table magic on the R side, but with an extra view on the MassBank side, which can be cached (because there is no subselect) and so runs fast:
```sql
CREATE VIEW msms_precursor AS
SELECT RECORD AS spectrum_id,
       CAST(MS_FOCUSED_ION.VALUE AS DOUBLE) AS precursor_mz
FROM MS_FOCUSED_ION
WHERE MS_FOCUSED_ION.SUBTAG = 'PRECURSOR_M/Z'
  AND MS_FOCUSED_ION.VALUE LIKE "%.%"
  AND MS_FOCUSED_ION.VALUE NOT LIKE "% %"
  AND MS_FOCUSED_ION.VALUE NOT LIKE "%/%"
```
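To sanity-check the view logic, here is a small sketch using Python's built-in `sqlite3` against a mock `MS_FOCUSED_ION` table (SQLite spells the type `REAL` and string literals with single quotes; the table contents are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE MS_FOCUSED_ION (RECORD TEXT, SUBTAG TEXT, VALUE TEXT);
INSERT INTO MS_FOCUSED_ION VALUES
  ('AC0001', 'PRECURSOR_M/Z', '299.5'),
  ('AC0002', 'PRECURSOR_M/Z', '333/222'),   -- filtered out: contains '/'
  ('AC0003', 'PRECURSOR_M/Z', '299.9'),
  ('AC0004', 'PRECURSOR_TYPE', '[M+H]+');   -- filtered out: wrong SUBTAG
CREATE VIEW msms_precursor AS
SELECT RECORD AS spectrum_id, CAST(VALUE AS REAL) AS precursor_mz
FROM MS_FOCUSED_ION
WHERE SUBTAG = 'PRECURSOR_M/Z'
  AND VALUE LIKE '%.%'
  AND VALUE NOT LIKE '% %'
  AND VALUE NOT LIKE '%/%';
""")
rows = con.execute(
    "SELECT spectrum_id FROM msms_precursor "
    "WHERE precursor_mz BETWEEN 299 AND 300").fetchall()
print(rows)  # [('AC0001',), ('AC0003',)]
```

Only cleanly numeric values survive the `LIKE` filters, so the `BETWEEN` range query operates on a purely numeric column.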
Honestly, I don't have a solution for this at present. I agree that filtering on the database side is faster, but you (or at least I, as developer of `Spectra`) have to consider the following:

1. The results from `Spectra` have to be the same independently of the backend used. If you now filter on the SQL side there can be differences because of differences in precision of floats/doubles, and you would in addition have to consider the `ppm` and `tolerance` parameters.
2. `MsBackend` does not provide any `filter...` methods - so this would be a hack that only works on this one backend.

Still, as I said, I understand you completely. I'll have a thought on moving the `filter...` functions to the backend.
Regarding the view, I would then prefer to change the original view that we added and do the cast directly there (reporting the `precursor_mz` as a casted value and the "original" value as `precursor_mz_text`).
Hi,

now I am confused; the `Spectra` code clearly has `filterPrecursorMz` on the backend...
```r
#' @exportMethod filterPrecursorMz
#'
#' @importMethodsFrom ProtGenerics filterPrecursorMz
#'
#' @rdname MsBackend
setMethod("filterPrecursorMz", "MsBackend",
          function(object, mz = numeric()) {
              if (length(mz)) {
                  mz <- range(mz)
                  keep <- which(precursorMz(object) >= mz[1] &
                                precursorMz(object) <= mz[2])
                  object[keep]
              } else object
          })
```
> Regarding the view, I would then prefer to change the original view that we added and do the cast directly there (reporting the precursor_mz as a casted value and the "original" value as precursor_mz_text)
This doesn't work as desired, unfortunately: the original view has a number of subselects and therefore doesn't cache, making it take 0.4 sec per query if I add the cast. The `msms_precursor` view I specifically made such that it doesn't need a subselect and can therefore be cached; it takes 50 ms on my system.
Oh, how embarrassing :) Thanks for reminding me 👍
Then feel free to implement that and make a PR once you're happy with it. And surely we can ask Rene to add this view.
Working on this. The proposed method works well and is very fast on 100k records; however, for 1e6 records it takes almost 0.5 s (mostly network time, since the SQL `BETWEEN` query itself is still 80 ms), so my client-side caching workaround is still faster than this if the number of queries is ~ >300.
Implementation is done (https://github.com/meowcat/MsBackendMassbank), with a parameter for `backendInitialize` that specifies whether to cache precursors client-side (which makes the `Spectra` object larger, but faster). I will wait with a PR until the MassBank-side SQL change is discussed. Do you want any unit tests for this?
Thanks for the update. Looking at your code I was wondering if it would not be easier to simply store the `precursorMz` in the `localData` `DataFrame`? The purpose of that one is anyway to keep all local data (e.g. newly added spectra variables) - and you would also not have to take care of subsetting etc. because that is already implemented. And if I remember correctly, when accessing spectra variables the functions first check if they are already in the `@localData` and only if they are not there are they fetched from the database.

So, I'm OK with the additional option to cache precursor m/z, but I would suggest to add them to the `localData` instead of having another slot. Also, since you're saying that this option is anyway faster, I'm wondering if we need the SQL solution at all?
Hi,

> localData

I have to look into this, since I haven't looked precisely at how `localData` works. One (probably minor) issue is that not all database entries successfully return a numeric `precursorMz` (some are MS1 spectra, some are incorrectly specified or special cases), and the order of the returned `precursorMz` is not the same as the spectra. This would mean I have to do one join operation when getting the precursors. I have to test how slow this is for a million entries.

> uncached SQL vs cached precursors

In my own use case right now, I will probably use the cache anyway. However, this only pays off for a significant number of queries, since getting the cache from the view takes some ~10 seconds. It also needlessly doubles the size of the object, which may matter if unwitting users copy it around a couple of times without significantly subsetting it. I personally would leave both options in, but I leave it up to you to decide.
> localData

`localData` is thought to have the same number of rows as you have primary keys in the backend, and its order is also expected to be the same. Any subsetting operation etc. will always correctly subset the backend (i.e. the primary keys and the `localData`). I would, in the `backendInitialize` method, simply extract all precursor m/z values from the database (for the cache option), call `as.numeric` on them (this will drop special cases, replacing them with `NA`, but those are problematic anyway) and then put them into the `localData` as column `precursorMz`. Could well be that with that you already have a built-in cache, because as I mentioned before, any call to a spectra variable should first check if the spectra variable is present in the `localData` and only if it is not go to the database.
> uncached SQL vs cached precursors

I think memory will not be such an issue - at present I always change the backend from `MsBackendMassbankSql` to `MsBackendDataFrame`, which considerably speeds things up.
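The `localData`-first lookup described above can be sketched roughly like this (a Python illustration with invented names, not the real `MsBackend` API): the cache is filled once at initialization, and variable access only goes to the database on a miss.

```python
class BackendSketch:
    """Illustrative sketch of the localData pattern: spectra-variable
    access checks a local table first and only falls back to the
    database when the variable is not cached."""
    def __init__(self, db_fetch, cache_precursor_mz=False):
        self._db_fetch = db_fetch          # hypothetical fetch function
        self.local_data = {}
        if cache_precursor_mz:
            # one slow bulk fetch at initialization; the as.numeric-style
            # coercion to NA would happen here in the real backend
            self.local_data["precursorMz"] = db_fetch("precursorMz")

    def spectra_variable(self, name):
        if name in self.local_data:        # cache hit: no round trip
            return self.local_data[name]
        return self._db_fetch(name)        # cache miss: query the DB

calls = []
def fake_fetch(name):
    calls.append(name)
    return [299.5, None, 310.2]

be = BackendSketch(fake_fetch, cache_precursor_mz=True)
be.spectra_variable("precursorMz")
be.spectra_variable("precursorMz")
print(calls)  # ['precursorMz'] -- fetched once, at initialization
```

Because subsetting already keeps `localData` aligned with the primary keys, the cache needs no extra bookkeeping of its own.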
Hi,

I have implemented the `localData` solution now. Since only the numeric precursor m/z is stored now, and not the string spectrum_id, the extra memory is in fact very small. So I removed the non-cache solution, and as you said, I don't need the custom `filterPrecursorMz` anymore.

The current variant works with the pre-existing `msms_spectrum` view, as long as the `msms_precursor` view is not in MassBank yet. I also combined the query with the `spectraIds` query.

For <100k records, `msms_spectrum` vs `msms_precursor` makes no practical difference (1.3 sec setup time). For 1e6 records it makes a difference even with the combined query; the variants compared were:
- setting up the `Spectra` without caching
- getting `spectraIds` and `precursorMz` separately
- getting `spectraIds` from `msms_spectrum` and `precursorMz` from the fast `msms_precursor`

However, in practical terms even the slow variants are still acceptable.
More optimization can be considered:
```r
> system.time(spectraIds_fast <- dbGetQuery(con, "SELECT ACCESSION as spectrum_id FROM RECORD"))
   user  system elapsed
   0.31    0.00    0.33
> system.time(spectraIds <- dbGetQuery(con, "SELECT spectrum_id FROM msms_spectrum"))
   user  system elapsed
   0.46    0.00   11.35
> all(spectraIds == spectraIds_fast)
[1] TRUE
```
I added a unit test to check that caching works. Note: if we convert to DOUBLE on the DB side, the test will fail, because the engines disagree on malformed values such as `333/222`: R's `as.numeric` converts it to `NA`, while the `CAST`s in MySQL and SQLite each apply their own truncation rules and produce yet other results. So if consistency is desired, the cached view will have to be adapted to return text format (even though this is slightly slower). One might also argue that there is no "correct" way to handle these cases and `NA` should always be the answer.
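The engine-specific casting rules are easy to demonstrate; for example, SQLite's `CAST` takes the longest prefix that parses as a number and falls back to `0.0` when there is none (a sketch using Python's built-in `sqlite3`; MySQL and R's `as.numeric` each apply different rules again):

```python
import sqlite3

con = sqlite3.connect(":memory:")
for text in ["333/222", "299.5", "abc"]:
    # SQLite: longest numeric prefix, or 0.0 if no prefix parses
    (val,) = con.execute("SELECT CAST(? AS REAL)", (text,)).fetchone()
    print(f"{text!r:12} -> {val}")
# '333/222'    -> 333.0
# '299.5'      -> 299.5
# 'abc'        -> 0.0
```

Since no two engines (or R) agree on these edge cases, keeping the canonical value as text and coercing in exactly one place is the only way to guarantee backend-independent results.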
Great (and sorry for the late reply)! Could you eventually make a pull request with the caching of precursor m/z in `@localData`?
Yes, will do as soon as #29 is fixed so I can actually verify that my unit tests pass.
I have a fix in. Just waiting for the GH actions to run through.
It's merged into the master branch
Hi,

Since precursor values are only stored as text in the MassBank database, they cannot be used directly for queries. This makes `filterPrecursorMz` unacceptably slow with large databases in the current implementation. My current workaround is client-side:

Still, this carries the overhead of transferring all precursor m/z values when loading the DB (I could improve this by using SQL rather than `spectraData`, which pulls them individually by spectrum ID). Ideas:

1. Add a `VIEW`: `CREATE VIEW msms_spectrum_precursor AS SELECT *, CAST(precursor_mz_text AS FLOAT) AS precursor_mz FROM msms_spectrum`, which could then be used to query precursor values. Of course this does not help speed much, since the CAST is done on every call. (Either way this would have to be done on the MassBank side.)
2. Cache a table (`CREATE TEMPORARY TABLE`) with precursor m/z when accessing the DB.

What do you think, do you have other suggestions?
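For what it's worth, idea 2 might look roughly like this (a sketch with Python's built-in `sqlite3` and an invented mini-schema; in the real backend the session-scoped table would be created from R through DBI):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE msms_spectrum (spectrum_id TEXT, precursor_mz_text TEXT);
INSERT INTO msms_spectrum VALUES ('AC0001', '299.5'), ('AC0002', '310.2');
""")
# One-time cast into a session-scoped cache table ...
con.execute("""
CREATE TEMPORARY TABLE precursor_cache AS
SELECT spectrum_id, CAST(precursor_mz_text AS REAL) AS precursor_mz
FROM msms_spectrum
""")
# ... optionally indexed, so every subsequent precursor query
# is a cheap range scan instead of a repeated CAST over all rows.
con.execute("CREATE INDEX idx_pc_mz ON precursor_cache (precursor_mz)")
rows = con.execute(
    "SELECT spectrum_id FROM precursor_cache "
    "WHERE precursor_mz BETWEEN 299 AND 300").fetchall()
print(rows)  # [('AC0001',)]
```

Unlike the view of idea 1, the CAST is paid once per session rather than on every call, at the cost of the temp-table setup when the connection is opened.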