Store spectra m/z and intensity values as blob

jorainer commented 6 years ago

After the first test with the full HMDB database (~450.000 spectra) it turned out that saving the spectra data in two tables msms_spectrum_peak and msms_spectrum_metadata could be an overkill. Specifically, the msms_spectrum_peak table, that has one row for each m/z - intensity pair (peak) is huge (over 8 million rows) and fetching the full data takes several minutes (mostly because the two tables have to be joined).

An alternative approach would be to store each MS/MS spectrum as one row in a msms_spectrum table, with the m/z and intensity values saved as BLOB. This would however prevent searches/subsets by m/z and intensity values directly in the database (using SQL) - something that might anyway not be that interesting.

I'll check the performance of both approaches.

stanstrup commented 6 years ago

I guess it could be interesting to ask "give me all compounds where this mz appears"...

Is several minutes a big deal? 8 million rows doesn't seem so bad once it is in memory.

Or did you mean simply storing it in the database in that format but make the long format as when it is read?

jorainer commented 6 years ago

Two things: 1) 8 million rows is for now. I expect this to increase over time with more and more spectra being added - or for other databases than HMDB that might have more spectra. 2) at present it is a 28 second query to retrieve all spectra. Well, that's way faster than importing it from the files, but still - users are usually not very patient.

Or did you mean simply storing it in the database in that format but make the long format as when it is read?

It it the storage of the data - in the end it's not that simple, because I have to serialize/deserialize each individual field when storing/retrieving it from the db. The idea is then to have the expanded version returned (e.g. as a Spectrum2List). But having the data as a BLOB prevents to have SQL queries in the form of where mz > 23.129 and mz < 23.2 to retrieve only specific e.g. compounds. One would have to fetch the full data first and then subset instead of letting SQLite doing the subsetting.

stanstrup commented 6 years ago

Is storing it as a blob even faster?

jorainer commented 6 years ago

getting the full table with m/z and intensity values as a blob takes ~ 2 seconds. But this has then only ~ 450.000 rows, one row per spectrum and has to be deserialized. Still it is faster.

Coming back to your possibly interesting question: so it would be interesting to pose the question "give me all compounds with a MS/MS spectrum that has a peak at a certain m/z"? How frequently you think such a query would be?

jorainer commented 6 years ago

Update on the timing it takes to fetch the MS/MS data from the database: I compared the two functions, the first joins the two tables with the MS/MS peak data (m/z and intensity values) and the MS/MS spectrum annotations and returns the full data. The second queries the table with one line per spectrum and the m/z and intensity value vectors stored as a BLOB for each spectrum. This function deserializes also the blob and expands the returning data.frame. The result from both approaches is the same data.frame with 8240863 rows.

Unit: seconds
                    expr      min       lq     mean   median       uq      max
    dbGetQuery(con, q_a) 27.82240 28.98545 29.61636 30.14851 30.51334 30.87817
 .msms_spectra(con, q_b) 14.27698 14.96500 15.46380 15.65302 16.05721 16.46141
 neval cld
     3   b
     3  a

If the results are supposed to be returned as a Spectrum2List the data.frames have to be further processed. Here the second approach has the advantage that the result from the query does only have to be deserialized but not expanded.

Unit: seconds
                                                   expr       min        lq
                    Spectrum2List(dbGetQuery(con, q_a)) 323.01007 332.65474
 Spectrum2List(.msms_spectra(con, q_b, expand = FALSE))  25.70828  26.67545
      mean    median        uq       max neval cld
 344.18153 342.29940 354.76726 367.23511     3   b
  27.11244  27.64262  27.81453  27.98643     3  a

So no question here who's the winner. Saving the m/z and intensity vectors per spectrum as BLOB outperforms storing them as individual numeric values. This is mostly because the huge data.frame has to be split per spectrum for the individual-value-saving while the BLOB way returns the data.frame already in a form that can be almost immediately converted into the Spectrum2List.

Now the same performance tests for the retrieval of spectra for a single compound:

Unit: milliseconds
                                                                expr      min
    dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'")) 1.347068
 .msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'")) 1.047436
       lq     mean   median       uq      max neval cld
 1.371294 1.438904 1.395520 1.484822 1.574123     3   a
 1.060169 1.189017 1.072901 1.259807 1.446714     3   a

BLOB is slightly faster, but no big difference. Returning a SpectrumList instead, the BLOB approach is again faster:

Unit: milliseconds
                                                                                                    expr
                         Spectrum2List(dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'")))
 Spectrum2List(.msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'"),      expand = FALSE))
      min       lq     mean   median      uq      max neval cld
 5.184055 5.230218 5.267699 5.276380 5.30952 5.342661     3   b
 3.461168 3.574361 3.824816 3.687553 4.00664 4.325726     3  a

Summarizing, if there is no strong reason that we need direct access to the m/z and intensity values of the individual spectra I would go for the BLOB solution.

stanstrup commented 6 years ago

Then by all means, Sir! Blob it! :relaxed:

jorainer commented 6 years ago

m/z and intensity values for a MS/MS spectrum are now stored as BLOB in the msms_spectrum database table. Advantage is that we have now only a single table for MS/MS spectra, with each row containing the data for one spectrum.

If in future we see the need/necessity to search for spectra based on m/z values (in SQL) we can still change back to the two-table approach. I made sure that only moderate code changes are necessary for that.

SiggiSmara commented 6 years ago

As a side note, one could think about generating mz-range index table (e.g. +/- 10 A.U. with a 1/2 width A.U. overlap) for the spectra that would speed up fetching relevant spectra based on the input of mz values without having to go fully unblobbed (is that a word?)

The use case I am thinking of is similarity search / comparison were you have a spectra and would like to find the best n matches. For LC/MS this would almost entirely happen on the MS2 or higher level which already would reduce the possible spectra by a lot since they would have to have the same or similar parent ion. For GC/MS though you would want to perform similarity searches also on MS1 level data.

jorainer commented 6 years ago

Thanks for the input @SiggiSmara . Once we have real use cases we can then check what makes more sense, do the search with SQL within the database, or get all spectra and do the comparison in R.

The advantage of the second approach would be that we don't have to reimplement the wheel. If I got it correctly there is already spectrum comparison functionality in MSnbase that we could use here - that's why I'm exporting the spectrum data also as Spectrum2 objects (in this fancy Spectrum2List object).

SiggiSmara commented 6 years ago

Agree completely that no need to reinvent the wheel, I was more thinking of how many spectra you theoretically have to perform such similarity calculations on and if a range index would help you automatically exclude spectra that are not even close to meeting the criteria on the SQL side. But as you pointed out it will become more clearer if this is an issue or not once people start using the db.

michaelwitting commented 6 years ago

Hello,

I'm trying to implement also new functions for comparing spectra, which can be used with the compareSpectra function from MSnbase. This includes a modified dot-product according to GNPS (with constant mass shift) and X-Rank. If you have other ideas for similarity calculations please let me know. Also planning to have a forward and reverse dot-product.

Michael

Von: "Sigurður Smárason" notifications@github.com An: "EuracBiomedicalResearch/CompoundDb" CompoundDb@noreply.github.com CC: "Subscribed" subscribed@noreply.github.com Gesendet: Freitag, 10. August 2018 10:12:21 Betreff: Re: [EuracBiomedicalResearch/CompoundDb] Store spectra m/z and intensity values as blob (#26)

Agree completely that no need to reinvent the wheel, I was more thinking of how many spectra you theoretically have to perform such similarity calculations on and if a range index would help you automatically exclude spectra that are not even close to meeting the criteria on the SQL side. But as you pointed out it will become more clearer if this is an issue or not once people start using the db.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, [ https://github.com/EuracBiomedicalResearch/CompoundDb/issues/26#issuecomment-412010341 | view it on GitHub ] , or [ https://github.com/notifications/unsubscribe-auth/ACm6csT9jm4bmjN0Xd08RY4Wi7ezXUVNks5uPUBlgaJpZM4V1Iyi | mute the thread ] .

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe Geschaeftsfuehrer: Prof. Dr. med. Dr. h. c. Matthias H. Tschoep, Heinrich Bassler, Dr. rer. nat. Alfons Enhsen Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671

jorainer commented 6 years ago

Sounds great @michaelwitting 👍

rformassspectrometry / CompoundDb

Store spectra m/z and intensity values as blob #26