jorainer closed this issue 6 years ago
I guess it could be interesting to ask "give me all compounds where this mz appears"...
Is several minutes a big deal? 8 million rows doesn't seem so bad once it is in memory.
Or did you mean simply storing it in the database in that format but converting it to the long format when it is read?
Two things: 1) The 8 million rows are for now. I expect this to increase over time with more and more spectra being added - or for other databases than HMDB that might have more spectra. 2) At present it is a 28-second query to retrieve all spectra. Well, that's way faster than importing it from the files, but still - users are usually not very patient.
Or did you mean simply storing it in the database in that format but converting it to the long format when it is read?
It is the storage of the data - in the end it's not that simple, because I have to serialize/deserialize each individual field when storing/retrieving it from the db. The idea is then to have the expanded version returned (e.g. as a Spectrum2List). But having the data as a BLOB prevents SQL queries of the form where mz > 23.129 and mz < 23.2 to retrieve only specific e.g. compounds. One would have to fetch the full data first and then subset, instead of letting SQLite do the subsetting.
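The trade-off between the two layouts can be sketched like this (Python/sqlite3 purely for illustration - the package itself is R, and the table/column names here are made up): with one row per peak the m/z filter runs in SQL, while with the BLOB layout the client has to deserialize before it can subset.

```python
import sqlite3
import pickle

con = sqlite3.connect(":memory:")

# Layout A: one row per peak -- SQL can filter on m/z directly.
con.execute("CREATE TABLE msms_spectrum_peak (spectrum_id TEXT, mz REAL, intensity REAL)")
con.executemany("INSERT INTO msms_spectrum_peak VALUES (?, ?, ?)",
                [("S1", 23.15, 100.0), ("S1", 40.2, 55.0), ("S2", 23.18, 80.0)])
hits = con.execute("SELECT DISTINCT spectrum_id FROM msms_spectrum_peak "
                   "WHERE mz > 23.129 AND mz < 23.2").fetchall()
print(hits)  # both spectra have a peak inside the m/z window

# Layout B: one row per spectrum, peak vectors serialized into a BLOB --
# any m/z filtering now has to happen in the client after deserialization.
con.execute("CREATE TABLE msms_spectrum (spectrum_id TEXT, peaks BLOB)")
con.execute("INSERT INTO msms_spectrum VALUES (?, ?)",
            ("S1", pickle.dumps(([23.15, 40.2], [100.0, 55.0]))))
mz, intensity = pickle.loads(con.execute(
    "SELECT peaks FROM msms_spectrum WHERE spectrum_id = 'S1'").fetchone()[0])
print(mz)  # [23.15, 40.2]
```

In the real package the serialization would of course be R's own (e.g. via serialize()), not pickle; the point is only which side of the connection does the subsetting.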
Is storing it as a blob even faster?
Getting the full table with m/z and intensity values as a BLOB takes ~2 seconds. But this has only ~450,000 rows (one row per spectrum) and has to be deserialized. Still, it is faster.
Coming back to your possibly interesting question: so it would be interesting to pose the question "give me all compounds with an MS/MS spectrum that has a peak at a certain m/z"? How frequently do you think such a query would be run?
Update on the time it takes to fetch the MS/MS data from the database: I compared the two functions. The first joins the two tables with the MS/MS peak data (m/z and intensity values) and the MS/MS spectrum annotations and returns the full data. The second queries the table with one line per spectrum and the m/z and intensity value vectors stored as a BLOB for each spectrum; this function also deserializes the blob and expands the returned data.frame. The result from both approaches is the same data.frame with 8,240,863 rows.
Unit: seconds
                    expr      min       lq     mean   median       uq      max neval cld
    dbGetQuery(con, q_a) 27.82240 28.98545 29.61636 30.14851 30.51334 30.87817     3   b
 .msms_spectra(con, q_b) 14.27698 14.96500 15.46380 15.65302 16.05721 16.46141     3   a
If the results are supposed to be returned as a Spectrum2List, the data.frames have to be further processed. Here the second approach has the advantage that the result from the query only has to be deserialized, not expanded.
Unit: seconds
                                                   expr       min        lq      mean    median        uq       max neval cld
                    Spectrum2List(dbGetQuery(con, q_a)) 323.01007 332.65474 344.18153 342.29940 354.76726 367.23511     3   b
 Spectrum2List(.msms_spectra(con, q_b, expand = FALSE))  25.70828  26.67545  27.11244  27.64262  27.81453  27.98643     3   a
So no question here who's the winner. Saving the m/z and intensity vectors per spectrum as BLOB outperforms storing them as individual numeric values. This is mostly because the huge data.frame has to be split per spectrum in the individual-value approach, while the BLOB way returns the data.frame already in a form that can be almost immediately converted into the Spectrum2List.
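As a rough illustration of why the per-spectrum BLOB skips the expensive split step (Python here for illustration only; in R one would use serialize()/unserialize() or writeBin()/readBin()), each BLOB can simply carry the peak count followed by the two complete vectors:

```python
import struct

def pack_peaks(mz, intensity):
    """Serialize paired m/z and intensity vectors into a single BLOB."""
    n = len(mz)
    return struct.pack(f"<I{2 * n}d", n, *mz, *intensity)

def unpack_peaks(blob):
    """Restore the two vectors from the BLOB."""
    (n,) = struct.unpack_from("<I", blob)
    values = struct.unpack_from(f"<{2 * n}d", blob, struct.calcsize("<I"))
    return list(values[:n]), list(values[n:])

blob = pack_peaks([23.15, 40.2, 58.9], [100.0, 55.0, 12.5])
mz, intensity = unpack_peaks(blob)
print(mz)         # [23.15, 40.2, 58.9]
print(intensity)  # [100.0, 55.0, 12.5]
```

One deserialization call per row replaces the per-spectrum split of an 8-million-row data.frame, which is where most of the time difference above comes from.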
Now the same performance tests for the retrieval of spectra for a single compound:
Unit: milliseconds
                                                                expr      min       lq     mean   median       uq      max neval cld
    dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'")) 1.347068 1.371294 1.438904 1.395520 1.484822 1.574123     3   a
 .msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'")) 1.047436 1.060169 1.189017 1.072901 1.259807 1.446714     3   a
BLOB is slightly faster, but no big difference. Returning a Spectrum2List instead, the BLOB approach is again faster:
Unit: milliseconds
                                                                                                expr      min       lq     mean   median      uq      max neval cld
                   Spectrum2List(dbGetQuery(con, paste0(q_a, " where compound_id='HMDB0000001'"))) 5.184055 5.230218 5.267699 5.276380 5.30952 5.342661     3   b
 Spectrum2List(.msms_spectra(con, paste0(q_b, " where compound_id='HMDB0000001'"), expand = FALSE)) 3.461168 3.574361 3.824816 3.687553 4.00664 4.325726     3   a
Summarizing, if there is no strong reason that we need direct access to the m/z and intensity values of the individual spectra, I would go for the BLOB solution.
Then by all means, Sir! Blob it! :relaxed:
m/z and intensity values for an MS/MS spectrum are now stored as a BLOB in the msms_spectrum database table. The advantage is that we now have only a single table for MS/MS spectra, with each row containing the data for one spectrum.
If in the future we see the need to search for spectra based on m/z values (in SQL) we can still change back to the two-table approach. I made sure that only moderate code changes would be necessary for that.
As a side note, one could think about generating an m/z-range index table (e.g. +/- 10 A.U. with a 1/2-width A.U. overlap) for the spectra. That would speed up fetching relevant spectra based on input m/z values without having to go fully unblobbed (is that a word?).
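The range-index idea could be sketched roughly like this (Python/sqlite3 for illustration only; the package is R and all names here are hypothetical): peaks are pre-binned into overlapping m/z buckets, so a cheap SQL lookup narrows down the candidate spectra while the peak vectors themselves stay blobbed.

```python
import sqlite3

BUCKET = 10.0   # bucket width in A.U., per the +/- 10 suggestion above
OVERLAP = 0.5   # half-width overlap between neighbouring buckets

def buckets_for(mz):
    """All bucket ids whose (overlapping) range contains this m/z value."""
    ids = {int(mz // BUCKET)}
    ids.add(int((mz - OVERLAP) // BUCKET))
    ids.add(int((mz + OVERLAP) // BUCKET))
    return ids

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mz_bucket (bucket INTEGER, spectrum_id TEXT)")
spectra = {"S1": [23.15, 40.2], "S2": [119.8], "S3": [250.1]}
for sid, mzs in spectra.items():
    for mz in mzs:
        con.executemany("INSERT INTO mz_bucket VALUES (?, ?)",
                        [(b, sid) for b in buckets_for(mz)])

# Candidate spectra for a query peak at m/z 120.1: only one bucket is touched,
# so S1 and S3 are excluded before any BLOB has to be deserialized.
rows = con.execute("SELECT DISTINCT spectrum_id FROM mz_bucket WHERE bucket = ?",
                   (int(120.1 // BUCKET),)).fetchall()
print(rows)
```

The overlap makes sure a peak sitting right on a bucket boundary is still found from either side, at the cost of a few duplicate index rows.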
The use case I am thinking of is similarity search / comparison, where you have a spectrum and would like to find the best n matches. For LC/MS this would almost entirely happen on the MS2 or higher level, which would already reduce the possible spectra by a lot since they would have to have the same or a similar parent ion. For GC/MS, though, you would want to perform similarity searches also on MS1 level data.
Thanks for the input @SiggiSmara . Once we have real use cases we can then check what makes more sense: do the search with SQL within the database, or get all spectra and do the comparison in R.
The advantage of the second approach would be that we don't have to reinvent the wheel. If I got it correctly, there is already spectrum comparison functionality in MSnbase that we could use here - that's why I'm also exporting the spectrum data as Spectrum2 objects (in this fancy Spectrum2List object).
Agree completely that there is no need to reinvent the wheel. I was more thinking of how many spectra you theoretically have to perform such similarity calculations on, and whether a range index would help you automatically exclude spectra that are not even close to meeting the criteria on the SQL side. But as you pointed out, it will become clearer whether this is an issue once people start using the db.
Hello,
I'm also trying to implement new functions for comparing spectra, which can be used with the compareSpectra function from MSnbase. This includes a modified dot-product according to GNPS (with a constant mass shift) and X-Rank. If you have other ideas for similarity calculations, please let me know. I'm also planning to have a forward and reverse dot-product.
Michael
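For readers unfamiliar with the dot-product similarity mentioned above, here is a minimal sketch (Python for illustration; the actual implementation would go through MSnbase's compareSpectra in R). The matching tolerance and intensity weighting are simplified, and the GNPS-style constant mass shift is left out:

```python
import math

def dot_product(spec_a, spec_b, tol=0.01):
    """Cosine-like similarity of two peak lists given as (mz, intensity) pairs."""
    score = 0.0
    for mz_a, int_a in spec_a:
        for mz_b, int_b in spec_b:
            if abs(mz_a - mz_b) <= tol:  # peaks match within the m/z tolerance
                score += int_a * int_b
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return score / (norm_a * norm_b) if norm_a and norm_b else 0.0

a = [(81.0, 50.0), (109.1, 100.0), (191.0, 20.0)]
b = [(81.0, 40.0), (109.1, 90.0)]
print(round(dot_product(a, b), 3))  # 0.983 -- the two main peaks match
```

A score of 1 means identical normalized spectra; the "forward and reverse" variants mentioned above differ in which spectrum's peaks drive the matching.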
Sounds great @michaelwitting 👍
After a first test with the full HMDB database (~450,000 spectra) it turned out that saving the spectra data in two tables, msms_spectrum_peak and msms_spectrum_metadata, could be overkill. Specifically, the msms_spectrum_peak table, which has one row for each m/z - intensity pair (peak), is huge (over 8 million rows), and fetching the full data takes several minutes (mostly because the two tables have to be joined).

An alternative approach would be to store each MS/MS spectrum as one row in a msms_spectrum table, with the m/z and intensity values saved as a BLOB. This would however prevent searches/subsets by m/z and intensity values directly in the database (using SQL) - something that might anyway not be that interesting.

I'll check the performance of both approaches.