SiggiSmara opened this issue 7 years ago
There have already been some discussions on that. A strong argument for me to stick to mzML and mzXML is that they are relatively stable standard formats, and I would avoid converting the MS data to yet another file format that is not cross-platform and cross-software compatible.
Regarding speed, I once compared (indexed) SQLite database access times to access times from an original mzML file, and I did not see a large difference. I admit that this was just one simple test, but since I didn't gain any significant speed improvement I didn't investigate further.
Note: there is also https://github.com/thomasp85/MSsary that took a similar avenue, saving intermediates in SQLite files.
I could imagine that we think about this in the future - but at present there is other stuff that has a higher priority.
While mzML is the format that we use (not because it is elegant and efficient, but because it is widely known and used), I think it would be interesting to have support for SQLite-based MS data. I met with the mzDB author some time ago, and I mentioned my interest, but I don't think the C++ bindings were complete or documented at the time. Also, I don't remember if the conversion to mzDB was straightforward. I have never seen any file in YAFMS.
I wouldn't have any time to start anything like that, but could help out if somebody else took the lead.
Re MSsary, it is stalled, and I think that Thomas has other interests now.
It is definitely a valid concern to introduce another format for MS data, given the history of open standard MS data formats. That said, everyone is using the current formats for the reasons both of you mention, namely that they are stable, known and used, not because they necessarily fit the purpose of data analysis.
I haven't been entirely convinced on the two approaches I mentioned above, and I haven't looked at MSsary at all so can't comment on their approach. But I think if we put our heads together and possibly get some others that we know are thinking about these things we might be able to come up with some design objectives of a new format that has the focus on data analysis.
I'd be more than happy to chip in as much as I can, or take the lead if necessary. Just be aware that I am a hack, not a programmer. I have fairly good knowledge of and experience with databases, both open source and commercial, in designing and using them, but my programming for the most part has been in Python and other scripting languages (PHP, anyone? I almost don't dare mention such things in public). @jotsetung can attest that my R skills are very much on the beginner's side, if present at all.
My current thinking is that a) we should perhaps investigate more systematically whether this improves anything, and b) if this is deemed worth putting in place, we should focus first on the usefulness for mzR/XCMS/MSnbase, and only secondly think about whether this is useful for the general field of MS data analysis or beyond.
I could dig out the tests I did last year comparing SQLite vs mzML and do some more checks. If this increased speed, we could think of a third mode = "inSQLite" or something similar, i.e. during data processing the data is stored in an SQLite file. Note, however, that updating values in an SQLite database can be quite time consuming too, especially if there are indices in place.
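A minimal sketch of the kind of side-by-side timing mentioned above, assuming mzR, RSQLite and microbenchmark are installed; the file names and the single-table "spectrum(scan, data)" schema are placeholders for illustration, not an existing format:

```r
library(mzR)             # Bioconductor reader for mzML
library(RSQLite)         # SQLite driver
library(microbenchmark)  # simple timing harness

ms  <- openMSfile("x.mzML")             # placeholder mzML file
con <- dbConnect(SQLite(), "x.sqlite")  # same data, one row per scan (assumed schema)

## Compare random access to a single spectrum in both representations.
microbenchmark(
  mzml   = peaks(ms, 100),              # read scan 100 via the mzML index
  sqlite = dbGetQuery(con,
      "SELECT data FROM spectrum WHERE scan = 100"),
  times  = 50
)

dbDisconnect(con)
close(ms)
```

Whether SQLite wins will depend heavily on how the spectra are stored (one blob per scan vs one row per peak) and on which indices exist.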
I would definitely not want to define a new standard format. Better to have something in place that works with MSnbase and (most importantly) can be exported to mzML at any time.
I would suggest using previously published data sets that have been used for speed testing. One such data set is found in the MS-Numpress paper; a direct link to it is here: http://webdav.swegrid.se/snic/bils/lu_proteomics/pub/ms-numpress/. Another perhaps ignorant but related question: is reading MS-Numpress-ed data supported in mzR?
I'm not an advocate for a new standard (see points above), but in order to find a format that is an improvement in speed and possibly size for our work I do think it is necessary to spend time to come up with a good solution. And I agree that it should be possible to export to mzML.
Hi there,
Good news, guys. We will soon update our C reader for mzDB files (https://github.com/mzdb/libmzdb). A student is helping me implement some Rcpp bindings on top of libmzdb. We are currently facing some issues regarding the use of Rcpp with C code (mainly regarding the use of C structs), so any help on this side would be welcome.
Next week we will commit our new version of libmzdb and the draft of our R package stupidly named libmzdbR.
Great news, @david-bouyssie
Next week we will commit our new version of libmzdb and the draft of our R package stupidly named libmzdbR.
I would recommend libmzdbr - changing case has proven to be challenging for users :-)
I'm looking forward to trying it out.
@lgatto ok, thank you for the naming suggestion
Good news guys ;-)
The draft of our libmzdb R bindings is now on GitHub: https://github.com/mzdb/rmzdb
This project is still experimental, but we expect to have a working version by the end of the month. If you have any advice, feel free to create an issue on the corresponding repo.
Thanks to @ValentinCamus work, we finally have our first working version of rmzdb :)
A few little things still need to be done: currently, unit testing is done using the Perl bindings, but it won't be too difficult to port these tests to R.
In the meantime we can discuss the possible integration in mzR.
Have a nice summer,
David
Some complementary information regarding previous remarks:
A strong argument for me to stick to mzML and mzXML is that they are relatively stable standard formats and I would avoid converting the MS data to yet another file format that is not cross platform and cross software compatible.
The mzDB specs (https://github.com/mzdb/mzdb-specs) have been unchanged for 3 years, so you can consider them stable. We have been using the format in production in our lab since 2014. The new libmzdb C library aims to provide a cross-platform solution for accessing mzDB files. It is not yet feature complete, but should be very soon.
Regarding speed, I have compared once (indexed) SQLite database access time and access times from an original mzML file format and I did not see a large difference.
SQLite doesn't really shine when dealing with full MS spectra. The main improvements of mzDB compared to the *ML formats lie in how the spectra data are stored and indexed, notably the ability to index spectra by both retention time and m/z.
I would definitely not want to define a new standard format. Better to have something in place that works with MSnbase and (most importantly) can be exported to mzML at any time.
The mzDB format is partially based on mzML. If you open an mzDB file using an SQLite viewer, you'll see that meta-data are encoded as XML chunks following the mzML specification. In other words, mzDB tables can be aligned with mzML nodes, except for the spectra data, which are stored in specific binary structures. Actually, we had a tool called mzDB2mzML (https://github.com/mzdb/pwiz-mzdb), but this tool has been disabled from our build for technical reasons. FYI, the build of this tool is listed in our current roadmap. However, pwiz-mzdb relies on ProteoWizard, so, as you know, it's not really lightweight. Thus I'm now thinking that porting mzDB2mzML to libmzdb/rmzdb could be a much better solution.
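As a quick way to see this structure, one can open an mzDB file with a plain SQLite client from R. A sketch assuming RSQLite is installed; the table and column names ("mzdb", "param_tree") are taken from mzdb-specs and should be checked against the spec before relying on them:

```r
library(RSQLite)

con <- dbConnect(SQLite(), "x.mzdb")  # any mzDB file (placeholder name)
print(dbListTables(con))              # tables that mirror mzML nodes

## Meta-data are stored as mzML-style XML chunks; the exact table and
## column names below are assumptions based on mzdb-specs.
meta <- dbGetQuery(con, "SELECT param_tree FROM mzdb LIMIT 1")
cat(substr(meta$param_tree, 1, 200), "\n")

dbDisconnect(con)
```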
Thank you @david-bouyssie and @ValentinCamus - looks very interesting. A few quick questions and comments.
- Could you elaborate on the conversion to mzdb (from mzML for instance)?
- I had a quick look at the rmzdb package. One thing that you might want to consider is to write an R interface to it. By that, I mean an abstraction in R, a set of R functions that call the Rcpp module in the background. But see the next comment.
- This could then be integrated at the MSnbase level:
x <- readMSData("x.mzdb", mode = "mzDB")
and then use the existing MSnbase interface to access the data.
x[[10]] ## 10th spectrum
x <- addIdentificationData(x, "x.mzid")
...
This would make it completely transparent for existing users.
PS: There could even be a
x <- readMSData(c("x1.mzML", "x2.mzML", ...), mode = "mzDB")
that would do the conversion to mzDB under the hood.
Hi Laurent,
Could you elaborate on the conversion to mzdb (from mzML for instance)?
There is no problem converting from mzML to mzDB, at least in profile mode. Indeed, in the case of Thermo data, peak picking works better if you convert directly from the raw file.
However, converting from mzDB to mzML requires fixing the tool named mzDB2mzML. It's on our roadmap, but was flagged as low priority until now.
I had a quick look at the rmzdb package. One thing that you might want to consider is to write an R interface to it. By that, I mean an abstraction in R, a set of R functions that call the Rcpp module in the background.
I thought that this abstraction would be managed by mzR, which would then expose mzDB through its own API:
mz <- openMSfile(file, backend = "mzDB")
If I understand what you are suggesting, you don't want to integrate the mzDB reader as a new mzR backend, but rather as a new MSnbase data source. Did I get that correctly?
How I understood it (please correct me @david-bouyssie if I'm wrong), the idea would be to first convert the files from raw into mzDB (instead of mzML) and then load and analyze these files with MSnbase and others. For that, it would indeed make more sense to add the functionality to handle these files in mzR.
There would then be a new backend in mzR with the methods header, peaks, spectra etc. implemented for mzDB files, as you suggested @david-bouyssie. Only, if I get it correctly, the main benefit of the mzDB files is the improved indexing: while mzML indexes only the retention time, mzDB indexes both the retention time and the m/z (again, please correct me if I'm wrong here). The problem is that the mzR::spectra methods use only the retention time/scan index for the subsetting, so, without larger changes, we would lose the benefit of faster access by m/z subsetting. Currently, for mzML and CDF files, we get a speed improvement by selecting only spectra from certain rt ranges/scan indexes, but have to filter the full spectrum by m/z afterwards. The mzDB backend could support both filters in one call.
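To make the two-step pattern concrete, this is roughly what rt-then-m/z filtering looks like with the current mzR API (the file name and the numeric ranges are placeholders); the combined-filter call at the end is hypothetical, not an existing function signature:

```r
library(mzR)

ms <- openMSfile("x.mzML")  # placeholder file
hd <- header(ms)

## Step 1: subset by retention time using the scan header (fast, indexed).
idx <- which(hd$retentionTime >= 100 & hd$retentionTime <= 200)

## Step 2: read each spectrum and filter by m/z in R afterwards (slow part).
pks <- lapply(idx, function(i) {
  p <- peaks(ms, i)                                 # matrix: m/z, intensity
  p[p[, 1] >= 500 & p[, 1] <= 510, , drop = FALSE]  # m/z filter done in R
})

## A hypothetical mzDB backend could push both filters into one call, e.g.:
## pks <- spectra(ms, rtRange = c(100, 200), mzRange = c(500, 510))

close(ms)
```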
How I understood it (please correct me @david-bouyssie if I'm wrong), the idea would be to first convert the files from raw into mzDB (instead of mzML)
It would be useful to also be able to convert from mzML to mzDB.
and then load and analyze these files with MSnbase and others. For that it would indeed make more sense to add the functionality to handle these files in mzR.
I don't think we need to add anything to mzR (I would prefer not to, it's complicated enough as it is). MSnbase could depend on rmzdb for the SQLite backend and provide the interface, as it does for mzR.
@jotsetung: I do agree with all your comments/suggestions. Can't summarize better ;)
@lgatto: could we imagine an intermediate solution? We could create a package XXX that depends on mzR and rmzdb and that would act as an mzR proxy. This package would expose a subset of the mzR API (useful for MSnbase) plus some access methods optimized for mzDB (like XIC access, for instance). Then MSnbase could depend on this intermediate package. This solution would imply no modification of the mzR codebase, but would provide more flexibility in case you don't want the MSnbase features and only want to use the MS data directly.
Note that instrument vendors like Bruker and AB Sciex plan to deliver SQLite-based formats in the future (the wiff2 format for AB Sciex and the timsTOF Pro specific format for Bruker). Bruker will provide a C-based library to access their data. So we could also imagine that, in the future, this XXX package could also wrap the access to these other SQLite-based formats. And at some point, if this wrapping package is mature enough, we could think about merging its codebase with the mzR one.
For the integration in MSnbase, we will still need some small glue code that calls the XXX package API to consume the data.
The main disadvantage of this proxy solution is API maintenance: if a new method is added to mzR and is required by MSnbase, then the proxy has to be updated to expose this new method. Plus, I know that mzR is already an abstraction layer over different readers, and it's never good to do the same thing twice. But in this specific case I think it's worth a try.
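A hedged sketch of what the entry point of such an XXX proxy package could look like. Every name below (openMSfile2, rmzdb::MzDb) is illustrative, not an existing API:

```r
## Dispatch on the file extension: mzDB files go to rmzdb, everything
## else falls back to mzR. Both constructor calls are assumptions.
openMSfile2 <- function(file, ...) {
  if (grepl("\\.mzdb$", file, ignore.case = TRUE)) {
    rmzdb::MzDb$new(file)       # hypothetical rmzdb constructor
  } else {
    mzR::openMSfile(file, ...)  # existing mzR backends
  }
}
```

The proxy would then define header, peaks, spectra etc. methods on both returned types, plus the mzDB-only extras (such as XIC access).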
I think it would be cleanest to implement rmzdb support right into mzR, as this is supposed to be the main interface package for MS data files. That would also significantly simplify the integration into MSnbase.
Having yet another low level data access package would be IMHO problematic: while inMem mode should be OK, the onDisk mode would be more complicated because of the different low level interfaces that would have to be used. For that I see 2 solutions:
1) Define yet another object, similar to OnDiskMSnExp, that handles data retrieval from rmzdb.
2) Use a lot of if conditions in the data access functions to switch between the mzR and the mzdb backends.
I'm not really happy with either option, which makes me believe the mzR integration (although mzR is already heavily overloaded) to be the better solution.
Re converting mzML to mzDB, that would be nice, but I haven't spotted any export functionality yet in rmzdb.
I don't see why we would need another package. rmzdb should depend on ProtGenerics (and possibly mzR) and implement the relevant accessor methods (like mzR does), as well as its own ones, peculiar to that backend. MSnbase can then implement (using OnDiskMSnExp or, probably better, a new backend) a high level SQLite interface by depending on (suggesting, actually) rmzdb.
Once things work, I would be happy to reconsider merging rmzdb into mzR, although I am still convinced that it is better to keep them separate. The main reason I don't see why they should be merged is that one is unlikely to need the two backends simultaneously - most people will use only one of them 99% of the time. In addition, building and checking mzR is already a pain.
@jotsetung:
Having yet another low level data access package would be IMHO problematic
My idea was not to have something low level, but rather something high level, exposing an API that satisfies the needs of MSnbase.
Re converting mzML to mzDb, that would be nice, but I haven't spotted any export functionality yet in rmzdb.
Indeed, the mzML to mzDB feature is not supported yet by rmzdb. For now it is only possible using the PWIZ-MZDB fork.
@lgatto: I see rmzdb as a thin wrapper around the libmzdb C library. Thus, if we want something with a tight integration with MSnbase, I would opt for a third-party package doing the glue, for more flexibility. As you both spotted, this glue could target MSnbase or mzR; I'm not the best person to decide here. However, whatever choice you make, I would prefer to maximize the decoupling of the different components.
Another solution I could suggest is to add plugin functionality to mzR. If mzR could support pluggable backends, then we could define an mzR-rmzdb-backend package that could be plugged in dynamically on demand. Thus we would remove the rmzdb -> mzR dependency. I guess that a similar strategy could also be adopted for an integration in MSnbase.
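A minimal, pure-R sketch of such a plugin registry (all names are illustrative): backend packages would register an opener function when loaded, and the host package would look it up on demand.

```r
## A tiny backend registry: packages register an opener function under a
## name, and callers pick a backend at run time. Names are illustrative.
.backends <- new.env(parent = emptyenv())

registerBackend <- function(name, opener) {
  assign(name, opener, envir = .backends)
}

openWithBackend <- function(file, backend) {
  if (!exists(backend, envir = .backends))
    stop("no such backend: ", backend)
  get(backend, envir = .backends)(file)
}

## A hypothetical mzR-rmzdb-backend package could then, in its .onLoad():
## registerBackend("mzDB", function(f) rmzdb::MzDb$new(f))
```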
I might potentially be putting my foot in my mouth by proposing this, so let me know if that is the case, but have you considered adding SQLite support, either based on previous work such as the now potentially gone YAFMS (at least I can't find any source files) or mzDB?
The main selling point I see for such an implementation would be getting rid of the indexing overhead compared to reading mzML files and potentially also both faster access times and smaller file sizes if implemented in the right way.
Just to give you an idea: a very rough hack converting a few mzML files to SQLite, with a schema similar to the one presented in YAFMS, resulted in about a 28% reduction in size when the binary data was zip compressed (stored in a blob). Probably mostly due to the base64 encoding on the mzML side, I would guess.
It might also be simpler to implement for writing processed/intermediate data. I figure most, if not all, of it can be written in R.
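Just to illustrate where that kind of saving can come from, here is a pure base-R back-of-the-envelope comparison on simulated peak data (the numbers are synthetic, not a real benchmark; base64 size is derived from its fixed 4/3 inflation ratio rather than actual encoding):

```r
set.seed(1)
n   <- 1e4
mz  <- sort(runif(n, 100, 1500))  # fake m/z values
int <- rexp(n, rate = 1e-4)       # fake intensities

## mzML stores such arrays as base64-encoded 64-bit doubles; an SQLite
## schema can store the gzip-compressed bytes directly in a BLOB.
raw_bytes   <- writeBin(c(mz, int), raw())           # 2n doubles = 16n bytes
base64_size <- ceiling(length(raw_bytes) / 3) * 4    # base64 inflates by ~4/3
blob_size   <- length(memCompress(raw_bytes, type = "gzip"))

c(raw = length(raw_bytes), base64 = base64_size, gzip_blob = blob_size)
```

Even when the doubles themselves barely compress, dropping the base64 layer alone accounts for a sizeable fraction of the reduction.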
What do you think? @jotsetung