sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)

sqlite support? #131

Open SiggiSmara opened 7 years ago

SiggiSmara commented 7 years ago

I might be putting my foot in my mouth by proposing this, so let me know if that is the case, but have you considered adding sqlite support, either based on previous work such as the now potentially gone YAFMS (at least I can't find any source files) or mzDB?

The main selling point I see for such an implementation would be getting rid of the indexing overhead compared to reading mzML files and potentially also both faster access times and smaller file sizes if implemented in the right way.

Just to give you an idea, a very rough hack converting a few mzML files to sqlite in a similar schema as presented in YAFMS resulted in about 28% reduction in size if the binary data was zip compressed (stored in a blob). Probably mostly due to the base64 encoding on the mzML side I would guess.
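To make the size argument concrete, here is a minimal sketch in Python (standard library only) that stores the same array of m/z values once as base64 text, as in an mzML binary element, and once as a zlib-compressed blob in an SQLite table. The table and column names and the synthetic data are invented for illustration; this is not the YAFMS or mzDB schema, and the saving on real files depends on the data and on whether the mzML files were themselves written with zlib compression.

```python
import base64
import sqlite3
import struct
import zlib

# Hypothetical spectrum: 1000 m/z values stored as 64-bit little-endian floats.
mz_values = [100.0 + 0.5 * i for i in range(1000)]
raw = struct.pack(f"<{len(mz_values)}d", *mz_values)

# mzML-style text encoding: base64 turns every 3 bytes into 4 characters.
as_base64 = base64.b64encode(raw)

# SQLite-style storage: compress the raw bytes and keep them in a BLOB column.
as_blob = zlib.compress(raw)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE spectrum (id INTEGER PRIMARY KEY, mz BLOB)")
con.execute("INSERT INTO spectrum (id, mz) VALUES (?, ?)", (1, as_blob))

# Read the blob back and decode it to check that nothing was lost.
(stored,) = con.execute("SELECT mz FROM spectrum WHERE id = 1").fetchone()
restored = list(struct.unpack(f"<{len(mz_values)}d", zlib.decompress(stored)))

# Sizes in bytes: raw payload, base64 text, compressed blob.
print(len(raw), len(as_base64), len(as_blob))
```

The base64 overhead is fixed at roughly one third of the raw size, whereas the compression gain varies with the data, so a figure like the 28% above should be read as specific to those files.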

Might also be simpler to implement for writing processed/intermediate data. I figure most if not all can be written in R.

What do you think? @jotsetung

jorainer commented 7 years ago

There have already been some discussions on that. A strong argument for me to stick to mzML and mzXML is that they are relatively stable standard formats and I would avoid converting the MS data to yet another file format that is not cross platform and cross software compatible.

Regarding speed, I have compared once (indexed) SQLite database access time and access times from an original mzML file format and I did not see a large difference. I admit that this was just one simple test but since I didn't gain any significant speed improvement I didn't investigate further.

Note: there is also https://github.com/thomasp85/MSsary that did take a similar avenue saving intermediates in SQLite files.

I could imagine that we think about that in future - but at present there is other stuff that has a higher priority.

lgatto commented 7 years ago

While mzML is the format that we use (not because it is elegant and efficient, but because it is widely known and used), I think it would be interesting to have support for SQLite-based MS data. I met with the mzDB author some time ago, and I mentioned my interest, but I don't think the C++ bindings were complete or documented at the time. Also, I don't remember if the conversion to mzDB was straightforward. I have never seen any file in YAFMS.

I wouldn't have any time to start anything like that, but could help out if somebody else took the lead.

Re MSsary, it is stalled, and I think that Thomas has other interests now.

SiggiSmara commented 7 years ago

It is definitely a valid concern to introduce another format for MS data given the history of open standard MS data formats. That said, everyone is using the current formats for the reasons both of you mention, namely stable, known and used, not because they necessarily fit the purpose of data analysis.

I haven't been entirely convinced by the two approaches I mentioned above, and I haven't looked at MSsary at all so can't comment on their approach. But I think if we put our heads together and possibly get some others that we know are thinking about these things we might be able to come up with some design objectives of a new format that has the focus on data analysis.

I'd be more than happy to chip in as much as I can or take the lead if necessary. Just be aware that I am a hack, not a programmer. I have a fairly good knowledge and experience with databases, both open source and commercial ones, in designing and using them but my programming for the most part has been in python and other scripting languages (php anyone? I almost don't dare to mention such things in public). @jotsetung can attest that my R skills are very much on the beginner's side if present at all.

My current thinking is that a) it should be perhaps investigated if this improves anything in a more systematic way and b) if this is deemed something worth putting in place, to focus first on the usefulness for mzR/XCMS/MSnbase and secondly to think about if this is something useful for the general field of MS data analysis or beyond.

jorainer commented 7 years ago

I could dig out the tests I did last year to compare SQLite vs mzML and do some more checks. If this increased speed we could think of a third mode = "inSQLite" or something similar, i.e. during data processing the data is stored in SQLite file format. But then again, updating values in a SQLite database can be quite time consuming too, especially if there are indices in place.
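As a rough illustration of the trade-off mentioned here, the following Python sketch (standard library only) builds a small spectrum table with an index on retention time, queries a retention time window, and then updates a non-indexed column inside a single transaction. The schema, the column names and the helper functions are hypothetical and only serve the example; they are not part of mzR, MSnbase or any existing "inSQLite" mode.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE spectrum (
        id       INTEGER PRIMARY KEY,
        rt       REAL NOT NULL,     -- retention time in seconds
        ms_level INTEGER NOT NULL,
        tic      REAL NOT NULL      -- total ion current
    )
    """
)
# The index speeds up range queries on rt, at the cost of extra work
# whenever rows are inserted or their rt value changes.
con.execute("CREATE INDEX spectrum_rt_idx ON spectrum (rt)")

rows = [(i, 0.5 * i, 1, 1000.0 + i) for i in range(1, 2001)]
with con:  # one transaction for all inserts
    con.executemany("INSERT INTO spectrum VALUES (?, ?, ?, ?)", rows)


def spectra_in_rt_range(con, rt_min, rt_max):
    """Return the ids of all spectra with rt_min <= rt <= rt_max."""
    cur = con.execute(
        "SELECT id FROM spectrum WHERE rt BETWEEN ? AND ? ORDER BY id",
        (rt_min, rt_max),
    )
    return [row[0] for row in cur]


def rescale_tic(con, factor):
    """Multiply every tic value by factor inside a single transaction.

    tic is not part of any index, so spectrum_rt_idx is left untouched;
    updating rt instead would also require rewriting the index entries.
    """
    with con:
        con.execute("UPDATE spectrum SET tic = tic * ?", (factor,))


ids = spectra_in_rt_range(con, 10.0, 12.0)
rescale_tic(con, 2.0)
print(ids)
```

Whether this beats reading from an indexed mzML file would still have to be measured on real data, as suggested above.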

I would definitely not want to define a new standard format. Better to have something in place that works with MSnbase and (most importantly) can be exported to mzML at any time.

SiggiSmara commented 7 years ago

I would suggest using previously published data sets that have been used for speed testing. One such data set is found in the MS-Numpress paper. Direct link to it is here http://webdav.swegrid.se/snic/bils/lu_proteomics/pub/ms-numpress/. Another perhaps ignorant but related question, is reading MS numpress-ed data supported in mzR?

I'm not an advocate for a new standard (see points above), but in order to find a format that is an improvement in speed and possibly size for our work I do think it is necessary to spend time to come up with a good solution. And I agree that it should be possible to export to mzML.

david-bouyssie commented 6 years ago

Hi there,

Good news guys. We will soon update our C reader for mzDB files (https://github.com/mzdb/libmzdb). A student is helping me to implement some Rcpp bindings on top of libmzdb. We are currently facing some issues regarding the usage of Rcpp with C code (mainly regarding the usage of C structs). So any help on this side would be welcome.

Next week we will commit our new version of libmzdb and the draft of our R package stupidly named libmzdbR.

lgatto commented 6 years ago

Great news, @david-bouyssie

Next week we will commit our new version of libmzdb and the draft of our R package stupidly named libmzdbR.

I would recommend libmzdbr - changing case has proven to be challenging for users :-)

I'm looking forward to trying it out.

david-bouyssie commented 6 years ago

@lgatto ok thank you for the naming suggestion

david-bouyssie commented 6 years ago

Good news guys ;-)

The draft of our libmzdb R bindings is now on github: https://github.com/mzdb/rmzdb

This project is still experimental, but we expect to have a working version by the end of the month. If you have any advice, feel free to create an issue on the corresponding repo.

david-bouyssie commented 6 years ago

Thanks to @ValentinCamus's work, we finally have our first working version of rmzdb :)

Little things that still need to be done:

Currently unit testing is done using the Perl bindings. But it won't be too difficult to port these tests to R.

In the meantime we can discuss the possible integration in mzR.

Have a nice summer,

David

david-bouyssie commented 6 years ago

Some complementary information regarding previous remarks:

A strong argument for me to stick to mzML and mzXML is that they are relatively stable standard formats and I would avoid converting the MS data to yet another file format that is not cross platform and cross software compatible.

The mzDB specs (https://github.com/mzdb/mzdb-specs) have been unchanged for 3 years, so you can consider them stable. We have used the format in production in our lab since 2014. The new libmzdb C library aims to provide a cross-platform solution to access mzDB files. It is not yet feature complete but should be very soon.

Regarding speed, I have compared once (indexed) SQLite database access time and access times from an original mzML file format and I did not see a large difference.

SQLite doesn't really shine when dealing with full MS spectra. The main improvements of mzDB compared to *ML formats are:

I would definitely not want to define a new standard format. Better to have something in place that works with MSnbase and (most importantly) can be exported to mzML at any time.

The mzDB format is partially based on mzML. If you open an mzDB file using an SQLite viewer, you'll see that meta-data are encoded as XML chunks following the mzML specification. In other words, mzDB tables can be aligned with mzML nodes, except for the spectra data, which are stored in specific binary structures. Actually we had a tool called mzDB2mzML (https://github.com/mzdb/pwiz-mzdb) but this tool has been disabled from our build for technical reasons. FYI the build of this tool is listed in our current roadmap. However, pwiz-mzdb relies on ProteoWizard so, as you know, it's not really lightweight. Thus I'm now thinking that porting mzDB2mzML to libmzdb/rmzdb could be a much better solution.
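For readers who have not opened such a file, here is a small Python sketch (standard library only) of the general idea of keeping mzML-style cvParam elements as an XML chunk in an SQLite text column and parsing them on read. The table name, the column name `param_tree`, the `<params>` wrapper element and the helper `cv_params` are assumptions made for this illustration; please check the mzDB specs for the real layout.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative XML chunk using mzML-style cvParam elements.
PARAM_TREE = (
    "<params>"
    '<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>'
    '<cvParam cvRef="MS" accession="MS:1000130" name="positive scan" value=""/>'
    "</params>"
)

con = sqlite3.connect(":memory:")
# Hypothetical table: one row per spectrum, meta-data kept as XML text.
con.execute("CREATE TABLE spectrum (id INTEGER PRIMARY KEY, param_tree TEXT)")
con.execute("INSERT INTO spectrum VALUES (1, ?)", (PARAM_TREE,))


def cv_params(con, spectrum_id):
    """Return {accession: (name, value)} for one spectrum's XML chunk."""
    (xml_chunk,) = con.execute(
        "SELECT param_tree FROM spectrum WHERE id = ?", (spectrum_id,)
    ).fetchone()
    root = ET.fromstring(xml_chunk)
    return {
        p.get("accession"): (p.get("name"), p.get("value"))
        for p in root.iter("cvParam")
    }


params = cv_params(con, 1)
print(params["MS:1000511"])
```

Because the chunk reuses the mzML vocabulary, writing it back out as mzML is mostly a matter of placing the same elements under the right parent node, which is presumably what makes an mzDB2mzML port feasible.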

lgatto commented 6 years ago

Thank you @david-bouyssie and @ValentinCamus - looks very interesting. A few quick questions and comments.

x[[10]] ## 10th spectrum
x <- addIdentificationData(x, "x.mzid")
...

This would make it completely transparent for existing users.

PS: There could even be a

x <- readMSData(c("x1.mzML", "x2.mzML", ...), mode = "mzDB")

that would do the conversion to mzDB under the hood.

david-bouyssie commented 6 years ago

Hi Laurent,

Could you elaborate on the conversion to mzdb (from mzML for instance)

There is no problem to convert from mzML to mzDB, at least in profile mode. Indeed, in the case of Thermo data, peak picking works better if you convert directly from the raw file.

However, converting from mzDB to mzML requires fixing the tool named mzDB2mzML. It's in our roadmap but was flagged as low priority until now.

I had a quick look at the rmzdb package. One thing that you might want to consider is to write an R interface to it. By that, I mean an abstraction in R, a set of R functions that call the Rcpp module in the background.

I thought that this abstraction would be managed by mzR, which would then expose mzDB through its own API:

mz <- openMSfile(file, backend = "mzDB")

If I understand what you are suggesting, you don't want to integrate the mzDB reader as a new mzR backend but rather as a new MSnbase data source. Did I get it correctly?

jorainer commented 6 years ago

How I understood it (please correct me @david-bouyssie if I'm wrong), the idea would be to first convert the files from raw into mzDB (instead of mzML) and then load and analyze these files with MSnbase and others. For that it would indeed make more sense to add the functionality to handle these files in mzR.

There would then be a new backend in mzR with the methods header, peaks, spectra etc implemented for mzDB files as you suggested @david-bouyssie . Only, if I get it correctly, the main benefit from the mzDB files is the improved indexing: while mzML indexes only the retention time, the mzDB indexes both the retention time and the m/z (again, please correct me if I'm wrong here). Problem is that the mzR::spectra methods use only the retention time/scan index for the subsetting, so, without larger changes, we would lose the benefits from faster access by m/z subsetting. Currently, for mzML and CDF files we get a speed improvement selecting only spectra from certain rt ranges/scan indexes but have to filter the full spectrum by m/z afterwards. The mzDB backend could support both filters in one call.
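The difference between the two access patterns can be sketched in Python (standard library only) on a toy SQLite table. This is a deliberate simplification: it uses a flat, hypothetical `peak` table with an ordinary composite index rather than the bounding-box layout of mzDB, and the two function names are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE peak (rt REAL, mz REAL, intensity REAL)")
con.execute("CREATE INDEX peak_mz_rt_idx ON peak (mz, rt)")

# Synthetic data: 100 spectra (rt = 0..99 s) with 50 peaks each.
peaks = [
    (float(s), 400.0 + 2.0 * p, float(s * 50 + p))
    for s in range(100)
    for p in range(50)
]
with con:
    con.executemany("INSERT INTO peak VALUES (?, ?, ?)", peaks)


def filter_after_loading(con, rt_range, mz_range):
    """Select by rt only, then filter m/z in memory (the current pattern)."""
    rows = con.execute(
        "SELECT rt, mz, intensity FROM peak WHERE rt BETWEEN ? AND ?",
        rt_range,
    ).fetchall()
    return sorted(r for r in rows if mz_range[0] <= r[1] <= mz_range[1])


def filter_in_one_call(con, rt_range, mz_range):
    """Let the database apply both the rt and the m/z constraint."""
    rows = con.execute(
        "SELECT rt, mz, intensity FROM peak "
        "WHERE rt BETWEEN ? AND ? AND mz BETWEEN ? AND ?",
        (*rt_range, *mz_range),
    ).fetchall()
    return sorted(rows)


rt_range, mz_range = (10.0, 12.0), (420.0, 424.0)
loaded_then_filtered = filter_after_loading(con, rt_range, mz_range)
filtered_in_db = filter_in_one_call(con, rt_range, mz_range)
print(len(loaded_then_filtered), len(filtered_in_db))
```

Both functions return the same peaks; the first one simply moves far more rows out of the database before discarding most of them, which is the cost an m/z-aware backend could avoid.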

lgatto commented 6 years ago

How I understood it (please correct me @david-bouyssie if I'm wrong), the idea would be to first convert the files from raw into mzDB (instead of mzML)

It would be useful to also be able to convert from mzML to mzDB

and then load and analyze these files with MSnbase and others. For that it would indeed make more sense to add the functionality to handle these files in mzR.

I don't think we need to add anything to mzR (I would prefer not to, it's complicated enough as it is). MSnbase could depend on rmzdb for the SQLite backend and provide the interface, as it does for mzR.

david-bouyssie commented 6 years ago

@jotsetung: I do agree with all your comments/suggestions. Can't summarize better ;)

@lgatto: could we imagine an intermediate solution? We could create a package XXX that depends on mzR and rmzdb and that would act as an mzR proxy. This package would expose a subset of the mzR API (useful for MSnbase) + some useful access methods optimized for mzDB (like XIC access for instance). Then MSnbase could depend on this intermediate package. This solution would imply no modification of the mzR codebase but would provide more flexibility in the case you don't want the MSnbase features and only want to use the MS data directly. Note that instrument vendors like Bruker and AbSciex plan to deliver SQLite based formats in the future (the wiff2 format for AbSciex and the TimsTOF Pro specific format for Bruker). Bruker will provide a C based library to access their data. So we could also imagine in the future that this XXX package could also wrap the access to these other SQLite based formats. And at some point, if this wrapping package is mature enough, we could think about merging its codebase with the mzR one.

For the integration in MSnbase we will still need a small glue that would call the XXX package API to consume the data.

The main disadvantage of this proxy solution is "API maintenance": if a new method is added to mzR and is required by MSnbase then the proxy has to be updated to expose this new method. Plus, I know that mzR is already an abstraction layer over different readers and it's never good to do the same thing twice. But in this specific case I think it's worth a try.

jorainer commented 6 years ago

I think it would be cleanest to implement rmzdb support right into mzR as this is supposed to be the main interface package for MS data files. That would also significantly simplify the integration into MSnbase.

Having yet another low level data access package would be IMHO problematic: while inMem should be OK, the onDisk mode would be more complicated because of the different low level interfaces that would have to be used. For that I see 2 solutions: 1) Define yet another object similar to OnDiskMSnExp that handles data retrieval from the rmzdb. 2) Use a lot of if conditions in the data access functions to switch between the mzR and the mzdb backends. I'm not really happy with either option, which makes me believe the mzR integration (although mzR is already heavily overloaded) to be the better solution.

Re converting mzML to mzDB, that would be nice, but I haven't spotted any export functionality yet in rmzdb.

lgatto commented 6 years ago

I don't see why we would need another package. rmzdb should depend on ProtGenerics (and possibly mzR) and implement the relevant accessor methods (like mzR does) as well as its own ones, peculiar to that backend. MSnbase can then implement (using OnDiskMSnExp or, probably better, a new backend) a high level SQLite interface by depending on (or rather suggesting) rmzdb.

Once things work, I would be happy to reconsider merging rmzdb into mzR, although I am still convinced that it is better to keep them separate. The main reason I don't see why they should be merged is that one is unlikely to need the two backends simultaneously - most people will use only one of them 99% of the time. In addition, building and checking mzR is already a pain.

david-bouyssie commented 6 years ago

@jotsetung:

Having yet another low level data access package would be IMHO problematic

My idea was not to have something low level but rather high level exposing an API that satisfies the need of MSnbase.

Re converting mzML to mzDb, that would be nice, but I haven't spotted any export functionality yet in rmzdb.

Indeed, the mzML to mzDB feature is not supported yet by rmzdb. For now it's only possible by using the PWIZ-MZDB fork.

@lgatto: I see rmzdb as a thin wrapper around the libmzdb C library. Thus if we want something with a tight integration with MSnbase I would opt for a third party package doing the glue for more flexibility.

As you spotted guys, this glue could target MSnbase or mzR. I'm not the best person to decide here. However whatever the choice you make I would prefer to maximize the decoupling of the different components.

Another solution I could suggest, is to add a plugin functionality to mzR. If mzR could support pluggable backends then we could define an mzR-rmzdb-backend package that could be plugged dynamically on demand. Thus we would remove the rmzdb -> mzR dependency. I guess that a similar strategy could also be adopted for an integration in MSnbase.
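To show what such a plugin mechanism could look like in the abstract, here is a language-agnostic sketch written in Python (standard library only). Nothing in it is existing mzR, rmzdb or MSnbase API: `register_backend`, `open_ms_file`, the `Backend` protocol and `ToyMzdbBackend` are all hypothetical names, and an R version would more likely use S4 generics and a registration call in the plugin package's load hook.

```python
from typing import Callable, Dict, List, Protocol, Tuple


class Backend(Protocol):
    """Minimal accessor interface every pluggable backend must provide."""

    def header(self) -> List[dict]: ...

    def peaks(self, scan: int) -> List[Tuple[float, float]]: ...


# Registry owned by the core package; it knows nothing about concrete backends.
_BACKENDS: Dict[str, Callable[[str], Backend]] = {}


def register_backend(name: str, factory: Callable[[str], Backend]) -> None:
    """Called by a plugin package when it is loaded."""
    _BACKENDS[name] = factory


def open_ms_file(path: str, backend: str) -> Backend:
    """Open a file with a backend that was registered at run time."""
    try:
        factory = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"no backend registered under {backend!r}") from None
    return factory(path)


class ToyMzdbBackend:
    """Stand-in for a plugin; it returns canned data instead of reading a file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def header(self) -> List[dict]:
        return [{"scan": 1, "rt": 0.5, "ms_level": 1}]

    def peaks(self, scan: int) -> List[Tuple[float, float]]:
        return [(400.0, 10.0), (401.0, 20.0)]


# The plugin registers itself; the core package never imports it directly.
register_backend("mzDB", ToyMzdbBackend)

ms = open_ms_file("example.mzDB", backend="mzDB")
print(ms.header(), ms.peaks(1))
```

The point of the design is the direction of the dependency: the plugin depends on the registry, not the other way round, so the core package can be built and checked without the optional backend being installed.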