microsoft / Qcodes

Modular data acquisition framework
http://microsoft.github.io/Qcodes/
MIT License

Loosen coupling between DataSet/Experiment and Sqlite storage #955

Open quantumkoen opened 6 years ago

quantumkoen commented 6 years ago

https://github.com/QCoDeS/Qcodes/pull/664 introduces a new DataSet and Experiment class, both of which are tightly coupled to SQLite storage. This prevents re-use of the DataSet/Experiment classes in situations where SQLite backing is undesirable (for example, when implementing a more micro-service oriented architecture).

Instead of hard-coding calls to imported non-instance functions in the DataSet object, it would make more sense to use the dependency-injection design pattern: give the constructor an object that is responsible for the persistence of the DataSet, instead of relying on free functions imported from the sqlite_base module. So basically, have an object that represents the sqlite3 db, and define as methods on that object what are now functions in the sqlite_base module.

That way, the whole DataSet class allows for customisation: simply give the constructor another object with the same interface that writes to, for example, MongoDB, memory, or HDF5. Doing things this way breaks the dependency of DataSet on sqlite_base, which makes the code more maintainable and testable, and thus more future-proof and stable.

I think the same can be said for the Experiment class.
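
For concreteness, here is a minimal sketch of the dependency-injection idea. All names here are hypothetical, not existing qcodes API; it just shows a DataSet talking to an injected backend interface instead of importing sqlite_base:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class StorageBackend(ABC):
    """Hypothetical interface: anything that can persist DataSet results."""

    @abstractmethod
    def insert_result(self, run_id: int, values: Dict[str, Any]) -> None:
        ...

    @abstractmethod
    def load_results(self, run_id: int) -> List[Dict[str, Any]]:
        ...


class InMemoryBackend(StorageBackend):
    """A trivial non-sqlite backend, useful in tests; an SqliteBackend
    would instead wrap the current sqlite_base functions as methods."""

    def __init__(self) -> None:
        self._rows: Dict[int, List[Dict[str, Any]]] = {}

    def insert_result(self, run_id: int, values: Dict[str, Any]) -> None:
        self._rows.setdefault(run_id, []).append(dict(values))

    def load_results(self, run_id: int) -> List[Dict[str, Any]]:
        return self._rows.get(run_id, [])


class DataSet:
    """The backend is injected, so DataSet never imports sqlite_base."""

    def __init__(self, run_id: int, backend: StorageBackend) -> None:
        self.run_id = run_id
        self._backend = backend

    def add_result(self, values: Dict[str, Any]) -> None:
        self._backend.insert_result(self.run_id, values)

    def get_data(self) -> List[Dict[str, Any]]:
        return self._backend.load_results(self.run_id)


ds = DataSet(run_id=1, backend=InMemoryBackend())
ds.add_result({"x": 0.1, "y": 0.2})
print(ds.get_data())  # [{'x': 0.1, 'y': 0.2}]
```

Swapping in a MongoDB or HDF5 backend would then only require implementing the same interface.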

nikhartman commented 5 years ago

This issue is quite old, but I am interested in what people think about it now. It seems there is no recent discussion about loosening the link between DataSet and sqlite. Is that something that is on the horizon?

And for my immediate use: if I wanted to work around this, would the obvious solution be to write a subscriber that fills HDF5 files (in parallel with sqlite or not) as new data comes in? Since I'm just getting started with this package, I'd like to have code that can keep up with development and not be deprecated in a few months (i.e. switching back to the old dataset).

astafan8 commented 5 years ago

@nikhartman This is definitely on the horizon, but the subject is quite complex, so I won't expand on that.

"... fills HDF5 files ..."

There are some export functions, like DataSet.get_parameter_data, which return the data in a format that you can use to write to, say, HDF5 files, if you wish. We are also working on a pandas export function.
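
As a rough sketch of that kind of post-run export (load_by_id and get_parameter_data are real qcodes calls; h5py and the HDF5 layout are my own choices here):

```python
import h5py
from qcodes.dataset.data_set import load_by_id

ds = load_by_id(42)  # run id of a finished run in the current .db file
data = ds.get_parameter_data()  # {dependent: {param_name: ndarray, ...}, ...}

with h5py.File("run_42.h5", "w") as f:
    for dependent, arrays in data.items():
        group = f.create_group(dependent)
        for name, values in arrays.items():
            group.create_dataset(name, data=values)
```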

"If I wanted to work around this ..."

Depending on how far you'd like to work around this, you can do so with varying degrees of success. Let me give some hints that should help. Since the Measurement context manager is just a normal Python context manager, the only side effect of using it without a datasaver.add_result(..) call inside is that each new context creates a new "empty" run in the database .db file. Hence, you can (with or without datasaver.add_result) do my_hdf5_file.write(....) (or whatever the syntax is) inside the context manager. And since it's just pure Python, you can do whatever you need inside and outside of the context manager. Yes, this is not consistent with the qcodes "way", but if the scope of this workaround is limited (say, you're the only one among your colleagues doing it this way), then it should not be harmful.
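
To make that concrete, a minimal sketch of writing an HDF5 copy point by point from inside the context manager. It assumes a database and experiment have already been initialised; the parameters are dummy stand-ins and the HDF5 layout is just illustrative:

```python
import h5py
import numpy as np
from qcodes.dataset.measurements import Measurement
from qcodes.instrument.parameter import Parameter

# stand-in parameters; in practice these come from your instruments
x = Parameter("x", set_cmd=None, get_cmd=None)
y = Parameter("y", get_cmd=lambda: x() ** 2)

meas = Measurement()
meas.register_parameter(x)
meas.register_parameter(y, setpoints=(x,))

# one resizable HDF5 table, grown row by row as the data comes in
with h5py.File("live_copy.h5", "w") as f, meas.run() as datasaver:
    table = f.create_dataset("xy", shape=(0, 2), maxshape=(None, 2))
    for i, setpoint in enumerate(np.linspace(0, 1, 101)):
        x(setpoint)
        value = y()
        datasaver.add_result((x, setpoint), (y, value))  # usual sqlite path
        table.resize(i + 1, axis=0)
        table[i] = (setpoint, value)  # mirror the point into HDF5
```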

As far as deprecation is concerned, just being aware of the qcodes version you are using should generally suffice. We honour deprecation periods and all of that, and we try not to introduce any "shocking updates" unless strictly necessary. The Loop, for example, is legacy, but it's still there for people who still find it useful.

nikhartman commented 5 years ago

@astafan8 That's useful information. Thanks for the reply.

That being said... the more I thought about this after posting, the less important the separation seems. You always want data to be saved as the experiment proceeds, so that nothing is lost in the event of an error or hardware failure. Using the database (as you've set it up) takes care of that problem, and adds all the nice functionality of a database if needed. It's basically what we've been doing for years with Igor .pxp files holding all the data in waves, while also saving separate data files for sane people to analyze elsewhere.

Since storage is cheap, the solution that will keep us all happy here is probably to not change anything about the database and just duplicate the data into another format at the end of the run, using something like get_parameter_data, as you've suggested.

quantumkoen commented 5 years ago

@nikhartman For what it's worth, we're currently building our own DataSet implementation that has this separation of storage backend vs DataSet interface, with the intention of using it to store our data in MongoDB, along with HDF5 and other file-based storage backends. SQLite is an inconvenient choice for our purposes. It's currently in a private repo, but we intend to open it up as soon as we have permission from the powers that be. Given that it's geared towards our use cases, though, you may want to stick with whatever qcodes is providing.

astafan8 commented 5 years ago

@quantumkoen Thanks for the update! If you make it public, it would be interesting to look at. Where can we watch so that we get notified when/if you make that code public?

nikhartman commented 5 years ago

@quantumkoen I'm curious: what makes MongoDB an improvement over SQLite for you?

quantumkoen commented 5 years ago

@astafan8 It will end up under our GitHub account; I'll try to remember to give a heads-up in this ticket as well.

@nikhartman Mostly that we have a distributed setup, where one host is connected to the equipment doing the measurement, and another host consumes data from the measurements (for plotting, for example). I know you can put an SQLite db on a network share or some such, but I've had bad experiences with that in the past (mostly because file-locking semantics and concurrent access are complex in such scenarios).

One could also argue that the dynamic nature of the data lends itself better to a document database than to the relational database that sqlite3 is. The new qcodes dataset, IIRC, tries to emulate some of that flexibility by essentially implementing a key/value store on top of a relational database.

Another reason for a more centralized storage scheme is that one might want to aggregate the data from all experiments in a lab that has, say, 10 different setups, and make those data searchable. Again, not something that SQLite is the best fit for: it does well with concurrent access on one host, but once you factor in network access, a DBMS with a TCP/IP interface is a better fit, IMHO.
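
To illustrate the point (a generic pymongo sketch, not our actual implementation; all names are made up):

```python
from pymongo import MongoClient

# one central MongoDB, reachable over TCP/IP from every setup in the lab
client = MongoClient("mongodb://lab-storage:27017")
runs = client["lab_data"]["runs"]

# each run is just a document: nested and schemaless, no table layout up front
runs.insert_one({
    "setup": "fridge_3",
    "experiment": "qubit_spectroscopy",
    "results": [{"frequency": 5.1e9, "amplitude": 0.42}],
})

# and the aggregated data is searchable across all setups
for run in runs.find({"experiment": "qubit_spectroscopy"}):
    print(run["setup"], len(run["results"]))
```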

quantumkoen commented 5 years ago

@astafan8 We've just gotten permission to open up the repo, so you can find our current work on the dataset for our use here: https://github.com/QuTech-Delft/qilib/tree/dev/src/qilib/data_set and https://github.com/QuTech-Delft/qilib/tree/feat/DEM-726/implement_data_set_io/docs/UML

It's very much a work in progress at the moment, and of course there is now also this PR in qcodes itself: https://github.com/QCoDeS/Qcodes/pull/1415, which might make our work obsolete for others (for us, having an interface that won't change often is still enough motivation to keep working on our own dataset).