Discussion: Experiment broadcast dataset namespacing

dnadlinger commented 3 years ago

The ndscan framework currently requires a patched version of sipyco and ARTIQ to run (https://github.com/OxfordIonTrapGroup/ndscan/issues/1). Apart from some minor changes to the dashboard code (to support a plugin interface), the only major pending change is the addition of dataset namespacing. The code for this has been in production use in Oxford for a long time now, so would be good to upstream. However, I have been somewhat avoiding to push it through the pipeline, as I am still not convinced it is in fact the right design, as described below.

To take a step back, the limitation at the core of this issue is that there are currently only two possible views on data produced by experiments: The local perspective, containing only datasets set or queried by the experiment, which is only available to the experiment code itself, and the global view represented by the master's dataset database sync_struct. However, for interactive use (or monitoring), it is often desirable to display the data produced by each run in real time in the form of a plot applet.

There is a fundamental mismatch to the ARTIQ dataset model here: The only way to get at the data produced by an experiment from outside the worker process (apart from opening the HDF5 file after the run is finished) is to save them to broadcast datasets. However, since there is only one global key namespace for those, there is then the risk of interference between multiple concurrently running experiments. A common example for this would be an interactively run experiment being interrupted by a periodically scheduled calibration run (the results of which would also be plotted for monitoring), or a short experiment to verify some parameter started while a long data acquisition run is going on with lower priority. If each of those experiments were to write its results to a common dataset (say, data.count_rate) to be read by plot applets, broadcast dataset clients would observe some inconsistent mix between the data from both experiments, at least transiently.

Thus, some mechanism for disambiguation between multiple runs is needed; the topic of this issue.

There are at least three approaches that come to mind:

1) Have all experiments write their results beneath some uniquely named subtree to manually disambiguate between runs. The only real option for this is to include the run ID in the key part, as otherwise multiple instances of the same experiment collide with each other. (This comes up frequently for code used both from interactive experiments and automated calibrations.) 2) Take care of the issue in the master process by offering alternate views only consisting of the datasets produced by a given run, i.e. a separate kv store per experiment that is then also followed into the global kv store (dataset hierarchy). 3) Provide a way for client code (applets, …) to directly connect to a worker process in order to subscribe to its private dataset view.

ndscan, resp. its patched ARTIQ version, currently implements option 2). A solution of some sort is definitely necessary, and the current implementation has served us well. Before proposing an "official" solution for this in upstream ARTIQ, I wanted to take a step back and evaluate the merits of each approach in a more principled fashion, in particular as 2) doesn't address another, tangentially related issue (lingering datasets), which 1) happens to also address, and 3) does not suffer from:

1) Unique dataset key prefix per experiment:

Advantages:
- Conceptual simplicity; no changes necessary to the dataset model or management code proper.
- Can be implemented in user code only.
  - Disadvantages:
  - User data analysis code needs to deal with mangled dataset keys: Using ndscan as an example, to perform some simple data analysis on the fly of a scan of a fragment with a results channel counts, users can currently write a trivial script that just reads datasets/ndscan.points.axis_0 and datasets/ndscan.points.channel_counts from the HDF5 file without any external libraries, etc. If ndscan were to emulate namespaces by e.g. writing everything under ndscan.rid_12345 instead, the user code would at least need to manually include the RID everywhere, or pull in some analysis library that figures out the root path prefix.
  - To avoid unbounded memory growth of the dataset db, there needs to be some external cleanup daemon that monitors the master schedule and removes old result data (e.g. after some fixed delay after the RID disappears from the schedule).

2) Master provides dataset namespaces:

Advantages:
- Simple interface experiment-side, works without user intervention.
- Still simple to implement.
- Lifetime management of data can very naturally be handled in the master process (a simple background timer on RID deletion).
- Solves the problem without every piece of user experiment code having to re-invent a solution.
Disadvantages:
- Even though extra namespaces can be cleanly GC'd, if datasets are merged into the global namespace by default (to keep the experiment-side API simple), uniquely named datasets still accumulate there. Having deployed this solution over the last few years, we do end up having to manually delete large datasets from time to time that end up slowing (or even completely breaking) initial dataset db sync.

3) Applets directly access data from worker processes:

Advantages:
- Can avoid broadcasting large datasets not of interest to consumers of the dataset db as a calibration value store altogether.
- Old results are automatically cleaned up.
Disadvantages:
- Adding direct access to worker processes adds some complexity to the conceptual structure.
- Worker processes now need to directly deal with messy lifetime semantics (need to keep data around after experiment end to avoid race conditions with applets that are just starting); probably meaning an internal, timed delay pipeline stage to keep the process around for some time.
- Data is gone if worker process dies.

In trying to clean up the last remaining bits of ndscan ARTIQ diff, I am currently leaning towards dropping 2) in favour of 1), as a cleanup process is only slightly more complex implementation-wise than the integrated solution (if people remember to start it), and the convenient side effect of fixing the dataset accumulation problem seems to be enough to offset having to deal with slightly annoying dataset keys when writing bare analysis scripts without a support library.

I am curious what other people think, though, in particular, if and how anybody is solving this in their own libraries. Do people have applications whether the self-consistent view into an experiment's dataset tree afforded by 2) would be useful (as not every code base is as easily adapted to deal with this as ndscan with its centrally controlled dataset sinks)?

lriesebos commented 3 years ago

I will share my thoughts from how we use datasets at Duke.

We currently use something like solution 1 with our own libraries. Though we separate plotting data and archived data. Plotting data is stored in a non-archived non-persistent broadcasted dataset with a key format like plot.{rid}.y_data. Raw data for the HDF5 archive is not broadcasted and stored under some standardized keys to find it easily when analyzing the HDF5 file. Separating plotting and archiving data allows us to process the plotting data stream separately from the raw data archiving. This works well for us, though we still have accumulating non-persistent broadcasted datasets used for plotting. Tbh, an option in the dashboard to clean all non-persistent datasets could probably solve that easily for us. Besides from plotting and archiving data we also use the persistent datasets to store configuration and calibration data shared between experiments.

Regarding the other solutions you propose, solution 2 is not fully clear to me. Does this mean that every RID has its own namespace that can read from the global namespace but is only merged into the global namespace once the experiment ends? I assume that only persistent broadcasted datasets need to be merged then? And what if you want multiple RID's to append to the same non-persistent broadcast dataset for a single applet? Would be great if you could elaborate on that.

To me, solution 3 seems to contradict the idea of the ARTIQ master managing a global dataset db which allows multiple dashboards to plot the same results etc. Not sure if that is desired.

At this moment I am totally fine with the way how datasets work. It allows us to have simple manual control over all the aspects of it. Though I am open to new ideas.

dnadlinger commented 3 years ago

Thanks for the comments.

Does this mean that every RID has its own namespace that can read from the global namespace but is only merged into the global namespace once the experiment ends?

Not quite – the datasets are also immediately visible in the global namespace. The idea when implementing this was to keep existing code work precisely the same without any changes, while also allowing extra "hygienic" access for clients that want it. To that end, the master just creates a new sync_struct Notifier for each run, and applies updates (the mods received from the workers) to both that rid-specific copy, and the global one. In the global namespace, two experiments trying to append to the same dataset, etc. produce exactly the same outcomes they currently do.

At this moment I am totally fine with the way how datasets work. It allows us to have simple manual control over all the aspects of it. Though I am open to new ideas.

This is ultimately why I have a slight preference for ditching my existing implementation of 2) and go for 1) instead, even though it means running an extra tool and having slightly messier result keys.

Plotting data is stored in a non-archived non-persistent broadcasted dataset with a key format like plot.{rid}.y_data. Raw data for the HDF5 archive is not broadcasted and stored under some standardized keys to find it easily when analyzing the HDF5 file.

This is somewhat off-topic here, but in the long run, I've found it really convenient to ensure that the HDF5 files have all the information necessary to reproduce the applet plots. In our setup, you can just do ndscan_show alice_12345 to fetch and display the plot(s) produced by RID 12345 on experiment alice. This can be quite handy when later trying to make sense of old lab book entries, or trying to figure out which RID is the one you were looking for because you forgot to note the right one down/paste a screenshot into the lab book, although that's obviously only interesting for conceptually simple experiments where the plots actually show something interesting (rather than ones just running a bunch of application circuits).

lriesebos commented 3 years ago

Thanks for elaborating. The namespace idea is clear and I have nothing against it. Though I am not sure if we would use it if available as we are satisfied with the current structure.

For the off-topic component, I am always interested to learn how other people designed their software, so let me also elaborate on that. As you know, we are not using ndscan, but ndscan_show seems quite an interesting feature. As mentioned before, our data model does not store plotting data but archives raw data in an organized manner. Then we wrote analysis classes which can extract this data from the HDF5 file. The user can choose to access raw data through this object or to call some simple plotting functions that will process and plot data using matplotlib.

charlesbaynham commented 3 years ago

Some great thoughts here.

In our code, we implement something similar to option 1: we have an interfacing library that handles the namespacing of datasets and the building of plots. This doesn't store by RID however - it stores datasets by the name of the calling Experiment. This way library code that is reused in multiple Experiments doesn't collide as long as the calling Experiment is different. This approach would fail if two of the same Experiment was run simultaneously, but that hasn't been a problem yet.

Re plotting data, we store all raw data in the HDF file, so any plots which are shown can be reproduced from the archives. The interface library ensures that all plots have the RID in their title so screengrabs in the lab book can be reproduced. Most of our plots, however, come either from post-processing the HDF results or from Grafana plots of database data (which we use for long-running experiments, replacing the dataset system).

Considering the options you presented, option 3 seems to me to be quite complex. Adding the ability for applets to break into "local" storage would effectively mean that there is no longer any local storage in terms of implementation. Any dataset might need to be broadcasted on request, and neither the experiment nor the master would know which would be requested.

Option 1 seems simplest to me. Are you proposing that this be handled in user code? Alternatively get_dataset could be given a flag namespaced=False which, if True, appends the appropriate namespacing pattern automatically. It definitely should be opt-in though, so that global storage of e.g. calibration data is still possible.

ljstephenson commented 3 years ago

I'll echo most sentiments in that option 1 looks to be the most sensible, and I think that RID is preferable to experiment class name in case of collisions. I like @charlesbaynham 's idea for a namespaced keyword argument too -- an extension to this might be that only the master's version was mangled, so that client analysis code for the HDF5 wouldn't need to know whether or not a given dataset had been namespaced. That said, I suppose if the mangling is a library function based on RID/class name, all of these things are available in the HDF5 so implementing another function that can do the same directly from the file is trivial.

A cleanup daemon looks like a useful feature regardless!

charlesbaynham commented 3 years ago

One point in favour of namespacing by experiment name which I forgot to mention is that cleanup isn't necessary, since previous datasets are replaced by later ones. It also means that applets that are launched always show the latest data, without needing to stop them and start the new one.

For the argument namespaced, it could actually take a string (default "") so that users could pass e.g. "{rid}.{exp_id}". This would have .format() called on it with an appropriate dict of useful values, so that the user could choose their naming scheme. If we wanted, the HDF file could have the original dataset names without their namespace prefix.

dnadlinger commented 3 years ago

One point in favour of namespacing by experiment name which I forgot to mention is that cleanup isn't necessary, since previous datasets are replaced by later ones. It also means that applets that are launched always show the latest data, without needing to stop them and start the new one.

Cleanup is still necessary if you have many different experiments, or some of the datasets are large and seldomly used. This admittedly sounds a bit contrived, but we actually run into this in practice here from time to time. This is with option 2) implemented (which we've had deployed for a couple of years), i.e. without any extra namespacing. Datasets pile up in the global namespace due to different ndscan result channel names (things like count_rate, p, correlated_phase, …), increasing memory consumption and leading to embarrassingly slow dashboard starts. At some point, it even becomes impossible to start new dashboards altogether, as SiPyCo tries to send the whole contents in a single init mod, which it then rejects as being too large.

dnadlinger commented 3 years ago

Option 1 seems simplest to me. Are you proposing that this be handled in user code?

Yes, the namespace would just be created explicitly by user code; e.g. ndscan would just save your count_rate ResultChannel values to ndscan.rid_1234.points.channel_count_rate instead of ndscan.points.channel_count_rate. ndscan would also supply something like a ndscan_dataset_janitor executable that connects to the master and takes care of discarding old ndscan.rid_* trees after a while.

I don't know if it is simpler than (2), but it's the most standalone one, allowing (but also requiring) everyone to solve this issue in their own way.

Alternatively get_dataset could be given a flag namespaced=False […]

I'm not sure I follow here. In virtually all instances, the experiment code affected by namespacing would be set_dataset-ing values, not retrieving them. The latter would be done by applets, and analysis scripts consuming the HDF5 files.

charlesbaynham commented 3 years ago

For the namespaced flag I was mentioning you're right, it's most relevant for set_dataset, setattr_dataset, mutate_dataset and append_to_dataset, though you'd probably want it for get_dataset too. In its dumbest implementation it'd just rewrite key for you by inserting the RID. A slightly smarter version would do that for broadcasted datasets, but stop the prefix from ending up in the HDF5 files so that analysis scripts don't need to worry about it.

If namespaces became an official part of ARTIQ then the cleanup policy could be too. That might avoid the need for a cleanup daemon, since you could set e.g. "keep namespaced datasets from the last N experiments" which would get evaluated after any experiment completes. You'd probably want to leave non-namespaced datasets alone.

philipkent commented 3 years ago

I would also be in favor of option 1 along with an optional namespace argument to set_dataset, get_dataset, etc. This is essentially how we handle namespacing to avoid collisions. Each of our library classes, scan classes, etc. define their own unique namespace and we use a wrapper class for calling dataset methods that prepends that namespace to the key.

We ran into the same issue of a higher priority experiment sharing a namespace with a paused experiment, meaning the higher priority experiment would overwrite the paused experiment's data. Our solution is to keep a copy of the paused experiment's data on the client and simply broadcast that data back to the master when the paused experiment resumes. Option 2 seems like a more elegant solution and would reduce our code complexity but would be more of a "nice to have" for us.

If some labs wanted to implement more advanced namespacing (such as automatically appending the RID or experiment name to the key) or automatic collision handling (such as option 2), maybe a compromise is to allow the ARTIQ dataset handling to be extended by optional "plug-ins". A lab could write simple extensions to ARTIQ that define how default namespaces are created, e.g. always prepend the RID to key, or even write their own dataset handling "engine" that would implement option 2. Extensions could also be published in the main ARTIQ repository in case other labs find them useful.

dnadlinger commented 3 years ago

I would also be in favor of option 1 along with an optional namespace argument to set_dataset, get_dataset, etc.

Could somebody arguing in favour of this sketch out a proposal of how this would actually look like in practice (maybe an actual implementation, or just a detailed description of the behaviour of the various arguments)? From my perspective, the distinction between option 1 and option 2 was precisely that the first doesn't have any support in the master code (and conceptual structure of the dataset db), preferring to do everything by convention client-side instead.

I guess that what you and others are thinking about is sort of a hybrid between my above options 1 and 2, in that the master handles namespaces (like in option 2)), but they are explicitly exposed as subtrees of the dataset db with a certain prefix, rather than published as a different struct?

Our solution is to keep a copy of the paused experiment's data on the client

By "on the client", do you mean in the experiment code (i.e. the paused worker process)? Or which other client are you referring to?

philipkent commented 3 years ago

By "on the client", do you mean in the experiment code (i.e. the paused worker process)? Or which other client are you referring to? Yes, in the paused worker process we keep a copy of everything that will be overwritten and then broadcast those values back to the master when the paused process resumes.

philipkent commented 3 years ago

Could somebody arguing in favour of this sketch out a proposal of how this would actually look like in practice (maybe an actual implementation, or just a detailed description of the behaviour of the various arguments)? From my perspective, the distinction between option 1 and option 2 was precisely that the first doesn't have any support in the master code (and conceptual structure of the dataset db), preferring to do everything by convention client-side instead.

My take on this was that determining how datasets are namespaced would be handled exclusively by user code, as in option 1, but it might be nice to have some commonly used identifiers inserted into the namespace automatically by master; such as the RID or experiment name. master would not change from how it currently stores datasets, i.e. there would still be only one global key-value store, but master would simply replace certain tokens in the namespace argument to get_dataset, set_dataset etc. and prepend that string to the dataset key (when requested by the user) before writing to the kv store.

For example:

self.set_dataset('microwaves.stretch.frequency', 1*GHz, namespace="{rid}")

Would instruct the master to store 1*GHz in the global kv store at key=12345.microwaves.stretch.frequency if the RID is 12345.

master would then handle the replacing of tokens in the namespace argument and would also handle deciding if the namespace should be prepended to the key argument or not by doing something like:

# namespace, key, and value variables are set to the values passed by the user to set_dataset, etc....
if namespace is not False:
   namespace = namespace.replace("{rid}", scheduler.rid)
   key = '{0}.{1}'.format(namespace, key)
   set_dataset(key, value)
else:
   set_dataset(key, 1*GHz)

This seems like it would give the most flexibility to the user on how "collisions" between keys are handled since they are still implementing their own namespacing scheme. I guess the namespace argument itself is really a convenience since we can easily create the key in user code using '{}.microwaves.stretch.frequency'.format(self.scheduler.rid). Handling cases where you might want to dynamically turn off namespacing for what ever reason starts to complicate the user code, though, since you would need implement the if/else cases above in user code.

Maybe I've misunderstood what other's had in mind with the namespace argument to get_dataset, set_dataset, etc. but this is how I understood the suggestion.

charlesbaynham commented 3 years ago

@philipkent That's more or less what I was thinking too. As presented, it's a convenience feature that users could already implement in experiment code if they wanted. One extension which currently isn't possible would be to amend the structure in the saved HDF5 file to remove the prefixed namespace, but keep it in the broadcasted datasets. I haven't looked into the code to see how easy this would be, but it'd mean that analysis scripts don't need to worry about the details of namespacing.

dnadlinger commented 3 years ago

One extension which currently isn't possible would be to amend the structure in the saved HDF5 file to remove the prefixed namespace, but keep it in the broadcasted datasets. I haven't looked into the code to see how easy this would be, but it'd mean that analysis scripts don't need to worry about the details of namespacing.

Since this issue just came up in my notifications again with Charles supposedly commenting 19 hours ago (?!): This is more or less what my current implementation of 2) does; the only difference being that it is a separate Publisher for the broadcasts instead of a key prefix.

charlesbaynham commented 3 years ago

(someone linked to this issue on ARTIQ day so I noticed and corrected a typo in my post - that probably refreshed the notification!)

dnadlinger commented 2 years ago

Closing this now, as I've changed ndscan from what in terms of the above is approach 2 to approach 1 (i.e. explicit RID prefixes and an external cleanup process).

m-labs / artiq

Discussion: Experiment broadcast dataset namespacing #1614