Support duplicate detectors in UnifiedRunner #24

Closed: wandgitlabbot closed this issue 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-07-07

Currently, the UnifiedRunner can only use a single instance of each detector. This is mostly due to lack of support in the configuration file format - most or all detectors do support using unique config keys, but the logic to specify which key applies to which input stream isn't there yet.

Work has been done to improve the modularity of the UnifiedRunner, which is a good first step. We may need to consider changing the config file format to something like YAML, which would allow us to essentially define entire data-flows from source to sink in a non-code format.

In GitLab, by Daniel Oosterwijk on 2020-07-07

#23 is potentially relevant to this topic as well, since both concern reworking the UnifiedRunner to add more features. This issue describes adding support for several input streams and duplicate detectors working on each stream, while #23 concerns bounded input and output streams, such as those from files of gathered historical data like the ones we're looking at collecting from the ESnet API.

In GitLab, by Daniel Oosterwijk on 2020-07-07

Flink allows keyed operators to store unique state per stream. It could be interesting to look into a way to use a single instance of each detector, but configure its behaviour differently depending on what stream is being operated on at a given time. This would likely come hand-in-hand with Flink's parallelism support, such that there can be many copies of a single operator working in parallel (and potentially on different hosts). Each operator would have one or more streams to work on, and so per-stream configurations would need to be persisted in a parallelism-supporting manner.

This would be a better approach than instantiating multiple operators for the same detector.
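A minimal sketch of the single-instance idea, written in plain Python rather than against Flink's API: one detector object keeps separate configuration and state for each stream key. The class names, the `threshold` parameter, and the stream keys are all hypothetical, and in a real job Flink's keyed state would take the place of the dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Per-stream configuration and running state (hypothetical fields)."""
    threshold: float
    count: int = 0
    mean: float = 0.0


@dataclass
class KeyedDetector:
    """One detector instance serving many streams, keyed by a stream ID."""
    default_threshold: float
    overrides: dict = field(default_factory=dict)  # stream key -> threshold
    states: dict = field(default_factory=dict)     # stream key -> StreamState

    def process(self, stream_key: str, value: float) -> bool:
        """Return True if value is further from the stream's running mean
        than that stream's configured threshold."""
        state = self.states.get(stream_key)
        if state is None:
            # First measurement for this stream: pick up its own config.
            threshold = self.overrides.get(stream_key, self.default_threshold)
            state = self.states[stream_key] = StreamState(threshold)
        is_event = state.count > 0 and abs(value - state.mean) > state.threshold
        # Update the running mean after the check.
        state.count += 1
        state.mean += (value - state.mean) / state.count
        return is_event


detector = KeyedDetector(default_threshold=10.0, overrides={"amp-icmp": 2.0})
```

With this shape, the same jump in value can be an event on one stream and not on another, without a second detector instance.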

We will additionally need a way to tie configurations to input streams. Maybe our sources need to output dummy measurements that actually contain config overrides, in a similar way to what we mention in #23 with dummy measurements containing end-of-file markers.

In GitLab, by Daniel Oosterwijk on 2020-07-13

It looks like YAML and HOCON are the best candidates for a new config file format. More research will need to be done, potentially including just trying both of them out and seeing which one works better. StrictYAML has some good discussions on the drawbacks of other formats, but is only available in Python.

In GitLab, by Daniel Oosterwijk on 2020-07-16

I'm working on reimplementing configuration using YAML on the yaml-configuration branch. All seems well so far, but there are a few notes I'd like to write down for the future.

I want to ship a sensible default config, then allow the user to fully revamp it later.
Since this config includes a dataflow pipeline description, users will want to be able to turn parts of it off as well as just overwriting particular values. In the current draft, the only values in the DAG are sink-enabled booleans.
This means we will need a way to allow the user to delete entire key hierarchies in their own configuration files. Of course, we don't want them to be able to delete keys from outside the flow description tree.
This could be done by handling the 'all sinks disabled' case specially and not creating any of the earlier pipeline elements, but it means that users will have to manually disable every sub-tree they don't want. They might also just want to toss the whole thing out and start over, which will be another special case.
We might want to allow users to use multiple custom config files. These will need to be ordered so they can specify an override hierarchy, so conf/ will need an Apache sites-enabled-style naming scheme where files are prefixed with numbers so they sort predictably. 000-detectors.yaml followed by 500-flows.yaml, for example.
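A rough sketch of how layered overrides with restricted deletion could work, assuming the files have already been parsed into nested dictionaries. It is not the code on the yaml-configuration branch. The `flows` root key, the use of a null value as a deletion marker, and the function names are hypothetical.

```python
DELETE = None        # hypothetical marker: a null value asks for the key to be removed
FLOW_ROOT = "flows"  # hypothetical name of the flow description subtree


def merge(base: dict, override: dict, path: tuple = ()) -> dict:
    """Return a copy of base with override applied on top.

    Deletions are only honoured for keys strictly inside FLOW_ROOT.
    """
    result = dict(base)
    for key, value in override.items():
        full_path = path + (key,)
        if value is DELETE:
            if full_path[0] != FLOW_ROOT or len(full_path) < 2:
                raise ValueError(
                    f"cannot delete {'.'.join(full_path)}: outside the flow tree"
                )
            result.pop(key, None)
        elif isinstance(value, dict):
            existing = result.get(key)
            result[key] = merge(
                existing if isinstance(existing, dict) else {}, value, full_path
            )
        else:
            result[key] = value
    return result


def load_layers(layers: dict) -> dict:
    """Merge parsed config files in filename order, so 000-... applies before 500-...."""
    merged = {}
    for name in sorted(layers):
        merged = merge(merged, layers[name])
    return merged
```

Sorting by filename gives the override hierarchy: a later file can replace single values or remove a whole flow sub-tree, while an attempt to delete a key outside the flow tree fails loudly. Throwing out the entire flow description is left as the separate special case mentioned above.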

In GitLab, by Daniel Oosterwijk on 2020-07-16

Maybe not everything needs to be in one big ParameterTool. We could separate the flow config from the detector configs, and make detectors pull configs from a parameter instead of the GlobalConfiguration. That way, we can retain our nice YAML tree and have more flexibility with traversing it in the UnifiedRunner. We could probably also drop support for environment variable and program argument configuration, though the latter is useful for quick tests in the web interface.

In GitLab, by Daniel Oosterwijk on 2020-07-21

Turns out I missed the obvious consequence of using a YAML tree, in that it won't be able to support a detector with multiple inputs. I think I need to switch to representing a DAG with YAML... What's done so far:

Generate sources and filter their datatypes according to tree structure
Instantiate detectors, and tie their inputs and outputs to the right places
Custom configuration overrides per-node
Dynamic conf/ file loading
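To illustrate why a DAG is needed rather than a tree, here is a small sketch in which each node names the nodes it reads from, so a detector can have more than one input. The node names, the `inputs` field, and the two-input `correlator` detector are hypothetical, and the dictionary stands in for parsed YAML.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical parsed YAML: every node lists the nodes it reads from.
dag = {
    "amp": {"type": "source", "inputs": []},
    "esmond": {"type": "source", "inputs": []},
    "baseline": {"type": "detector", "inputs": ["amp"]},
    "correlator": {"type": "detector", "inputs": ["amp", "esmond"]},
    "influx": {"type": "sink", "inputs": ["baseline", "correlator"]},
}


def build_order(nodes: dict) -> list:
    """Return node names ordered so every node comes after all of its inputs."""
    for name, node in nodes.items():
        for dep in node["inputs"]:
            if dep not in nodes:
                raise KeyError(f"{name} reads from unknown node {dep}")
    # TopologicalSorter takes a mapping of node -> predecessors and raises
    # graphlib.CycleError if the description is not actually acyclic.
    sorter = TopologicalSorter({name: node["inputs"] for name, node in nodes.items()})
    return list(sorter.static_order())


order = build_order(dag)
```

Instantiating nodes in this order means the outputs a detector needs already exist by the time it is wired up.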

In GitLab, by Daniel Oosterwijk on 2020-07-23

The YamlDagRunner is in a good state now. One thing that's not yet implemented is setting Flink UIDs for source filter stages and detectors. A decent UID scheme would be to just chain the names of the visited nodes together. For example, if a source is called amp, then that would be its UID. A baseline detector which takes non-lossy ICMP data would be given the UID amp-icmp-notlossy-baseline.

This would break down if there are multiple instances of the same detector with the same datatype but different config. Perhaps it would be better to throw out readable UIDs entirely and hash the DetectorSchema (which would require hashing the DetectorInstances, and all SourceReferences and SinkReferences).

The readable UID format would still work there for source filter stages, but the builder function would need the name of the source passed to it.
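The two UID options can be compared with a short sketch. The dictionary below is only a stand-in for a DetectorSchema, with hypothetical field names, and SHA-256 over canonical JSON is one possible choice of hash, not necessarily what the YamlDagRunner would use.

```python
import hashlib
import json


def readable_uid(path: list) -> str:
    """Chain the names of the visited nodes together with hyphens."""
    return "-".join(path)


def hashed_uid(schema: dict) -> str:
    """Hash a canonical JSON form of a detector's full description.

    Sorting keys makes the result independent of key order in the config.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two instances of the same detector on the same datatype, differing only in config.
strict = {"detector": "baseline", "sources": ["amp-icmp-notlossy"], "config": {"threshold": 10}}
loose = {"detector": "baseline", "sources": ["amp-icmp-notlossy"], "config": {"threshold": 50}}

same_readable = readable_uid(["amp", "icmp", "notlossy", "baseline"])
distinct_hashes = hashed_uid(strict) != hashed_uid(loose)
```

Both instances would share the readable UID, while the hashed form separates them, at the cost of a UID that changes whenever the instance's config changes.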

In GitLab, by Daniel Oosterwijk on 2020-07-24

We should implement sources for LatencyTS measurements in the YamlDagRunner (#23). It's probably a good idea to also create sources for the archived results of the EsmondHistory grabber, and even consider finishing #21 and making CSV sources/sinks for AMP/other measurements. That should be fairly simple with CsvOutputable.

In GitLab, by Daniel Oosterwijk on 2020-07-29

We've implemented sources for LatencyTS measurements. Closing #23.

In GitLab, by Daniel Oosterwijk on 2020-07-29

The goal of this issue is complete in the YamlDagRunner. Closing.
