Closed wandgitlabbot closed 3 years ago
In GitLab, by Daniel Oosterwijk on 2020-07-07
Flink allows keyed operators to store unique state per stream. It could be interesting to look into a way to use a single instance of each detector, but configure its behaviour differently depending on what stream is being operated on at a given time. This would likely come hand-in-hand with Flink's parallelism support, such that there can be many copies of a single operator working in parallel (and potentially on different hosts). Each operator would have one or more streams to work on, and so per-stream configurations would need to be persisted in a parallelism-supporting manner.
This would be a better approach than instantiating multiple operators for the same detector.
We will additionally need a way to tie configurations to input streams. Maybe our sources need to output dummy measurements that actually contain config overrides, in a similar way to what we mention in #23 with dummy measurements containing end-of-file markers.
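As a rough illustration of the idea above, here is a minimal plain-Python sketch of one detector instance whose behaviour depends on the stream it is handling, with config overrides arriving in-band as special records. It does not use the Flink API. ThresholdDetector, ConfigOverride, and the "threshold" setting are hypothetical names invented for this example. In Flink, the per-stream dictionary would presumably live in keyed state rather than on the object.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class Measurement:
    stream: str
    value: float


@dataclass
class ConfigOverride:
    """Hypothetical dummy record that carries config for one stream."""
    stream: str
    settings: Dict[str, float]


@dataclass
class ThresholdDetector:
    """A single detector instance; effective config is looked up per stream."""
    defaults: Dict[str, float]
    per_stream: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def process(self, item: Union[Measurement, ConfigOverride]) -> Optional[str]:
        if isinstance(item, ConfigOverride):
            # Remember the override for this stream only; emit nothing.
            self.per_stream.setdefault(item.stream, {}).update(item.settings)
            return None
        # Stream-specific settings win over the detector-wide defaults.
        config = {**self.defaults, **self.per_stream.get(item.stream, {})}
        if item.value > config["threshold"]:
            return f"event on {item.stream}: {item.value} > {config['threshold']}"
        return None


detector = ThresholdDetector(defaults={"threshold": 100.0})
items = [
    ConfigOverride("stream-b", {"threshold": 10.0}),
    Measurement("stream-a", 50.0),  # below the default threshold of 100
    Measurement("stream-b", 50.0),  # above stream-b's overridden threshold
]
events = [e for e in (detector.process(i) for i in items) if e is not None]
```

Only stream-b produces an event here, because its override lowers the threshold while stream-a keeps the default.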
In GitLab, by Daniel Oosterwijk on 2020-07-13
It looks like YAML and HOCON are the best candidates for a new config file format. More research will need to be done, potentially including just trying both of them out and seeing which one works better. StrictYAML has some good discussions on the drawbacks of other formats, but is only available in Python.
In GitLab, by Daniel Oosterwijk on 2020-07-16
I'm working on reimplementing configuration using YAML on the yaml-configuration branch. All seems well so far, but there are a few notes I'd like to write down for the future.
conf/ will need an Apache sites-enabled style naming scheme where files are prefixed with numbers to allow for easier sorting. 000-detectors.yaml followed by 500-flows.yaml, for example.
In GitLab, by Daniel Oosterwijk on 2020-07-16
Maybe not everything needs to be in one big ParameterTool. We could separate the flow config from the detector configs, and make detectors pull configs from a parameter instead of the GlobalConfiguration. That way, we can retain our nice YAML tree and have more flexibility with traversing it in the UnifiedRunner. We could probably also drop support for configuration via environment variables and program arguments, though the latter is useful for quick tests in the web interface.
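A minimal sketch of that separation, in plain Python with a dict standing in for the parsed YAML tree. The key names ("detectors", "flows", "config") and the BaselineDetector class are assumptions made up for this example, not the project's actual schema. It also shows how number-prefixed file names give a predictable load order.

```python
from typing import Any, Dict, List

# Number-prefixed names sort lexicographically into the intended load order.
config_files = ["500-flows.yaml", "000-detectors.yaml"]
load_order = sorted(config_files)

# Hypothetical merged tree, as it might look after loading those files.
config_tree: Dict[str, Any] = {
    "detectors": {"baseline": {"threshold": 25, "window": 50}},
    "flows": [
        {"source": "amp", "detector": "baseline", "config": {"threshold": 40}},
        {"source": "esmond", "detector": "baseline"},
    ],
}


def detector_config(tree: Dict[str, Any], flow: Dict[str, Any]) -> Dict[str, Any]:
    """Detector defaults, overridden by any config given on the flow itself."""
    defaults = tree["detectors"].get(flow["detector"], {})
    return {**defaults, **flow.get("config", {})}


class BaselineDetector:
    """Takes its config as a constructor parameter, not from a global."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.threshold = config["threshold"]
        self.window = config["window"]


resolved: List[Dict[str, Any]] = [
    detector_config(config_tree, flow) for flow in config_tree["flows"]
]
detectors = [BaselineDetector(c) for c in resolved]
```

Passing the resolved config in as a parameter means two instances of the same detector can run with different settings, which a single global configuration cannot express.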
In GitLab, by Daniel Oosterwijk on 2020-07-21
Turns out I missed the obvious consequence of using a YAML tree, in that it won't be able to support a detector with multiple inputs. I think I need to switch to representing a DAG with yaml... What's done so far:
In GitLab, by Daniel Oosterwijk on 2020-07-23
The YamlDagRunner is in a good state now. One thing that's not yet implemented is setting Flink UIDs for source filter stages and detectors. A decent UID schema would be to just chain the names of the visited nodes together. For example, if a source is called amp, then that would be its UID. A baseline detector which takes non-lossy ICMP data would be given the UID amp-icmp-notlossy-baseline.
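The chained-name scheme could be sketched like this. The readable_uid helper is a hypothetical name, and the hyphen separator is taken from the example UID above.

```python
from typing import Sequence


def readable_uid(visited: Sequence[str]) -> str:
    """Join the names of the nodes visited on the way to an operator."""
    return "-".join(visited)


source_uid = readable_uid(["amp"])
detector_uid = readable_uid(["amp", "icmp", "notlossy", "baseline"])
```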
This would break down if there are multiple instances of the same detector with the same datatype but different config. Perhaps it would be better to throw out readable UIDs entirely and hash the DetectorSchema (which would require hashing the DetectorInstances, and all SourceReferences and SinkReferences).
The readable UID format would still work there for source filter stages, but the builder function would need the name of the source passed to it.
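One way the hashed variant might look, as a sketch only: serialise the schema to a canonical form and hash that. The dict layout below is a hypothetical stand-in for a DetectorSchema, and the sink name, SHA-256, and the 16-character truncation are arbitrary choices for illustration.

```python
import hashlib
import json
from typing import Any, Dict


def schema_uid(schema: Dict[str, Any]) -> str:
    """Stable UID derived from a canonical JSON form of the schema."""
    # sort_keys makes the result independent of dict insertion order.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two instances of the same detector on the same datatype, differing only
# in config. A chained-name UID would be identical for both.
schema_a = {
    "detector": "baseline",
    "sources": [{"name": "amp", "datatype": "icmp", "filter": "notlossy"}],
    "sinks": ["influx"],  # hypothetical sink name
    "config": {"threshold": 25},
}
schema_b = {**schema_a, "config": {"threshold": 40}}

uid_a = schema_uid(schema_a)
uid_b = schema_uid(schema_b)
```

The trade-off is that any config change produces a new UID, so state saved under the old UID would no longer be matched to the operator.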
In GitLab, by Daniel Oosterwijk on 2020-07-24
We should implement sources for LatencyTS measurements in the YamlDagRunner (#23). It's probably a good idea to also create sources for the archived results of the EsmondHistory grabber, and even consider finishing #21 and making CSV sources/sinks for AMP/other measurements. That should be fairly simple with CsvOutputable.
In GitLab, by Daniel Oosterwijk on 2020-07-29
We've implemented sources for LatencyTS measurements. Closing #23.
In GitLab, by Daniel Oosterwijk on 2020-07-29
The goal of this issue is complete in the YamlDagRunner. Closing.
In GitLab, by Daniel Oosterwijk on 2020-07-07
Currently, the UnifiedRunner can only use a single instance of each detector. This is mostly due to lack of support in the configuration file format - most or all detectors do support using unique config keys, but the logic to specify which key applies to which input stream isn't there yet.
Work has been done to improve the modularity of the UnifiedRunner, which is a good first step. We may need to consider changing the config file format to something like YAML, which would allow us to essentially define entire data-flows from source to sink in a non-code format.
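To illustrate what defining a data-flow as data rather than code might involve, here is a small sketch with a Python dict standing in for a parsed config file. The node names, the "type" and "inputs" keys, and the correlator detector are all hypothetical. Because a node may list several inputs, the structure is a graph rather than a tree, so the build order is computed with a topological sort (graphlib, Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Any, Dict

# Hypothetical flow definition: each node names the nodes it reads from.
flow: Dict[str, Dict[str, Any]] = {
    "amp": {"type": "source", "inputs": []},
    "esmond": {"type": "source", "inputs": []},
    "baseline": {"type": "detector", "inputs": ["amp"]},
    "correlator": {"type": "detector", "inputs": ["amp", "esmond"]},
    "print": {"type": "sink", "inputs": ["baseline", "correlator"]},
}

# TopologicalSorter expects a mapping of node -> predecessors.
graph = {name: node["inputs"] for name, node in flow.items()}

# Every node appears after all of its inputs, so operators can be
# constructed in this order with their upstream streams already built.
build_order = list(TopologicalSorter(graph).static_order())
```

A cycle in the definition would raise graphlib.CycleError, which gives a natural place to reject invalid config files.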