wandnz / streamevmon

Framework and pipeline for time series anomaly detection
GNU General Public License v3.0

Event grouping #36

Open wandgitlabbot opened 3 years ago

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-10-02

It could be useful to have another layer of processing after the detectors that groups events raised at similar times. Some notes:
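As a very rough sketch of the idea in Flink terms: time-based grouping could start as a session window over the merged event stream. The `Event` case class and the 60-second gap below are illustrative placeholders, not streamevmon's real types.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Placeholder event type; streamevmon's real Event class differs.
case class Event(stream: String, time: Long, severity: Int)

// Collect events whose timestamps fall within 60 seconds of each other
// into one group, regardless of which detector raised them.
def groupBySimilarTime(events: DataStream[Event]): DataStream[Seq[Event]] =
  events
    .windowAll(EventTimeSessionWindows.withGap(Time.seconds(60)))
    .process(new ProcessAllWindowFunction[Event, Seq[Event], TimeWindow] {
      override def process(
        context: Context,
        elements: Iterable[Event],
        out: Collector[Seq[Event]]
      ): Unit = out.collect(elements.toSeq)
    })
```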

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-10-04

Richard N mentioned that it would be useful to have topological grouping for alerts. We will have a number of similar streams, such as any amplet -> google, and it would be useful to group alerts when some proportion of those streams are anomalous. We should probably also investigate traceroute data, to give us the following grouping information:

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-10-12

I'm working on implementing AMP traceroute support in order to build a graph of paths between known amplets. Amplets can also provide AS mappings, so I won't implement a way to look those up myself. If I did, I'd want to match the results the amplets give, but the API they use appears to need a key and I don't want to set up support for that yet. Instead, I'll just rely on whatever the amplet sends me; if AS lookups were disabled on their end, we won't provide them either.

Each measurement has a path (aka inet path), and an optional AS path. Multiple inet paths can be paired with a single AS path, but only one AS path can be paired with each inet path (assuming there are no changes in the AS membership of inet addresses over time).
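To make the pairing concrete, here's a minimal sketch of that relationship. These case class names are illustrative stand-ins, not the project's actual types.

```scala
// Illustrative model of the inet path / AS path relationship.
case class InetPath(hops: Seq[String]) // addresses seen by traceroute
case class AsPath(asns: Seq[Int])      // AS numbers for those hops

case class TracerouteMeasurement(
  path: InetPath,
  asPath: Option[AsPath] // None when the amplet has AS lookups disabled
)

// Many-to-one: several inet paths can map to the same AS path, but
// (assuming stable AS membership) an inet path maps to at most one AS path.
def pairPaths(ms: Seq[TracerouteMeasurement]): Map[InetPath, AsPath] =
  ms.flatMap(m => m.asPath.map(m.path -> _)).toMap
```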

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-11

I've made a graph of paths. It occurred to me that it might be possible to build a similar graph using BGP-LS, but we'd need a separate daemon on the target network that we can communicate with to build the graph, since streamevmon itself can run anywhere as long as it's given a data input stream.

One thing that's currently unsupported is using the RTT of traceroute measurements to add edge weights to the graph, due to some confusion about the schema. This doesn't strike me as a big deal though, since the number of hops is a perfectly good distance heuristic for our purposes.

I'm going to try to figure out a structure for the actual topological event correlations. It will probably require some structural changes in the runners, like providing the correlator with a mapping of stream IDs to metadata so it knows where on the graph an event comes from.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-16

I've got a MeasurementMetaExtractor that passes a stream of measurements through unchanged while also emitting a side output of any new relevant MeasurementMeta entries, such that each Meta is only produced once. From here, I have a couple of options for how to use the graph:

We don't have a subscription for regular Traceroute measurements, since they're in Postgres. We might want a new source that provides them! We should also make sure Traceroute measurements don't extend InfluxMeasurement.

I'm going to work this out on paper again; my thoughts are all tangled up.
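For reference, the pass-through-plus-deduplicated-side-output pattern described at the top of this comment might look roughly like the following in Flink. The `Measurement` and `MeasurementMeta` shapes here are simplified stand-ins for the real classes.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Simplified stand-ins for streamevmon's measurement types.
case class Measurement(stream: String, value: Double)
case class MeasurementMeta(stream: String)

// Tag identifying the side output that carries each stream's Meta once.
val metaOutputTag = new OutputTag[MeasurementMeta]("measurement-meta")

class MeasurementMetaExtractor extends ProcessFunction[Measurement, Measurement] {
  // A real operator would keep this set in checkpointed state so the
  // produce-once guarantee survives restarts.
  private val seen = scala.collection.mutable.Set[String]()

  override def processElement(
    m: Measurement,
    ctx: ProcessFunction[Measurement, Measurement]#Context,
    out: Collector[Measurement]
  ): Unit = {
    if (seen.add(m.stream)) {
      // First time we've seen this stream: emit its Meta to the side output.
      ctx.output(metaOutputTag, MeasurementMeta(m.stream))
    }
    out.collect(m) // pass the measurement through unchanged
  }
}
```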

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-16

We don't actually have a regular SourceFunction for traceroute measurements, so we can't yet build a graph as they come. If we were to use the periodic refresh method, the best approach would be to have a timer which periodically fetches new traceroute measurements, obtains the paths they refer to, and passes the resulting AsInetPaths to the GraphBuilder which can add them to the existing graph. Funnily enough, this sounds an awful lot like a series of Flink operators, starting with a SourceFunction that queries PostgreSQL for new traceroute measurements.
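A sketch of what that polling SourceFunction might look like follows. The table and column names are made up for illustration, and a real implementation would also need to put `lastSeen` into checkpointed state.

```scala
import java.sql.DriverManager
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Simplified row type; a real source would construct full Traceroute
// measurements and later the AsInetPaths mentioned above.
case class TracerouteRow(stream: Int, timestamp: Long, pathId: Int)

// Periodically polls PostgreSQL for traceroute measurements newer than
// the last fetch. Table and column names are hypothetical.
class PostgresTracerouteSource(jdbcUrl: String, pollIntervalMs: Long = 60000L)
  extends SourceFunction[TracerouteRow] {

  @volatile private var running = true
  private var lastSeen = 0L

  override def run(ctx: SourceFunction.SourceContext[TracerouteRow]): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    while (running) {
      val stmt = conn.prepareStatement(
        "SELECT stream, timestamp, path_id FROM traceroute WHERE timestamp > ?")
      stmt.setLong(1, lastSeen)
      val rs = stmt.executeQuery()
      while (rs.next()) {
        val row = TracerouteRow(
          rs.getInt("stream"), rs.getLong("timestamp"), rs.getInt("path_id"))
        lastSeen = math.max(lastSeen, row.timestamp)
        // Hold the checkpoint lock so emission doesn't race with checkpoints.
        ctx.getCheckpointLock.synchronized { ctx.collect(row) }
      }
      rs.close(); stmt.close()
      Thread.sleep(pollIntervalMs)
    }
    conn.close()
  }

  override def cancel(): Unit = running = false
}
```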

We will need to adjust the GraphBuilder to add new paths to an existing graph, and probably also remove the references to GraphWalks in Hosts so that the graph can be serialised properly.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-23

The graph and its requisite data-extraction dependencies are now proper Flink operators. Hosts no longer know about their GraphWalks, and checkpoints appear to work. Next up is tidying, documenting, and testing before I move on to the actual grouping phase.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-12-04

A data mining analysis of RTID alarms (Stefanos Manganaris, Marvin Christensen, Dan Zerkle, Keith Hermiz) is a paper describing the functionality of IBM's Intelligent Miner for Data, a product offered until around 2002 which is not well-detailed on the modern internet.

The paper describes using the miner to discover useful insights in large quantities of alarm data from many client networks. First, they perform additional layers of anomaly detection on the alarm data, tuned to discover deviations from each sensor's "normal" alarm patterns. Second, they analyse each sensor's outputs and metadata to group the sensors into distinct categories.

The system performed well, detecting all the major events that their human operators flagged, as well as a few more that the authors determined were also major events. Their sensor grouping analysis also produced effective results, finding around five distinct groups of sensors: a large (78%) group of standard sensors, plus several smaller groups. Two of these appeared to be composed of sensors inspecting intranet and internet traffic respectively; one was a network with particularly distinctive characteristics; and the last was a group of sensors whose performance, they determined, could be improved with better tuning.

Unfortunately, while these results are very positive, there appears to be no way to access the system or its code today. Since it was a proprietary IBM product, the only remnants appear to be manuals and changelogs. The product appears to have evolved into a cloud-based ML platform, but shows little of its roots. The concepts described in the paper are still likely to be applicable, but there are no example results or specific algorithms to base an implementation on.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-12-14

#44 relates to using the CAIDA ITDK dataset to perform advanced alias resolution using a large store of known aliases. While implementing this, we came across a case where a self-loop is created: when two addresses with a known link between them are determined to actually be on the same host. We've chosen to drop these self-loops, since we're building a graph of network topology. Certain algorithms might find self-loops useful, since they correspond to a real hop that shows up in traceroute measurements, but such algorithms are likely better served by looking at the measurements instead of the graph.
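As an illustration of that self-loop handling, here is a generic sketch of merging two aliased vertices with JGraphT, whose SimpleGraph rejects self-loops anyway. streamevmon's real graph types are more involved than this.

```scala
import org.jgrapht.graph.{DefaultEdge, SimpleGraph}

// Merge `duplicate` into `canonical` once alias resolution decides the two
// addresses are the same host, dropping any edge that would become a
// self-loop (SimpleGraph would throw if we tried to add one).
def mergeAliases[V](graph: SimpleGraph[V, DefaultEdge], canonical: V, duplicate: V): Unit = {
  // Record the duplicate's neighbours before removing it, since removal
  // also deletes its incident edges.
  val neighbours = scala.collection.mutable.ListBuffer[V]()
  graph.edgesOf(duplicate).forEach { e =>
    val src = graph.getEdgeSource(e)
    neighbours += (if (src == duplicate) graph.getEdgeTarget(e) else src)
  }
  graph.removeVertex(duplicate)
  neighbours.foreach { n =>
    // An edge between the two merged addresses would be a self-loop: skip it.
    if (n != canonical) graph.addEdge(canonical, n)
  }
}
```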

danoost commented 3 years ago

We have a functioning topology graph that should be reasonably up-to-date with the real situation at any given time.

The next step is to place events on the graph.

In terms of distributing the graph to downstream operators that want to interact with it, there seems to be one obvious method with a few variations. Flink supports Broadcast State, which allows passing state to every parallel instance of a downstream operator, regardless of its parallelism.
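A minimal sketch of the broadcast-state variant, with placeholder types standing in for the real graph and event classes:

```scala
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

// Placeholder types standing in for the real graph and event classes.
case class TopologyGraph(edges: Set[(String, String)])
case class Event(stream: String, time: Long)

// A single-entry map holds the latest graph in broadcast state.
val graphDescriptor = new MapStateDescriptor(
  "topology-graph", classOf[java.lang.Integer], classOf[TopologyGraph])

class EventGraphPlacer
  extends BroadcastProcessFunction[Event, TopologyGraph, (Event, TopologyGraph)] {

  override def processElement(
    event: Event,
    ctx: BroadcastProcessFunction[Event, TopologyGraph, (Event, TopologyGraph)]#ReadOnlyContext,
    out: Collector[(Event, TopologyGraph)]
  ): Unit = {
    // Pair each event with the most recent graph this instance has seen.
    val graph = ctx.getBroadcastState(graphDescriptor).get(0)
    if (graph != null) out.collect((event, graph))
  }

  override def processBroadcastElement(
    graph: TopologyGraph,
    ctx: BroadcastProcessFunction[Event, TopologyGraph, (Event, TopologyGraph)]#Context,
    out: Collector[(Event, TopologyGraph)]
  ): Unit = ctx.getBroadcastState(graphDescriptor).put(0, graph)
}
```

Wiring it up would then be something like `events.connect(graphUpdates.broadcast(graphDescriptor)).process(new EventGraphPlacer)`.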

Both variations of this approach have the issue of each operator keeping its own copy of the graph in memory. Since Flink prefers to assume that each operator could be on a separate JVM, the closest we have to avoiding this problem is the state system, which doesn't really come anywhere close. Retaining state in a shared memory location, like an object with a volatile field, doesn't seem like a good solution either: it would become difficult to manage when there really are multiple JVMs, and it breaks a lot of the assumptions Flink makes. It doesn't look like there's a great way to use any shared APIs unless I want to ship the entire graph to a separate daemon, which seems like overkill most of the time.

It seems as though the change-event system will be the most elegant and Flinky. We should implement this in TraceroutePathGraph first, because it'll need to be the one sending the events. I'm unsure what state storage will look like though. If any of the downstream operators need to be recreated, they'll have no way to request an up-to-date graph from upstream, so they'll need to store the entire graph state regardless :( It will reduce the amount of bandwidth between operators though, which is probably worth the effort.
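As a sketch of the shape the change-event approach could take, with illustrative names rather than the real types:

```scala
// Deltas emitted by TraceroutePathGraph instead of whole-graph snapshots.
sealed trait GraphChangeEvent
case class AddVertex(host: String) extends GraphChangeEvent
case class RemoveVertex(host: String) extends GraphChangeEvent
case class AddEdge(from: String, to: String) extends GraphChangeEvent
case class RemoveEdge(from: String, to: String) extends GraphChangeEvent

// Each downstream consumer folds the deltas into its own local copy, which
// is why it must also checkpoint the full graph state itself.
case class LocalGraph(vertices: Set[String], edges: Set[(String, String)]) {
  def updated(e: GraphChangeEvent): LocalGraph = e match {
    case AddVertex(h)     => copy(vertices = vertices + h)
    case RemoveVertex(h)  => copy(
      vertices = vertices - h,
      edges = edges.filterNot(p => p._1 == h || p._2 == h))
    case AddEdge(a, b)    => copy(edges = edges + ((a, b)))
    case RemoveEdge(a, b) => copy(edges = edges - ((a, b)))
  }
}
```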

danoost commented 3 years ago

I've got the "stream of graph change events" system working, in the nz.net.wand.streamevmon.events.grouping.graph package. It took quite a bit of work to iron out all the bugs, but it seems to be a lot better now. It does introduce some extra complexity in the following ways:

With the switch to amp2 measurements, both the existing work on event stream distance and the graph building itself will need some further work.