wandnz / streamevmon

Framework and pipeline for time series anomaly detection
GNU General Public License v3.0

Event grouping #36

Open wandgitlabbot opened 3 years ago

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-10-02

It could be useful to have another layer of processing after the detectors that groups events raised at similar times. Some notes:
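As a very rough sketch of the idea in Flink terms: time-based grouping could start as a session window over the merged event stream. The `Event` case class and the 60-second gap below are illustrative placeholders, not streamevmon's real types.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Placeholder event type; streamevmon's real Event class differs.
case class Event(stream: String, time: Long, severity: Int)

// Collect events whose timestamps fall within 60 seconds of each other
// into one group, regardless of which detector raised them.
def groupBySimilarTime(events: DataStream[Event]): DataStream[Seq[Event]] =
  events
    .windowAll(EventTimeSessionWindows.withGap(Time.seconds(60)))
    .process(new ProcessAllWindowFunction[Event, Seq[Event], TimeWindow] {
      override def process(
        context: Context,
        elements: Iterable[Event],
        out: Collector[Seq[Event]]
      ): Unit = out.collect(elements.toSeq)
    })
```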

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-10-04

Richard N mentioned that it would be useful to have topological grouping for alerts. We will have a number of similar streams, such as any amplet -> google, and it would be useful to group alerts when some proportion of those streams are anomalous. We should probably also investigate traceroute data, to give us the following grouping information:

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-10-12

I'm working on implementing AMP traceroute support in order to build a graph of paths between known amplets. Amplets can also provide AS mappings, so I won't implement a way to look those up myself. If I did, I'd want to match the results the amplets give, but the API they use appears to need a key and I don't want to set up support for that yet. Instead, I'll just rely on whatever the amplet sends me; if AS lookups were disabled on their end, we won't provide them either.

Each measurement has a path (aka inet path), and an optional AS path. Multiple inet paths can be paired with a single AS path, but only one AS path can be paired with each inet path (assuming there are no changes in the AS membership of inet addresses over time).
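To make the pairing concrete, here's a minimal sketch of that relationship. These case class names are illustrative stand-ins, not the project's actual types.

```scala
// Illustrative model of the inet path / AS path relationship.
case class InetPath(hops: Seq[String]) // addresses seen by traceroute
case class AsPath(asns: Seq[Int])      // AS numbers for those hops

case class TracerouteMeasurement(
  path: InetPath,
  asPath: Option[AsPath] // None when the amplet has AS lookups disabled
)

// Many-to-one: several inet paths can map to the same AS path, but
// (assuming stable AS membership) an inet path maps to at most one AS path.
def pairPaths(ms: Seq[TracerouteMeasurement]): Map[InetPath, AsPath] =
  ms.flatMap(m => m.asPath.map(m.path -> _)).toMap
```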

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-11

I've made a graph of paths. It occurred to me that it might be possible to build a similar graph using BGP-LS, but we'd need a separate daemon on the target network that we can communicate with to build the graph, since streamevmon itself can run anywhere as long as it's given a data input stream.

One thing that's currently unsupported is using the RTT of traceroute measurements to add edge weights to the graph, due to some confusion about the schema. This doesn't strike me as a big deal though, since the number of hops is a perfectly good distance heuristic for our purposes.

I'm going to try to figure out a structure for the actual topological event correlations. It will probably require some structural changes in the runners, like providing the correlator with a mapping of stream IDs to metadata so it knows where on the graph an event comes from.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-16

I've got a MeasurementMetaExtractor that passes a stream of measurements through unchanged while also emitting a side output of any new relevant MeasurementMeta entries, such that each Meta is only produced once. From here, I have a couple of options for how to use the graph:

We don't have a subscription for regular Traceroute measurements, since they're in Postgres. We might want a new source that provides them! We should also make sure Traceroute measurements don't extend InfluxMeasurement.

I'm going to work this out on paper again; my thoughts are all tangled up.
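For reference, the pass-through-plus-deduplicated-side-output pattern described at the top of this comment might look roughly like the following in Flink. The `Measurement` and `MeasurementMeta` shapes here are simplified stand-ins for the real classes.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Simplified stand-ins for streamevmon's measurement types.
case class Measurement(stream: String, value: Double)
case class MeasurementMeta(stream: String)

// Tag identifying the side output that carries each stream's Meta once.
val metaOutputTag = new OutputTag[MeasurementMeta]("measurement-meta")

class MeasurementMetaExtractor extends ProcessFunction[Measurement, Measurement] {
  // A real operator would keep this set in checkpointed state so the
  // produce-once guarantee survives restarts.
  private val seen = scala.collection.mutable.Set[String]()

  override def processElement(
    m: Measurement,
    ctx: ProcessFunction[Measurement, Measurement]#Context,
    out: Collector[Measurement]
  ): Unit = {
    if (seen.add(m.stream)) {
      // First time we've seen this stream: emit its Meta to the side output.
      ctx.output(metaOutputTag, MeasurementMeta(m.stream))
    }
    out.collect(m) // pass the measurement through unchanged
  }
}
```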

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-16

We don't actually have a regular SourceFunction for traceroute measurements, so we can't yet build a graph as they come. If we were to use the periodic refresh method, the best approach would be to have a timer which periodically fetches new traceroute measurements, obtains the paths they refer to, and passes the resulting AsInetPaths to the GraphBuilder which can add them to the existing graph. Funnily enough, this sounds an awful lot like a series of Flink operators, starting with a SourceFunction that queries PostgreSQL for new traceroute measurements.
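A sketch of what that polling SourceFunction might look like follows. The table and column names are made up for illustration, and a real implementation would also need to put `lastSeen` into checkpointed state.

```scala
import java.sql.DriverManager
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Simplified row type; a real source would construct full Traceroute
// measurements and later the AsInetPaths mentioned above.
case class TracerouteRow(stream: Int, timestamp: Long, pathId: Int)

// Periodically polls PostgreSQL for traceroute measurements newer than
// the last fetch. Table and column names are hypothetical.
class PostgresTracerouteSource(jdbcUrl: String, pollIntervalMs: Long = 60000L)
  extends SourceFunction[TracerouteRow] {

  @volatile private var running = true
  private var lastSeen = 0L

  override def run(ctx: SourceFunction.SourceContext[TracerouteRow]): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    while (running) {
      val stmt = conn.prepareStatement(
        "SELECT stream, timestamp, path_id FROM traceroute WHERE timestamp > ?")
      stmt.setLong(1, lastSeen)
      val rs = stmt.executeQuery()
      while (rs.next()) {
        val row = TracerouteRow(
          rs.getInt("stream"), rs.getLong("timestamp"), rs.getInt("path_id"))
        lastSeen = math.max(lastSeen, row.timestamp)
        // Hold the checkpoint lock so emission doesn't race with checkpoints.
        ctx.getCheckpointLock.synchronized { ctx.collect(row) }
      }
      rs.close(); stmt.close()
      Thread.sleep(pollIntervalMs)
    }
    conn.close()
  }

  override def cancel(): Unit = running = false
}
```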

We will need to adjust the GraphBuilder to add new paths to an existing graph, and probably also remove the references to GraphWalks in Hosts so that the graph can be serialised properly.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-11-23

The graph and its requisite data-extraction dependencies are now proper Flink operators. Hosts no longer know about their GraphWalks, and checkpoints appear to work. Next up is tidying, documenting, and testing before I move on to the actual grouping phase.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-12-04

A data mining analysis of RTID alarms (Stefanos Manganaris, Marvin Christensen, Dan Zerkle, Keith Hermiz) is a paper describing the functionality of IBM's Intelligent Miner for Data, a product offered until around 2002 which is not well-detailed on the modern internet.

The paper describes using the miner to discover useful insights in large quantities of alarm data from many client networks. First, they perform additional layers of anomaly detection on the alarm data, tuned to discover deviations from each sensor's "normal" alarm patterns. Second, they analyse each sensor's outputs and metadata to group the sensors into distinct categories.

The system performed well, detecting all the major events that their human operators flagged, as well as a few more that the authors determined were also major events. Their sensor grouping analysis also produced effective results, finding around five distinct groups of sensors: a large (78%) group of standard sensors, plus several smaller groups. Two of these appeared to be composed of sensors inspecting intranet and internet traffic respectively; one was a network with particularly distinctive characteristics; and the last was a group of sensors whose performance, they determined, could be improved with better tuning.

Unfortunately, while these results are very positive, there appears to be no way to access the system or its code today. Since it was a proprietary IBM product, the only remnants appear to be manuals and changelogs. The product appears to have evolved into a cloud-based ML platform, but shows little of its roots. The concepts described in the paper are still likely to be applicable, but there are no example results or specific algorithms to base an implementation on.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-12-14

#44 relates to using the CAIDA ITDK dataset to perform advanced alias resolution using a large store of known aliases. While implementing this, we came across a case where a self-loop is created: when two addresses with a known link between them are determined to actually be on the same host. We've chosen to drop these self-loops, since we're building a graph of network topology. Certain algorithms might find self-loops useful, since they correspond to a real hop that shows up in traceroute measurements, but such algorithms are likely better served by looking at the measurements instead of the graph.
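As an illustration of that self-loop handling, here is a generic sketch of merging two aliased vertices with JGraphT, whose SimpleGraph rejects self-loops anyway. streamevmon's real graph types are more involved than this.

```scala
import org.jgrapht.graph.{DefaultEdge, SimpleGraph}

// Merge `duplicate` into `canonical` once alias resolution decides the two
// addresses are the same host, dropping any edge that would become a
// self-loop (SimpleGraph would throw if we tried to add one).
def mergeAliases[V](graph: SimpleGraph[V, DefaultEdge], canonical: V, duplicate: V): Unit = {
  // Record the duplicate's neighbours before removing it, since removal
  // also deletes its incident edges.
  val neighbours = scala.collection.mutable.ListBuffer[V]()
  graph.edgesOf(duplicate).forEach { e =>
    val src = graph.getEdgeSource(e)
    neighbours += (if (src == duplicate) graph.getEdgeTarget(e) else src)
  }
  graph.removeVertex(duplicate)
  neighbours.foreach { n =>
    // An edge between the two merged addresses would be a self-loop: skip it.
    if (n != canonical) graph.addEdge(canonical, n)
  }
}
```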

danoost commented 3 years ago

We have a functioning topology graph that should be reasonably up-to-date with the real situation at any given time.

The next step is to place events on the graph.

In terms of distributing the graph to downstream operators that want to interact with it, there seems to be one obvious method with a few variations. Flink supports Broadcast State, which allows passing state to every parallel instance of a downstream operator, regardless of its parallelism.
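A minimal sketch of the broadcast-state variant, with placeholder types standing in for the real graph and event classes:

```scala
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

// Placeholder types standing in for the real graph and event classes.
case class TopologyGraph(edges: Set[(String, String)])
case class Event(stream: String, time: Long)

// A single-entry map holds the latest graph in broadcast state.
val graphDescriptor = new MapStateDescriptor(
  "topology-graph", classOf[java.lang.Integer], classOf[TopologyGraph])

class EventGraphPlacer
  extends BroadcastProcessFunction[Event, TopologyGraph, (Event, TopologyGraph)] {

  override def processElement(
    event: Event,
    ctx: BroadcastProcessFunction[Event, TopologyGraph, (Event, TopologyGraph)]#ReadOnlyContext,
    out: Collector[(Event, TopologyGraph)]
  ): Unit = {
    // Pair each event with the most recent graph this instance has seen.
    val graph = ctx.getBroadcastState(graphDescriptor).get(0)
    if (graph != null) out.collect((event, graph))
  }

  override def processBroadcastElement(
    graph: TopologyGraph,
    ctx: BroadcastProcessFunction[Event, TopologyGraph, (Event, TopologyGraph)]#Context,
    out: Collector[(Event, TopologyGraph)]
  ): Unit = ctx.getBroadcastState(graphDescriptor).put(0, graph)
}
```

Wiring it up would then be something like `events.connect(graphUpdates.broadcast(graphDescriptor)).process(new EventGraphPlacer)`.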

Both variations of this approach have the issue of each operator keeping its own copy of the graph in memory. Since Flink prefers to assume that each operator could be on a separate JVM, the closest we have to avoiding this problem is the state system, which doesn't really come anywhere close. Retaining state in a shared memory location, like an object with a volatile field, doesn't seem like a good solution either: it would become difficult to manage when there really are multiple JVMs, and it breaks a lot of the assumptions Flink makes. It doesn't look like there's a great way to use any shared APIs unless I want to ship the entire graph to a separate daemon, which seems like overkill most of the time.

It seems as though the change-event system will be the most elegant and Flinky. We should implement this in TraceroutePathGraph first, because it'll need to be the one sending the events. I'm unsure what state storage will look like though. If any of the downstream operators need to be recreated, they'll have no way to request an up-to-date graph from upstream, so they'll need to store the entire graph state regardless :( It will reduce the amount of bandwidth between operators though, which is probably worth the effort.
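As a sketch of the shape the change-event approach could take, with illustrative names rather than the real types:

```scala
// Deltas emitted by TraceroutePathGraph instead of whole-graph snapshots.
sealed trait GraphChangeEvent
case class AddVertex(host: String) extends GraphChangeEvent
case class RemoveVertex(host: String) extends GraphChangeEvent
case class AddEdge(from: String, to: String) extends GraphChangeEvent
case class RemoveEdge(from: String, to: String) extends GraphChangeEvent

// Each downstream consumer folds the deltas into its own local copy, which
// is why it must also checkpoint the full graph state itself.
case class LocalGraph(vertices: Set[String], edges: Set[(String, String)]) {
  def updated(e: GraphChangeEvent): LocalGraph = e match {
    case AddVertex(h)     => copy(vertices = vertices + h)
    case RemoveVertex(h)  => copy(
      vertices = vertices - h,
      edges = edges.filterNot(p => p._1 == h || p._2 == h))
    case AddEdge(a, b)    => copy(edges = edges + ((a, b)))
    case RemoveEdge(a, b) => copy(edges = edges - ((a, b)))
  }
}
```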

danoost commented 3 years ago

I've got the "stream of graph change events" system working, in the nz.net.wand.streamevmon.events.grouping.graph package. It took quite a bit of work to iron out all the bugs, but it seems to be a lot better now. It does introduce some extra complexity in the following ways:

With the switch to amp2 measurements, both the existing work on event stream distance and the graph building itself will need some further work.