originalsouth commented 7 months ago

The Future of Octopoes

Introduction

Octopoes is the very core of OpenKAT's functionality. It contains all the relevant data gathered whether by declarations (by user input), observation (by Boefjes), inference (by BIT's), or affirmation (by Hydration); together with a possible origin and scan-profile data. Data objects, so called OOI's, origins, and scan-profile data, are modelled in Octopoes by structures that carry the relevant data in field together with references so to point to other data objects. Hence, together these data objects form a graph. On this graph logic can be applied, such as logical inference (through BIT's), neighbor queries, path queries etc. The reason OpenKAT needs this functionality is because OpenKAT seeks to find all relevant combinatorical "attack" vectors of an object under analysis. Currently, octopoes offloads its data objects to XTDB which is essentially a bitemporal and referencing layer on a database storage implementation like RocksDB.

At the core, the architecture of Octopoes works as follows: the Octopoes server, controlled by a client, handles ingesting and querying OOI's. All the graph and inference logic is done within Octopoes and queries (with possible reference linking) are done on XTDB. XTDB also works with a client--server model. OpenKAT uses the HTTP interface that either uses JSON or EDN. All advanced queries are done using EDN over HTTP.

The implementation of Octopoes has been very successful as it is the main building block of OpenKAT. As the OpenKAT ecosystem has heavily changed, needs, concerns, and problems have emerged from the current Octopoes implementation.

Needs, Concerns, and problems in Octopoes

Calculating BIT's is slow ~4s per ~10^2 OOI's for about ~30 BIT's.
BIT's might not be properly calculated requiring possible recalculation.
BIT's can not be enabled, disabled, or altered; they come preselected.
A large number of OOI's O(10^3) slows down Octopoes significantly.
XTDB is no so common database with little tooling.
XTDB is not a graph database.
XTDB is not a stable project.
Querying in XTDB using is non-trivial.
There is no trivial visualisation of XTDB nor Octopoes making it hard to assess what is actually going on within them (and within the graphs).

Learning from Octopoes

Assuming the average size of an OOI is 10kB, 10^6 OOI's would be ~10GB something an average modern computer can hold in work memory. Larger models could either be compartimentalized or memory mapped.
BIT's are typically tiny functions that make mappings from N OOI's to M OOI's, execution of a single bit should typically in within a microsecond.
A type system reduces the number of BIT's executed in a model.
BIT inferences can in principle be processed in parallel on lightweight processors like GPU's.
As BIT's describe transformation of OOI's they can easily be provided by a user using either
1. a lightweight embeddable scripting language eg. lua, javascript, micropython, daslang, etc.
2. a compiled shared object.
Visualisations of the OOI graphs is a vital tool for both development as well as enhancer for the user experience.
To minimize BIT inferences, subgraphs (ie. graphlets) can be "consolidated" and merged instead of recalculated.
Given a set of nodes and edges, graphs or graphlets can be easily visualized with various open source eg. GraphViz, Plotly, etc.

Novempuss

Novempuss is a prototype building forth on teaching of Octopoes with the aim of eventually bringing it to the next level. As a prototype, Novempuss is currently written in Julia and hosted at https://github.com/originalsouth/Novempuss.

It is designed to be an atomic version of Octopoes where graphlets (subgraphs) are worked on individually and combined. In principle be processed in parallel.

Architecture

Novempuss implements graphlets as essentially a set of nodes and directed edges, with the edges carrying generator data. Upon creation of a graphlets all the logical inferences driving is graph creation is inferred recursively and consolidated. Many graph operations can be implemented locally graphlet or a set of graphlets. Graphlets can be merged and differed (added and subtracted) to form new graphlets. A (logical) graph database, is then a collection of roots (initial nodes) and graphlets. Deriving information from the full graph, graph/path queries can be done on a graphlet formed by the relevant roots. Integrating this into the Julia ecosystem, making use of existing an existing common graph library, many routines are available for graphlets. In addition graphlets can easily be visualized, building on the existing plotting infrastructure.

This design addresses most the issues currently present in Octopoes:

On a simple laptop it runs about 8*10^4 BIT executions per second.
It consolidates the graphlets on creation.
It allows users to implement, add and remove BIT's.
It handles a large number of OOI's with ease (10^5).
It does not rely on any external database (like XTDB).
It can be visualized in various plotting libraries.

In addition:

It handles BIT's from N to M objects.
I makes use of a type system to reduce BIT executions.
It is not parallel implemented, but can be with minor modifications.
It allows users to implement, add and remove BIT's.
It uses/relies on community maintained graph routines.

Beyond Octopoes

Learning from both Octopoes and the Novumpuss prototype we can sketch a roadmap towards a new and improved Octopoes.

Although Graph databases are abundant, logical inference within a graph database seems rather unique to Octopoes/Novempuss (or is otherwise somewhat limited, see https://neo4j.com/labs/neosemantics/4.0/inference/). A directed graph can easily be implemented with by a set of nodes with a set of edges connecting these nodes. Origins can be registered as "meta data" next to the edges (as for instance a dictionary would). Logical inference is achieved by recursively running BIT's on the acquired objects. This is the essential core of a logical inference graph database. Querying is done by traversing the inferred graph according to query specifications. Novempuss optimizes the graph database by breaking up the graph into graphlets.

A roadmap

Migrating from Octopoes to a novel solution encompassing the findings of the Novempuss prototype will be sizable task where certain decisions will need to be made.

What language do we want to write the improved Octopoes in (see below)
Is XTDB still the database of choice or can we directly work out of memory, or perhaps additionally use RocksDB -- or similar -- for bigger systems?

With these questions answered a roadmap towards a improved Octopoes can be achieved as such:

[ ] Implement the Novempuss architecture (in language of choice)
- [ ] With documentation
- [ ] Testing with high coverage
- [ ] Benchmarking (for continuous optimization)
- [ ] Examples with visualizations
- [ ] With bindings to Python (if not Python already)
[ ] Adapt and integrate the KAT model and OOI's into the improved Octopoes
- [ ] Convert all the BIT's
- [ ] Rewrite the Octopoes client for the new implementation
[ ] Test and improve

Language

As mentioned the current implementation of Novempuss is written in Julia. This was done both for productivity and performance reasons as Julia is designed to optimize for both worlds, see https://scientificcoder.com/how-to-solve-the-two-language-problem. That said, ideally, writing a full fledged database optimized for performance should be done in properly compiled language such as C++ or Rust -- CUDA/OpenCL or the like, can be added for acceleration. Either languages have excellent graph support.

Implementing the database python can be done and there are even moderately fast libraries available like:

https://github.com/networkx/networkx (although see https://stackoverflow.com/questions/77274732/why-is-networkxs-dijkstra-faster-than-boosts/77275293)
https://github.com/Qiskit/rustworkx (written in Rust)

There are two issues to consider:

Python does not have a type system: as OOI's--BIT's are inherently typed significant speedups can be achieved by minimizing the number of BIT's without boiler-plating a type system upon Python.
Python is significantly slower than many alternatives see for instance https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/mandelbrot.html.

When performance is a concern, Python is far from ideal. This is why many Python libraries are written in C/C++/Rust and the like.

What could be done is forking https://github.com/Qiskit/rustworkx and adapt it to the needs, rustworkx itself is however written in Rust.

originalsouth commented 7 months ago

Today we mainly discussed mostly two things:

We want keep XTDB because of bitemporality reasons,
- Maybe we can write generic code that can talk to any (appropriate) database solution (with a connector), so to benchmark and the like,
We will explore the possibility of building our Novempuss model on top of https://github.com/Qiskit/rustworkx,
Explore possibilities of a "TikTok"/tic-toc model where we iteratively build on database queries.

originalsouth commented 7 months ago

Turns out that https://github.com/Qiskit/rustworkx is built on https://github.com/petgraph/petgraph, so we might as well fork directly on that -- a Rust project with a Python interface.

dekkers commented 7 months ago

Turns out that https://github.com/Qiskit/rustworkx is built on https://github.com/petgraph/petgraph, so we might as well fork directly on that -- a Rust project with a Python interface.

But then we would have to re-invent rustworkx to be able to use it from python? And do we really need to fork rustworkx instead of extending it without forking?

originalsouth commented 7 months ago

But then we would have to re-invent rustworkx to be able to use it from python?

Rustnetworkx is just a networkx interface for python built on petgraph. We will have to re-invent part of either one anyway in order built in the inference, merging routines, etc.

And do we really need to fork rustworkx instead of extending it without forking?

I think logical inference and merging is something beyond a graph library such as rustnetworkx or petgraph.

We could opt writing all our logic in python and have all graph routines in rustnetworkx/networkx. The bottlenecks are in the in the inferences, internal data-structures and queries, so I project that in that case we will reinvent Octopoes with some novel algorithms and some offloaded graph logic. I am fine with that route, there is no real need for anything, but if performance is a requirement, I would advice against it.

dekkers commented 7 months ago

And do we really need to fork rustworkx instead of extending it without forking?

I think logical inference and merging is something beyond a graph library such as rustnetworkx or petgraph.

That we need to write our own code to do this doesn't mean we have to fork the library. Why can't we just do it by for example adding our own traits in Rust or subclassing in Python? I don't understand why we can't use the existing graph datastructures and code and need to fork the library and make changes to them.

originalsouth commented 7 months ago

What I mean by "forking" is that we will have a rustnetworkx clone as a starting point and adapt it to our needs -- and offload all the applicable graph logic to petgraph. We will not have to maintain petgraph, nor rustnetworkx, if that is your concern. As I see it, the implementation will be very similar to Novempuss but then written in Rust with a python interface, similar to rustnetworkx.

r3boot commented 7 months ago

While I appreciate the idea of writing your own graph database, I fail to see how this can be used in a highly available and scalable production environment. Furthermore, I dont think that the scope of OpenKAT should contain its own database engine, nor the forking of libraries to make such a database engine work.

Or, to put it in other words, why is (investing in) novempuss better then using any of the currently available graph databases?

originalsouth commented 7 months ago

Or, to put it in other words, why is (investing in) novempuss better then using any of the currently available graph databases?

If you can find an off the shelve inferencing graph data base that suits our needs, we will be all ears.

r3boot commented 7 months ago

A list can be found on https://en.wikipedia.org/wiki/Graph_database, and I am missing a comparison of these systems with novempuss, which would help with the product selection.

Personally, I would like to know why writing your own database engine is more useful then learning/using an off-the-shelf component. Furthermore, this will be a daunting task, especially considering the scale at which KAT will be deployed. How can this product be scaled to multiple systems? How does it scale to multiple datacenters? And above all, how fast can we go to production using this system?

originalsouth commented 7 months ago

What we would like is an atomic graph databases on which various programmable "business rules" can do recursive inference. As far as we are aware there is no such solution on the market. The reason OpenKAT chose such a route is to -- at least theoretically -- delve through viable combinatorical attack vectors.

As for the latter concern, as this ticket mentions multiple times, the storage solution will be offloaded either to XTDB for bitemporality concerns, or directly to RocksDB which can in turn be distributed https://github.com/rockset/rocksdb-cloud or we can choose to use "buckets" directly... in my opinion, it is really not the concern of the matter.

r3boot commented 7 months ago

What other ways are there to implement these business rules? Why should OpenKAT be the pioneer for such a technique? What other ways are there to detect these attack vectors? And does the implementation of this feature warrant delays in getting KAT ready for large-scale scanning?

For instance, Neo4J does inference out of the box (eg: https://neo4j.com/labs/neosemantics/4.0/inference/) and has both horizontal and vertical scaling. Why is this tool unsuitable for what KAT is trying to achieve? Why is any other existing graph database not suitable for this task?

originalsouth commented 7 months ago

@r3boot, thanks for you all your questions and concerns. I invite you to read https://docs.openkat.nl/developer_documentation/octopoes_research.html as an elaboration of what OpenKAT wants with Octopoes/Novempuss (and beyond) and why this beneficial for such a "Knowledge Acquisition Tool".

We are currently working on making the Octopoes architecture such that it can handle a large number of objects with high throughput (optionally attached to a large clustered storage). This ticket is discussion ticket towards that goal.

Regarding neo4j, we are aware of its capabilities, and are taking it into consideration. At this point in time XTDB still has our preference because of its bi-temporality reasons we would like to support in the future.

minvws / nl-kat-coordination