originalsouth opened 7 months ago
Today we mainly discussed two things:
Turns out that https://github.com/Qiskit/rustworkx is built on https://github.com/petgraph/petgraph, so we might as well fork that directly -- a Rust project with a Python interface.
> Turns out that https://github.com/Qiskit/rustworkx is built on https://github.com/petgraph/petgraph, so we might as well fork that directly -- a Rust project with a Python interface.
But then we would have to re-invent rustworkx to be able to use it from Python? And do we really need to fork rustworkx instead of extending it without forking?
> But then we would have to re-invent rustworkx to be able to use it from Python?
rustworkx is just a networkx-style interface for Python built on petgraph. We will have to re-invent part of either one anyway in order to build in the inference, merging routines, etc.
> And do we really need to fork rustworkx instead of extending it without forking?
I think logical inference and merging are something beyond a graph library such as rustworkx or petgraph.
We could opt for writing all our logic in Python and have all graph routines in rustworkx/networkx. The bottlenecks are in the inferences, internal data structures, and queries, so I project that in that case we would reinvent Octopoes with some novel algorithms and some offloaded graph logic. I am fine with that route -- there is no hard requirement either way -- but if performance is a requirement, I would advise against it.
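To make that route concrete, here is a minimal sketch of what inference logic in Python on top of rustworkx graph routines could look like. The `run_bits` helper and the BIT signature (a callable yielding derived payloads) are hypothetical; `PyDiGraph`, `add_node`, `add_edge`, and `node_indices` are existing rustworkx API.

```python
# Minimal sketch: inference logic in Python, graph routines in rustworkx.
# The BIT signature (a callable yielding derived payloads) is hypothetical.
import rustworkx as rx

def run_bits(graph: rx.PyDiGraph, bits) -> None:
    """Apply every BIT to every node payload, recursively, until no new
    (deduplicated) objects are derived."""
    seen = {graph[i]: i for i in graph.node_indices()}  # payload -> index
    frontier = list(graph.node_indices())
    while frontier:
        next_frontier = []
        for idx in frontier:
            for bit in bits:
                for derived in bit(graph[idx]):
                    if derived not in seen:
                        seen[derived] = graph.add_node(derived)
                        next_frontier.append(seen[derived])
                    # record which BIT generated this edge as the edge payload
                    graph.add_edge(idx, seen[derived], {"origin": bit.__name__})
        frontier = next_frontier
```

Payloads need to be hashable for the deduplication to work; in exchange, all of rustworkx's traversal and path routines come for free on the resulting graph.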
> And do we really need to fork rustworkx instead of extending it without forking?
> I think logical inference and merging are something beyond a graph library such as rustworkx or petgraph.
That we need to write our own code for this doesn't mean we have to fork the library. Why can't we just do it by, for example, adding our own traits in Rust or subclassing in Python? I don't understand why we can't use the existing graph data structures and code, and instead need to fork the library and make changes to it.
What I mean by "forking" is that we will have a rustworkx clone as a starting point and adapt it to our needs -- and offload all the applicable graph logic to petgraph. We will not have to maintain petgraph, nor rustworkx, if that is your concern. As I see it, the implementation will be very similar to Novempuss but written in Rust with a Python interface, similar to rustworkx.
While I appreciate the idea of writing your own graph database, I fail to see how this can be used in a highly available and scalable production environment. Furthermore, I don't think that the scope of OpenKAT should include its own database engine, nor the forking of libraries to make such a database engine work.
Or, to put it in other words, why is (investing in) Novempuss better than using any of the currently available graph databases?
> Or, to put it in other words, why is (investing in) Novempuss better than using any of the currently available graph databases?
If you can find an off-the-shelf inferencing graph database that suits our needs, we are all ears.
A list can be found at https://en.wikipedia.org/wiki/Graph_database, and I am missing a comparison of these systems with Novempuss, which would help with the product selection.
Personally, I would like to know why writing your own database engine is more useful than learning/using an off-the-shelf component. Furthermore, this will be a daunting task, especially considering the scale at which KAT will be deployed. How can this product be scaled to multiple systems? How does it scale to multiple datacenters? And above all, how fast can we go to production using this system?
What we would like is an atomic graph database on which various programmable "business rules" can do recursive inference. As far as we are aware, there is no such solution on the market. The reason OpenKAT chose this route is to -- at least theoretically -- be able to explore all viable combinatorial attack vectors.
As for the latter concern, as this ticket mentions multiple times, the storage solution will be offloaded either to XTDB for bitemporality, or directly to RocksDB, which can in turn be distributed (https://github.com/rockset/rocksdb-cloud), or we can choose to use "buckets" directly... in my opinion, storage is really not the concern here.
What other ways are there to implement these business rules? Why should OpenKAT be the pioneer for such a technique? What other ways are there to detect these attack vectors? And does the implementation of this feature warrant delays in getting KAT ready for large-scale scanning?
For instance, Neo4J does inference out of the box (e.g. https://neo4j.com/labs/neosemantics/4.0/inference/) and has both horizontal and vertical scaling. Why is this tool unsuitable for what KAT is trying to achieve? Why is any other existing graph database not suitable for this task?
@r3boot, thanks for all your questions and concerns. I invite you to read https://docs.openkat.nl/developer_documentation/octopoes_research.html as an elaboration of what OpenKAT wants with Octopoes/Novempuss (and beyond) and why this is beneficial for such a "Knowledge Acquisition Tool".
We are currently working on making the Octopoes architecture such that it can handle a large number of objects with high throughput (optionally attached to a large clustered storage). This ticket is a discussion ticket towards that goal.
Regarding Neo4j, we are aware of its capabilities and are taking it into consideration. At this point in time, XTDB still has our preference because of the bitemporality we would like to support in the future.
The Future of Octopoes
Introduction
Octopoes is the very core of OpenKAT's functionality. It contains all the relevant data gathered, whether by declaration (user input), observation (by Boefjes), inference (by BITs), or affirmation (by Hydration), together with a possible origin and scan-profile data. Data objects (so-called OOIs), origins, and scan-profile data are modelled in Octopoes by structures that carry the relevant data in fields, together with references that point to other data objects. Hence, together these data objects form a graph. On this graph, logic can be applied, such as logical inference (through BITs), neighbor queries, path queries, etc. The reason OpenKAT needs this functionality is that OpenKAT seeks to find all relevant combinatorial "attack" vectors of an object under analysis. Currently, Octopoes offloads its data objects to XTDB, which is essentially a bitemporal and referencing layer on top of a database storage implementation like RocksDB.
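As an illustration of the data model described above (the field names are hypothetical and do not reflect the actual Octopoes schema), the objects could be modelled roughly as:

```python
# Illustrative sketch of the data model described above; the field names
# are hypothetical and do not reflect the actual Octopoes schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class OOI:
    """An object of interest: typed data plus references to other OOIs."""
    object_type: str                  # e.g. "Hostname" or "IPAddress"
    primary_key: str                  # unique identifier within the graph
    references: tuple[str, ...] = ()  # primary keys of related OOIs

@dataclass(frozen=True)
class Origin:
    """Provenance of an OOI: declared, observed, inferred, or affirmed."""
    method: str                       # e.g. "declaration" or "bit/dns-resolver"
    source: str                       # primary key of the originating OOI
    results: tuple[str, ...] = ()     # primary keys of the yielded OOIs
```

The references turn the collection of objects into a directed graph; the origins annotate how its edges came to be.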
At its core, the architecture of Octopoes works as follows: the Octopoes server, controlled by a client, handles ingesting and querying OOIs. All the graph and inference logic is done within Octopoes, and queries (with possible reference linking) are done on XTDB. XTDB also works with a client-server model. OpenKAT uses the HTTP interface, which uses either JSON or EDN. All advanced queries are done using EDN over HTTP.
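For concreteness, an EDN-over-HTTP query could look as follows. This is a sketch assuming the XTDB 1.x HTTP module's /_xtdb/query endpoint; the host, port, and the :object_type attribute are illustrative, not the actual Octopoes schema.

```python
# Sketch of an EDN Datalog query over XTDB's HTTP interface, assuming the
# XTDB 1.x HTTP module; host, port, and attribute names are illustrative.
import requests

XTDB_URL = "http://localhost:3000/_xtdb/query"

edn_query = """
{:query {:find [?e]
         :where [[?e :object_type "Hostname"]]}}
"""

resp = requests.post(
    XTDB_URL,
    data=edn_query,
    headers={"Content-Type": "application/edn", "Accept": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # list of matching entity ids
```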
The implementation of Octopoes has been very successful, as it is the main building block of OpenKAT. However, as the OpenKAT ecosystem has changed heavily, new needs, concerns, and problems have emerged from the current Octopoes implementation.
Needs, Concerns, and Problems in Octopoes
Learning from Octopoes
Novempuss
Novempuss is a prototype building on the lessons learned from Octopoes, with the aim of eventually bringing it to the next level. As a prototype, Novempuss is currently written in Julia and hosted at https://github.com/originalsouth/Novempuss.
It is designed to be an atomic version of Octopoes where graphlets (subgraphs) are worked on individually and combined. In principle, graphlets can be processed in parallel.
Architecture
Novempuss implements graphlets as essentially a set of nodes and directed edges, with the edges carrying generator data. Upon creation of a graphlet, all the logical inferences driving its graph creation are derived recursively and consolidated. Many graph operations can be implemented locally on a graphlet or a set of graphlets. Graphlets can be merged and differed (added and subtracted) to form new graphlets. A (logical) graph database is then a collection of roots (initial nodes) and graphlets. To derive information from the full graph, graph/path queries can be done on a graphlet formed by the relevant roots. By integrating into the Julia ecosystem and making use of an existing common graph library, many routines are available for graphlets. In addition, graphlets can easily be visualized, building on the existing plotting infrastructure.
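A minimal sketch of the graphlet structure described above (the names and the exact diff semantics are illustrative, not the Novempuss API): nodes plus generator-carrying edges, with merging and differencing as plain set algebra.

```python
# Sketch of a graphlet: a set of nodes plus directed edges that carry
# generator metadata. Names and semantics are illustrative, not the
# Novempuss API.
from dataclasses import dataclass

Edge = tuple[str, str, str]  # (source, target, generator)

@dataclass(frozen=True)
class Graphlet:
    nodes: frozenset[str]
    edges: frozenset[Edge]

    def merge(self, other: "Graphlet") -> "Graphlet":
        """Graphlet addition: union of nodes and edges."""
        return Graphlet(self.nodes | other.nodes, self.edges | other.edges)

    def diff(self, other: "Graphlet") -> "Graphlet":
        """Graphlet subtraction: drop the other's nodes and edges, but keep
        nodes that a surviving edge still touches."""
        edges = self.edges - other.edges
        touched = {n for src, dst, _ in edges for n in (src, dst)}
        return Graphlet(frozenset((self.nodes - other.nodes) | touched), edges)
```

Because merge and diff are pure set operations on immutable values, graphlets built from disjoint roots can be computed independently and combined afterwards, which is what makes the parallel processing mentioned earlier possible.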
This design addresses most of the issues currently present in Octopoes:
In addition:
Beyond Octopoes
Learning from both Octopoes and the Novempuss prototype, we can sketch a roadmap towards a new and improved Octopoes.
Although graph databases are abundant, logical inference within a graph database seems rather unique to Octopoes/Novempuss (or is otherwise somewhat limited, see https://neo4j.com/labs/neosemantics/4.0/inference/). A directed graph can easily be implemented by a set of nodes with a set of edges connecting these nodes. Origins can be registered as "metadata" next to the edges (as, for instance, a dictionary would). Logical inference is achieved by recursively running BITs on the acquired objects. This is the essential core of a logical inference graph database. Querying is done by traversing the inferred graph according to query specifications. Novempuss optimizes the graph database by breaking up the graph into graphlets.
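Querying by traversal, as described above, can be illustrated with a breadth-first search over a plain adjacency map -- a simplified stand-in for the real query engine, with illustrative names:

```python
# Sketch of querying by traversal over the inferred graph: collect all
# nodes reachable from a root that satisfy a predicate. The adjacency-map
# representation and names are illustrative.
from collections import deque

def query(adjacency: dict[str, set[str]], root: str, predicate) -> set[str]:
    """Breadth-first traversal from `root`, returning matching nodes."""
    seen, hits = {root}, set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if predicate(node):
            hits.add(node)
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return hits
```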
A roadmap
Migrating from Octopoes to a novel solution encompassing the findings of the Novempuss prototype will be a sizable task where certain decisions will need to be made.
With these questions answered, a roadmap towards an improved Octopoes can be achieved as follows:
Language
As mentioned, the current implementation of Novempuss is written in Julia. This was done both for productivity and performance reasons, as Julia is designed to optimize for both worlds; see https://scientificcoder.com/how-to-solve-the-two-language-problem. That said, ideally, writing a full-fledged database optimized for performance should be done in a properly compiled language such as C++ or Rust -- CUDA/OpenCL or the like can be added for acceleration. Both languages have excellent graph support.
Implementing the database in Python can be done, and there are even moderately fast libraries available, like:
There are two issues to consider:
When performance is a concern, Python is far from ideal. This is why many Python libraries are written in C/C++/Rust and the like.
What could be done is forking https://github.com/Qiskit/rustworkx and adapting it to our needs; rustworkx itself is, however, written in Rust.