Update From Discussions

Benny will think about what is and is not feasible, including the move to xtdb2 and speeding up octopoesv2. On hold until Benny is back.

TODO:

Create an overview of the issue, some refinement and ideas
Meet with devs and stakeholders about priorities and refinement tickets
Implement concrete OOI Type filters in Octopoes

Advanced Octopoes Queries

Currently, external services query XTDB through the Octopoes API, while Octopoes queries XTDB directly. The current implementation has several limitations. This issue aims to capture those limitations and create a plan to improve query flexibility in order to:

Improve performance drastically by reducing or removing filtering/grouping in memory altogether
Make it easy to show only relevant information to the user, increasing productivity and usability
Give developers a better, less error-prone interface for doing more involved aggregations

User stories

The user stories could be:

As a KAT user, I want to filter Objects on type-specific fields, so I can easily find what I'm interested in.
As a KAT user, I want to do aggregation queries on my object graph, so I can easily report on totals, averages and counts for objects and findings.
As a KAT user, I want to know both when facts were valid and when facts were recorded, so I can create a detailed audit trail/log for reports to clients and auditors.

Query Limitations

API Limitations

The current Octopoes API implements several HTTP endpoints:

Endpoint	Methods
"/health"	["get"]
"/{client}/health"	["get"]
"/{client}/objects"	["get"]
"/{client}/objects/load_bulk"	["post"]
"/{client}/object"	["get"]
"/{client}/objects/random"	["get"]
"/{client}/"	["delete"]
"/{client}/objects/delete_many"	["post"]
"/{client}/tree"	["get"]
"/{client}/origins"	["get"]
"/{client}/origin_parameters"	["get"]
"/{client}/observations"	["post"]
"/{client}/declarations"	["post"]
"/{client}/scan_profiles"	["get", "put"]
"/{client}/scan_profiles/save_many"	["post"]
"/{client}/scan_profiles/recalculate"	["get"]
"/{client}/scan_profiles/inheritance"	["get"]
"/{client}/findings"	["get"]
"/{client}/findings/count_by_severity"	["get"]
"/{client}/node"	["post", "delete"]
"/{client}/bits/recalculate"	["post"]

In total, 10 of the 23 endpoints (counting each HTTP method separately) are dedicated to fetching one of the XTDB entities:

OOI (4)
Origin (1)
OriginParameter (1)
Finding (2)
ScanProfile (2)

Every endpoint supports valid-time filtering, and some have specialized filters: GET objects can be filtered on e.g. type and scan level, and GET findings can be filtered on severity.

This poses the following issues:

1. There is no way to filter on OOIType-specific fields with the objects endpoint. For example, you currently cannot find all open IpPort|80 for an organization.
2. A new endpoint has to be created for every aggregation, as was done for the Findings count_by_severity.
3. There are 4 ways to fetch generic OOIs.
4. There are no DELETE endpoints for several entity types.
5. As a side note: no real object history is exposed through the API, although even the current XTDB version already has some neat APIs that would be interesting to expose.
6. There is no transaction-time query support yet.

Possible API Solutions

To resolve issue 1, we could consider a few options:

We could generate endpoints dynamically for the schema, perhaps adding either DRF-like-serializer models or trying to do this generically for the known types in the schema. We would have endpoints for every specific OOIType.
Add generic filters to the objects endpoint. Given the potential XTDB query complexity, we would have to think carefully about URL-encoding such a query. This was recently changed in Mula because of similar difficulties.
Manually add endpoints for each OOIType.
Provide a more transparent proxy to XTDB where we allow (certain) users to pass complex queries directly to XTDB. A direct connection to XTDB would be a simple alternative to this.
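To make the generic-filter option more concrete, here is a minimal sketch of how type-specific field filters could be translated into an XTDB Datalog query once they have arrived as a JSON object (for instance in a request body, which avoids URL-encoding altogether). The function names are hypothetical, and the attribute layout (`object_type` plus `<Type>/<field>` keys) is an assumption about how OOIs are stored, not a confirmed part of the Octopoes schema.

```python
import json
from typing import Any


def edn_value(value: Any) -> str:
    """Render a Python scalar as an EDN literal (simple strings, numbers, booleans)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        # JSON string quoting is sufficient for simple EDN strings.
        return json.dumps(value, ensure_ascii=False)
    raise TypeError(f"unsupported filter value: {value!r}")


def build_type_filter_query(ooi_type: str, filters: dict[str, Any]) -> str:
    """Build a Datalog query for one concrete OOI type with field equality filters.

    Assumption: documents carry an `object_type` attribute and type-prefixed
    field attributes such as `IPPort/port`.
    """
    for name in (ooi_type, *filters):
        # Only plain identifiers may be interpolated into the query text.
        if not name.isidentifier():
            raise ValueError(f"invalid name: {name!r}")

    clauses = [f"[?e :object_type {edn_value(ooi_type)}]"]
    for field, value in sorted(filters.items()):
        clauses.append(f"[?e :{ooi_type}/{field} {edn_value(value)}]")
    where = " ".join(clauses)
    return f"{{:query {{:find [(pull ?e [*])] :where [{where}]}}}}"


# Filters as they might arrive in a JSON request body (illustrative values).
filters = json.loads('{"port": 80, "state": "open"}')
query = build_type_filter_query("IPPort", filters)
```

Restricting the filters to equality on a single concrete type keeps the translation trivial; ranges, negation or joins would need a richer filter format.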

To resolve issue 2, we could consider:

Building custom endpoints per needed aggregation (type?), collecting re-usable logic into perhaps some serializer model and always specifying generic JSON as a return type (as aggregations could be composed into result sets of an arbitrary schema).
A transparent proxy/direct access to XTDB (again).
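As an illustration of what pushing an aggregation down to XTDB could look like, the sketch below builds a Datalog query that uses an aggregate expression in `:find` to count objects per value of one field, and converts the resulting rows into a generic JSON-friendly dict. The helper names are hypothetical, and the `object_type` and `<Type>/<field>` attribute names are assumptions about the stored documents.

```python
def build_count_by_query(ooi_type: str, group_field: str) -> str:
    """Build a Datalog query that counts entities of one type per field value.

    Assumption: documents carry `object_type` and type-prefixed attributes
    such as `IPPort/protocol`.
    """
    for name in (ooi_type, group_field):
        # Only plain identifiers may be interpolated into the query text.
        if not name.isidentifier():
            raise ValueError(f"invalid name: {name!r}")
    return (
        "{:query {:find [?group (count ?e)] "
        f':where [[?e :object_type "{ooi_type}"] '
        f"[?e :{ooi_type}/{group_field} ?group]]}}}}"
    )


def rows_to_counts(rows: list[list]) -> dict:
    """Turn result rows of the form [group_value, count] into a dict."""
    return {group: count for group, count in rows}


query = build_count_by_query("IPPort", "protocol")
# Rows shaped like a query result; the values here are made up for illustration.
counts = rows_to_counts([["tcp", 3], ["udp", 1]])
```

With this approach the grouping and counting happen in the database, so only one row per group crosses the wire instead of every object.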

To resolve issue 3, we could probably phase out the random endpoint at some point, and the tree endpoint might be treated as a special case of the GET objects endpoint. The XTDB proxy/connection would also resolve this.

Resolving issue 4 is a matter of properly completing the implementation of one of the proposed solutions.

ORM Limitations

Within Octopoes, OOIs are both saved to and queried from XTDB directly. Given the current requirements and developments, this setup has some issues:

1. A significant number of queries are still built through string interpolation, which makes it hard to create a more generic interface around query building.
2. There is no functionality to filter OOITypes on type-specific fields in the ORM either.
3. generate_pull_query is quite complex and still exposes considerable query complexity.
4. There is no aggregation functionality built into the ORM.
5. It is quite hard to do joins via abstract types.
6. There is also no functionality to fetch an object's history at the ORM level.
7. There is no transaction-time query support yet.

Possible ORM Solutions

The new Query object provides some interesting possibilities here to resolve issues 1 to 4. Let me also shamelessly plug the package I've built, at least for some inspiration: it tries to introduce proper abstractions between writing generic Datalog queries in Python and creating an ORM with an easy API for users to perform more involved queries.
It is worth considering if we could restrict query APIs to concrete types only to resolve all the or-statements polluting the current queries.
We might want to extend the support for the API spec in the HTTP client to leverage in the ORM.
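To illustrate the direction of a query object over string interpolation, here is a small, self-contained sketch of a chainable builder restricted to one concrete type. `SketchQuery` is a hypothetical stand-in and does not mirror the actual Octopoes Query class; the `object_type` and `<Type>/<field>` attribute names are likewise assumptions.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, replace
from typing import Optional, Union


def _literal(value: Union[int, str]) -> str:
    """Render an int or simple string as an EDN literal."""
    return json.dumps(value, ensure_ascii=False) if isinstance(value, str) else str(value)


@dataclass(frozen=True)
class SketchQuery:
    """Immutable, chainable query description for a single concrete OOI type."""

    ooi_type: str
    clauses: tuple = ()
    max_results: Optional[int] = None

    def where(self, **fields: Union[int, str]) -> SketchQuery:
        # Each keyword becomes one equality clause on a type-prefixed attribute.
        new = tuple(
            f"[?e :{self.ooi_type}/{name} {_literal(value)}]"
            for name, value in sorted(fields.items())
        )
        return replace(self, clauses=self.clauses + new)

    def limit(self, n: int) -> SketchQuery:
        return replace(self, max_results=n)

    def format(self) -> str:
        # The type clause always comes first; a concrete type needs no or-branches.
        where = " ".join((f'[?e :object_type "{self.ooi_type}"]',) + self.clauses)
        limit = f" :limit {self.max_results}" if self.max_results is not None else ""
        return f"{{:query {{:find [(pull ?e [*])] :where [{where}]{limit}}}}}"


base = SketchQuery("IPPort").where(port=80)
limited = base.limit(10)
```

Because every method returns a new object, a base query can be shared and refined in several places without one caller affecting another, and the only place that produces query text is `format()`.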

Approach

Assuming we will not consider direct connections to XTDB from Rocky, we can break the user stories down into the following issues.

As a KAT user, I want to filter Objects on type-specific fields, so I can easily find what I'm interested in.

[ ] Design a new filter component, perhaps some compact drop-down-multiselect for:
- OOI Type
- Clearance Level
- Clearance Type
- And a form generated based on the type, with its fields (e.g.: IPPort - port, Network - name, HTTPHeader - key etc.)
[ ] XTDBOOIRepository.list() does 2 queries, but this can (probably) be a single query
[ ] Get rid of current string interpolations in favor of a Query object / Datalog support
[ ] Extend the OOIRepository.list() interface to support type-specific field queries
[ ] Either extend the /objects API to support type-specific field queries or generate endpoints per object type using serializers (see section Possible API Solutions)
[ ] Extend the OctopoesConnector.list() interface to support type-specific field queries

As a KAT user, I want to do aggregation queries on my object graph, so I can easily report on totals, averages and counts for objects and findings.

[ ] Design a page or component where this information can be exposed
[ ] Add an OOIRepository.query() method and think about the interface, perhaps using a built Query object
[ ] Design and Develop an Octopoes API to support more involved queries [to be refined]
[ ] Propagate the functionality to Rocky through the OctopoesConnector

As a KAT user, I want to know both when facts were valid and when facts were recorded, so I can create a detailed audit trail/log for reports to clients and auditors.

[ ] Add a transaction time filter component to the objects filters
[ ] Propagate this field everywhere we use valid-time as well.
[ ] Expose this information in a report or log. [to be refined]
[ ] Perhaps create separate reports/logs for this information to expose as a download in Rocky. [to be refined]
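As a rough sketch of propagating transaction time alongside valid time, the helper below builds the query-string part of a request to XTDB. The parameter names `valid-time` and `tx-time` reflect my understanding of the XTDB 1.x HTTP query endpoint and should be checked against the deployed version; the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlencode


def bitemporal_params(valid_time: datetime, tx_time: Optional[datetime] = None) -> str:
    """Encode valid time and (optionally) transaction time as URL query parameters.

    Assumption: the XTDB query endpoint accepts `valid-time` and `tx-time`.
    """
    for moment in (valid_time, tx_time):
        # Naive datetimes are ambiguous in an audit trail, so reject them.
        if moment is not None and moment.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")

    params = {"valid-time": valid_time.isoformat()}
    if tx_time is not None:
        params["tx-time"] = tx_time.isoformat()
    return urlencode(params)


# "What did we believe on 1 March about the state of the world on 1 January?"
encoded = bitemporal_params(
    valid_time=datetime(2023, 1, 1, tzinfo=timezone.utc),
    tx_time=datetime(2023, 3, 1, tzinfo=timezone.utc),
)
decoded = parse_qs(encoded)
```

Leaving `tx_time` optional keeps existing valid-time-only callers working while the new filter is rolled out.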

minvws / nl-kat-coordination