We should use a ReportOOI entity per Report, or update a Report in-place and traverse the history API

Donnype commented 1 month ago

To be discussed

Given some new ideas/input on reporting, and especially given the fact that we want to re-use ReportOOIs as per the conclusion at the bottom of this ticket, some practical issues have arisen from the current data model. The parent-report vs. subreport dynamic is a bit strange in the new scenario because the parent-child relation is no longer one-to-many, but many-to-many. This is because I could collect the same dns-report on example.org in multiple parent reports. Also, would we have to re-use parent OOIs as well? This only makes sense if the parent points to (a list of) child reports.

The question arises: should we refactor the ReportOOIs data model? And should we somehow refactor so we could stop checking if we should create a parent ReportOOI or not? We could perhaps do that by making every Report have a parent, or always have a list of "child" reports that in some cases is just 1 report.
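To make the "parent always holds a list of child reports" idea concrete, here is a minimal sketch. It uses plain dataclasses instead of the real OOI base class, and `ReportSketch`, `child_reports` and `parents_of` are hypothetical names that do not exist in the codebase.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReportSketch:
    """Hypothetical stand-in for a Report OOI; not the real model."""

    primary_key: str
    report_type: str
    # The parent points to its children, so one child can be shared by many parents.
    child_reports: list[str] = field(default_factory=list)


def parents_of(child_key: str, reports: list[ReportSketch]) -> list[str]:
    """Return the keys of all reports that list child_key as a child."""
    return [r.primary_key for r in reports if child_key in r.child_reports]


# The same dns-report on example.org is collected in two parent reports.
dns = ReportSketch("dns-report|example.org", "dns-report")
weekly = ReportSketch("aggregate|weekly", "aggregate", [dns.primary_key])
monthly = ReportSketch("aggregate|monthly", "aggregate", [dns.primary_key])
```

With the reference on the parent side, there is no need for a `has_parent` flag on the child, and a parent with a single child is just a list of length 1.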

(My 2 cents still: do not re-use ReportOOIs :P)

Per Report, we now create a ReportOOI:

class Report(OOI):
    object_type: Literal["Report"] = "Report"

    name: str
    report_type: str
    template: str | None = None
    date_generated: datetime

    input_oois: list[str]

    report_id: UUID

    organization_code: str
    organization_name: str
    organization_tags: list[str]
    data_raw_id: str

    observed_at: datetime
    parent_report: Reference | None = ReferenceField("Report", default=None)
    report_recipe: Reference | None = ReferenceField("ReportRecipe", default=None)
    has_parent: bool

# as a reference:
class ReportRecipe(OOI):
    object_type: Literal["ReportRecipe"] = "ReportRecipe"

    recipe_id: UUID

    report_name_format: str
    subreport_name_format: str | None = None

    input_recipe: dict[str, Any]  # can contain a query which maintains a live set of OOIs or manually picked OOIs.
    parent_report_type: str | None = None
    report_types: list[str]

    cron_expression: str

    _natural_key_attrs = ["recipe_id"]

This means that on every run of a recipe, we add a new object to XTDB (and Bytes) that has its own history, and which shows up as a separate row in the current table overview.

Another option is to update a ReportOOI in-place, which shortens the list but means we would have to query the history API more often. The question is if we should change the implementation to perform

Pros and Cons

Pros

We have a smaller reports overview page.
Seeing changes in a single report over time is more natively supported through the history API.
We could re-use this per-entity history view logic for OOI detail pages.

Cons

(@Donnype: When I try to model for XTDB, I first ask myself if it makes sense to save an entity over time in a regular relational database, which in the case of a Report I have to answer with "Yes". Older versions of reports are too important to hide in a historic version.)

Any query involving looking for older versions of reports becomes more complicated and/or slow(er):
- "Give me the reports generated between 10-10-2024 and 21-10-2024" implies fetching the history of every report.
You always need to provide timestamps or timestamped urls when sharing or talking about historic reports.
You cannot provide a proper per-generated-report overview (i.e. query across historic reports) unless you start filtering and querying in memory over the history API.
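The difference between the two models for the date-range query above can be illustrated with a toy in-memory sketch. This does not use the XTDB or Octopoes history API; the data structures and function names are made up for illustration only.

```python
from datetime import datetime

# Model 1: one object per run. Every generated report is its own row.
reports_per_run = [
    {"id": "r1", "recipe": "a", "date_generated": datetime(2024, 10, 9)},
    {"id": "r2", "recipe": "a", "date_generated": datetime(2024, 10, 15)},
    {"id": "r3", "recipe": "b", "date_generated": datetime(2024, 10, 20)},
]

# Model 2: one object per recipe, updated in place. Older versions only
# exist as entries in that object's history (here: a list of timestamps).
history_per_report = {
    "report-a": [datetime(2024, 10, 9), datetime(2024, 10, 15)],
    "report-b": [datetime(2024, 10, 20)],
}


def generated_between_per_run(reports, start, end):
    """A single filter over the current objects."""
    return [r["id"] for r in reports if start <= r["date_generated"] <= end]


def generated_between_in_place(histories, start, end):
    """Walks the history of every report to find the matching versions."""
    return [
        (key, version)
        for key, versions in histories.items()
        for version in versions
        if start <= version <= end
    ]


start, end = datetime(2024, 10, 10), datetime(2024, 10, 21)
```

In the second model the result is a list of (report, timestamp) pairs, which is also why sharing a historic report requires a timestamp next to the report's ID.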

Conclusion after discussion on 25-10-2024

After a vote we concluded that we will re-use the Report OOI. We expect XTDB 2.0 to be able to resolve any use-cases that pop up and use the history API to traverse any other queries/use-cases. Both @dekkers and @underdarknl think using the history for new versions of a Report is a more adequate representation of the actual situation. (@Donnype thinks creating new objects is a more adequate representation that would also save us from potentially not being able to handle more intricate queries across reports.)

originalsouth commented 1 month ago

"Give me the reports generated between 10-10-2024 and 21-10-2024" implies fetching the history of every report.

If I understand correctly, only if the result changes between the two valid_times which can be queried.

Donnype commented 1 week ago

Questions:

I suppose we can update the input_oois field when reusing the report and see what the inputs were over time?
A report recipe does not have an organization code/name/etc. Is it safe to assume that since it can only live in 1 XTDB database that it refers to only 1 organization?
Are these the fields we should filter on to get the right Report to update? Writing this down I just remembered once drawing the conclusion that the "ReportRecipe -> Report" relation would be one-to-one. Looking at the code I should correct that and say that this only holds for "ReportRecipe -> ParentReport" when we have multiple reports. So the steps to update the reports in place would be:
1. Determine if there are multiple reports by looking at the report_data
2. If there are multiple, find the parent report that points to this ReportRecipe, else search for a single report that points to this recipe
3. Find all subreports of this parent report if we have a parent
4. Based on the report type and input ooi, try to find the subreports that already exist to update those in-place as well
Or should we not update subreports in-place? It might be strange that the number of updates vs. creates happening during reporting can depend on the input OOIs that result from the input query.
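The four steps above could be sketched roughly as follows. This is not the actual implementation: `ReportRow` is a hypothetical, flattened view of a Report, the "multiple reports" check is simplified to counting the wanted (report type, input OOI) pairs instead of inspecting report_data, and it assumes only the top-level report points to the recipe.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReportRow:
    """Hypothetical flattened view of a Report OOI, for this sketch only."""

    key: str
    report_type: str
    input_ooi: str | None
    report_recipe: str | None = None
    parent_report: str | None = None


def plan_in_place_update(existing, recipe, wanted):
    """Split the wanted (report_type, input_ooi) pairs into updates and creates."""
    # Step 1: more than one pair means a parent report with subreports (assumption).
    multiple = len(wanted) > 1

    # Step 2: the parent (or single) report has no parent itself and points to the recipe.
    top = next(
        (r for r in existing if r.report_recipe == recipe and r.parent_report is None),
        None,
    )
    if not multiple:
        return top, {}, []

    # Step 3: collect the existing subreports of that parent, if there is one.
    subs = {
        (r.report_type, r.input_ooi): r.key
        for r in existing
        if top is not None and r.parent_report == top.key
    }

    # Step 4: matching subreports are updated in place, the rest are created.
    to_update = {pair: subs[pair] for pair in wanted if pair in subs}
    to_create = [pair for pair in wanted if pair not in subs]
    return top, to_update, to_create


existing = [
    ReportRow("parent", "aggregate", None, report_recipe="recipe-1"),
    ReportRow("sub-dns", "dns-report", "example.org", parent_report="parent"),
]
wanted = [("dns-report", "example.org"), ("dns-report", "example.com")]
top, to_update, to_create = plan_in_place_update(existing, "recipe-1", wanted)
```

The example shows the concern raised above: example.com only appeared in the latest query result, so this run mixes one update with one create.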

underdarknl commented 1 week ago

I'm looking at this in a different light, I think:

An aggregate report is the result of the underlying reports being combined. I think we currently have the relation the wrong way round. A recipe holds a list of input OOIs, or a query (resulting in a live list of input OOIs, possibly changing over time). The reporting job derived from the recipe's schedule produces, at a given moment in time, a set of asset-reports (input-ooi * report-type). Those reports should get a reference to the recipe that was used to create them, and a reference to the job (no news there, I suppose?). The asset-reports should get a (I think) deterministic OOI-ID, which encodes the input-ooi, the report type used, and any settings that change the actual data being stored (settings that only apply to the rendered version, but not the JSON, don't need to be part of the hash/OOI-ID). [1]

The aggregate report should now hold the underlying report OOIs' OOI-IDs and their valid-time at the time of combination as its input-OOI list. It could also hold a reference to the recipe or job that triggered the aggregation to run. If a given OOI is no longer available for the asset-reporting, we could include the most recent report or decide to not include the asset-report anymore. The underlying report might have been generated yesterday, or might have been manually uploaded by a user; in any case we do or don't include it in the input OOIs for the given aggregate-report.

1: This means multiple recipes and reporting jobs might write into the same report-OOI at different intervals. This would mean we are making more 'snapshots' of the data at those points.
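A deterministic OOI-ID for an asset-report could look something like the sketch below. The choice of SHA-256 over canonical JSON is an assumption made for illustration, and `asset_report_id` and `data_settings` are hypothetical names; nothing here was decided in this thread.

```python
import hashlib
import json


def asset_report_id(input_ooi: str, report_type: str, data_settings: dict) -> str:
    """Derive a stable ID from everything that changes the stored data.

    Render-only settings are deliberately left out, so they cannot change the ID.
    """
    payload = json.dumps(
        {"input_ooi": input_ooi, "report_type": report_type, "settings": data_settings},
        sort_keys=True,  # key order must not influence the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two recipes asking for the same asset-report end up with the same ID.
id_from_recipe_a = asset_report_id("Hostname|internet|example.org", "dns-report", {"depth": 1, "strict": True})
id_from_recipe_b = asset_report_id("Hostname|internet|example.org", "dns-report", {"strict": True, "depth": 1})
id_other_type = asset_report_id("Hostname|internet|example.org", "tls-report", {"depth": 1, "strict": True})
```

Because both recipes resolve to the same ID, their runs write into the same report-OOI, which is the "more snapshots" effect described in the footnote.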
