minvws / nl-kat-coordination


Data inconsistencies in Octopoes: a proposal to retire celery and validate the model continuously #3585

Open originalsouth opened 3 days ago

originalsouth commented 3 days ago

Data inconsistencies in Octopoes: a proposal to retire celery and validate the model continuously.

Describe the bug

Since the introduction of VisualOctopoesStudio, several bugs regarding Octopoes' data model have come to light.

A subset of these bugs (#3498, #3564, and #3577) addressed the various mechanisms through which dangling self-proving OOIs can occur, as shown in the attached image, causing all kinds of bugs like #3205.

With fixes for these bugs merged we still sporadically obtain such a self-proving OOI on the current main:

{
  "KATFindingType/description": "This nameserver does not have an IPv6 address.",
  "KATFindingType/id": "KAT-NAMESERVER-NO-IPV6",
  "KATFindingType/impact": "Some users may not be able to reach your website without IPv6 support.",
  "KATFindingType/primary_key": "KATFindingType|KAT-NAMESERVER-NO-IPV6",
  "KATFindingType/recommendation": "Ensure that the nameserver has an IPv6 address that can be reached.",
  "KATFindingType/risk_score": 0,
  "KATFindingType/risk_severity": "recommendation",
  "KATFindingType/source": "https://www.rfc-editor.org/rfc/rfc3901.txt",
  "object_type": "KATFindingType",
  "user_id": null,
  "xt/id": "KATFindingType|KAT-NAMESERVER-NO-IPV6"
}

With its history:

[
  {
    "contentHash": "3edeec9adadc2bbb8a7fa5d28c577ba46ba20881",
    "doc": {
      "KATFindingType/id": "KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/primary_key": "KATFindingType|KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/risk_score": 0.0,
      "KATFindingType/risk_severity": "pending",
      "object_type": "KATFindingType",
      "user_id": null,
      "xt/id": "KATFindingType|KAT-NAMESERVER-NO-IPV6"
    },
    "txId": 50,
    "txTime": "2024-09-26T08:24:18Z",
    "validTime": "2024-09-26T08:24:18Z"
  },
  {
    "contentHash": "3edeec9adadc2bbb8a7fa5d28c577ba46ba20881",
    "doc": {
      "KATFindingType/id": "KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/primary_key": "KATFindingType|KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/risk_score": 0.0,
      "KATFindingType/risk_severity": "pending",
      "object_type": "KATFindingType",
      "user_id": null,
      "xt/id": "KATFindingType|KAT-NAMESERVER-NO-IPV6"
    },
    "txId": 51,
    "txTime": "2024-09-26T08:24:18Z",
    "validTime": "2024-09-26T08:24:18Z"
  },
  {
    "contentHash": "3edeec9adadc2bbb8a7fa5d28c577ba46ba20881",
    "doc": {
      "KATFindingType/id": "KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/primary_key": "KATFindingType|KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/risk_score": 0.0,
      "KATFindingType/risk_severity": "pending",
      "object_type": "KATFindingType",
      "user_id": null,
      "xt/id": "KATFindingType|KAT-NAMESERVER-NO-IPV6"
    },
    "txId": 52,
    "txTime": "2024-09-26T08:24:18Z",
    "validTime": "2024-09-26T08:24:18Z"
  },
  {
    "contentHash": "0000000000000000000000000000000000000000",
    "doc": null,
    "txId": 93,
    "txTime": "2024-09-26T08:24:35Z",
    "validTime": "2024-09-26T08:24:26Z"
  },
  {
    "contentHash": "591cee413cbbb4bb17d003563afa0cd32e31dc41",
    "doc": {
      "KATFindingType/description": "This nameserver does not have an IPv6 address.",
      "KATFindingType/id": "KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/impact": "Some users may not be able to reach your website without IPv6 support.",
      "KATFindingType/primary_key": "KATFindingType|KAT-NAMESERVER-NO-IPV6",
      "KATFindingType/recommendation": "Ensure that the nameserver has an IPv6 address that can be reached.",
      "KATFindingType/risk_score": 0.0,
      "KATFindingType/risk_severity": "recommendation",
      "KATFindingType/source": "https://www.rfc-editor.org/rfc/rfc3901.txt",
      "object_type": "KATFindingType",
      "user_id": null,
      "xt/id": "KATFindingType|KAT-NAMESERVER-NO-IPV6"
    },
    "txId": 66,
    "txTime": "2024-09-26T08:24:34Z",
    "validTime": "2024-09-26T08:24:27Z"
  }
] 

And the Origin's history (as there is only one transaction, this implicitly also shows the Origin itself):

[
  {
    "contentHash": "b7bd2966301c32f40b8e1dbdd36ccc8e40c3d540",
    "doc": {
      "method": "kat_kat_finding_types_normalize",
      "origin_type": "affirmation",
      "result": [
        "KATFindingType|KAT-NAMESERVER-NO-IPV6"
      ],
      "source": "KATFindingType|KAT-NAMESERVER-NO-IPV6",
      "source_method": "kat-finding-types",
      "task_id": "481e45e5-6dad-4f1e-8e5e-67c8442989aa",
      "type": "Origin",
      "xt/id": "Origin|affirmation|kat_kat_finding_types_normalize|kat-finding-types|KATFindingType|KAT-NAMESERVER-NO-IPV6"
    },
    "txId": 66,
    "txTime": "2024-09-26T08:24:34Z",
    "validTime": "2024-09-26T08:24:27Z"
  }
]

(Note that an XTDB transaction can contain multiple entities.)

In the history of the OOI there is something odd, namely a 9 second lag between its validTime and its txTime. This is caused by several factors playing together:

  1. The definition of "now" as the valid time is often ambiguous: sometimes it is propagated all the way from Rocky and sometimes it is generated on the spot, causing differences in lapsed time between operations and transactions.
  2. Some transactions in Octopoes are performed directly in the main thread, while others are handled by the event manager, which invokes Celery; the Celery transactions are queued to parallel workers, and these workers are, as is typical for parallel workers, not synchronized.
  3. Affirmation of objects takes a relatively long time, as affirmations are processed by Boefjes under control of the scheduler.

What seems to be happening graphically is:

---
title: Transaction graph
---
%%{init: { 'gitGraph': { 'mainBranchName': 'Octopoes'}} }%%
gitGraph:
   commit id: "Initial OOI 0"
   commit id: "Initial OOI 1" type: HIGHLIGHT
   branch "Affirmation"
   commit id: "Affirmation queued"
   checkout "Octopoes"
   commit id: "OOI delete queued"
   branch "Deletion"
   checkout "Affirmation"
   commit id: "Affirmation processed"
   checkout "Octopoes"
   merge "Affirmation"
   commit id: "Affirmed OOI"
   checkout "Deletion"
   commit id: "OOI delete at queue time"

where the timing of the deletion event and the affirmation is such that after the deletion is queued (given its validTime), the OOI is affirmed (and by the affirmation implicitly recreated), and only then is the deletion executed (for the previously mentioned validTime).
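
To make the timing concrete: the deletion transaction (txId 93) was executed at txTime 08:24:35 but targets validTime 08:24:26, while the affirmation (txId 66) was executed earlier, at txTime 08:24:34, with validTime 08:24:27. Because XTDB resolves the visible state along the valid-time axis, the affirmation ends up after the deletion and the OOI survives. A toy replay of that resolution (deliberately simplified, this is not how XTDB is actually queried):

```python
# Toy replay of the history above. Each entry: (validTime, txTime, doc);
# doc None means the document was deleted at that valid time.
history = [
    ("2024-09-26T08:24:18Z", "2024-09-26T08:24:18Z", {"risk_severity": "pending"}),
    ("2024-09-26T08:24:26Z", "2024-09-26T08:24:35Z", None),  # deletion, executed last in tx time
    ("2024-09-26T08:24:27Z", "2024-09-26T08:24:34Z", {"risk_severity": "recommendation"}),  # affirmation
]

# Simplified resolution: the visible document is the one with the latest validTime,
# regardless of the order in which the transactions were executed.
latest = max(history, key=lambda entry: entry[0])
print(latest[2])  # -> the affirmed document: the "deleted" OOI is still visible
```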

Proposed resolution(s)

  1. Retire Celery. The event manager in Octopoes uses Celery as a worker pool. Celery has been a source of issues within Octopoes, see for instance #2171, where the upstream Celery/Billiard issue (https://github.com/celery/billiard/issues/399) remains untouched. While Celery has nice features, it seems overkill for our case and a source of delay, in this case accumulating up to 9s. To mitigate this behavior we would like a fast thread pool that can work in parallel but does not change the order of creation and deletion events on the same "inference-spacetimeline", as that violates causality. In addition, we would like Octopoes to be able to query the event queue, so that it can block or reject certain findings based on already issued deletion events. As far as we know, Celery has no trivial way to query the queue like that. All of this can easily be done with a custom thread pool implementation managed by Octopoes, retiring Celery, and thus we propose to do so (a minimal sketch of such a worker follows after this list).

  2. Validate the model continuously. Similar to a filesystem, we ideally never have any errors, but if errors do occur we would like to have the tools to detect them, and possibly fix them. Currently we have neither in Octopoes. We propose to implement a thread that, with low priority, validates the current Octopoes state for (logical) inconsistencies; once one is found, a user can opt to have it fixed automatically where possible, or fix/mitigate the error themselves. Such a tool within Octopoes will make OpenKAT both more reliable and more transparent; additionally, it is an excellent way for an OpenKAT system administrator to file well-documented issues should such errors occur (a sketch of such a validator also follows below).
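
To make item 1 concrete, a minimal sketch of the kind of worker we have in mind; the Event and OrderedEventWorker names are made up for illustration and the event shape is an assumption, not existing Octopoes types. A single consumer thread preserves submission order, and the pending queue can be inspected, for example to reject an affirmation for an OOI that has a deletion queued:

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    reference: str  # OOI primary key (hypothetical event shape)
    operation: str  # e.g. "create", "delete", "affirm"


class OrderedEventWorker:
    """Single consumer thread: events are handled strictly in submission order."""

    def __init__(self, handler: Callable[[Event], None]):
        self._queue: "queue.Queue[Event]" = queue.Queue()
        self._handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, event: Event) -> None:
        self._queue.put(event)

    def pending(self) -> list[Event]:
        # Unlike Celery, the pending events can be inspected directly
        # (this peeks at CPython's Queue internals, which is fine for a sketch).
        with self._queue.mutex:
            return list(self._queue.queue)

    def has_pending_deletion(self, reference: str) -> bool:
        return any(e.operation == "delete" and e.reference == reference for e in self.pending())

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._handler(event)
            finally:
                self._queue.task_done()
```

A single consumer obviously gives up parallelism; sharding the queue by OOI reference would restore concurrency while still keeping per-OOI ordering.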
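
For item 2, a sketch of a low-priority validator for the "self-proving" case from this issue (an OOI whose only origin is its own affirmation). The get_objects/get_origins callables and the dictionary keys are placeholders for whatever the Octopoes storage layer exposes; the point is the shape of the loop:

```python
import threading
import time


def find_self_proving_oois(objects: list[dict], origins: list[dict]) -> list[str]:
    # An OOI counts as consistent if at least one non-affirmation origin proves it;
    # everything else is a candidate dangling/self-proving OOI.
    proven = {
        result
        for origin in origins
        if origin.get("origin_type") != "affirmation"
        for result in origin.get("result", [])
    }
    return [obj["primary_key"] for obj in objects if obj["primary_key"] not in proven]


def validation_loop(get_objects, get_origins, report, interval: float = 300.0) -> None:
    # Low priority by construction: one pass, then sleep, so it never competes
    # with normal Octopoes traffic.
    while True:
        dangling = find_self_proving_oois(get_objects(), get_origins())
        if dangling:
            report(dangling)  # surface to the administrator; optionally fix automatically
        time.sleep(interval)


# threading.Thread(target=validation_loop, args=(get_objects, get_origins, print), daemon=True).start()
```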

OpenKAT version: main

dekkers commented 3 days ago

If I understand it correctly the problem is that an affirmation and deletion can happen at the same time and the affirmation can overwrite the deletion. Deleting is probably only part of the problem, because as far as I can see this can also happen if the affirmation and an update conflict, because an affirmation saves the whole object and potentially overwrites the data of the update.

This is a pretty common problem with databases and concurrency, and the usual solution is to use transactions to make sure the saved data is consistent. With XTDB we can do that using match in v1 or using ASSERT in v2. The match/assert should guard against an earlier/concurrent transaction doing conflicting changes. This should prevent saving an affirmation for an already deleted object.
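
For illustration, roughly what such a guard could look like through XTDB v1's HTTP API; the endpoint path and the JSON encoding of the match/put tx-ops here are my recollection of the XTDB docs, not verified against the Octopoes XTDB client, so treat it as a sketch:

```python
import requests

XTDB_URL = "http://localhost:3000"  # placeholder for the Octopoes XTDB node


def affirm_with_guard(expected_doc: dict, affirmed_doc: dict) -> None:
    """Write the affirmed document only if the entity still matches what we read earlier.

    If an earlier/concurrent transaction deleted or changed the entity, the match
    fails and the whole transaction becomes a no-op, so the affirmation cannot
    resurrect an already deleted OOI.
    """
    tx_ops = [
        ["match", expected_doc["xt/id"], expected_doc],  # guard: current state must equal expected_doc
        ["put", affirmed_doc],
    ]
    response = requests.post(f"{XTDB_URL}/_xtdb/submit-tx", json={"tx-ops": tx_ops})
    response.raise_for_status()
```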

Other than that I disagree that what Celery currently does can be easily done with a threadpool, because we also need to take into account race conditions, resilience against crashes, and scalability. Maybe it can be done with a threadpool, but I don't think we should think about it as something that is easy to do. Also note that a "fast thread pool that can work in parallel" does not exist in Python, if what is meant is executing Python code in parallel, because of the GIL. And it will still take a few years before there is a Python without the GIL that we can use...

originalsouth commented 3 days ago

Thanks @dekkers for your comment and concerns.

> If I understand it correctly the problem is that an affirmation and deletion can happen at the same time and the affirmation can overwrite the deletion. Deleting is probably only part of the problem, because as far as I can see this can also happen if the affirmation and an update conflict, because an affirmation saves the whole object and potentially overwrites the data of the update.

"Same time" could be somewhat misleading; the point is more that causality, i.e. the order of transactions, is not preserved by the mix of various mechanisms launched by Octopoes. Indeed, an affirmation resaves the whole OOI.

> This is a pretty common problem with databases and concurrency, and the usual solution is to use transactions to make sure the saved data is consistent. With XTDB we can do that using match in v1 or using ASSERT in v2. The match/assert should guard against an earlier/concurrent transaction doing conflicting changes. This should prevent saving an affirmation for an already deleted object.

I am aware of the various "atomic" methods one can apply to prevent data from being modified in parallel. I do not see, however, how this solves our problem. Note that in this case the OOI is retroactively deleted in the past (from the future, if that makes sense). It is more a problem of logic within Octopoes than of putting a simple lock on a transaction, because an object can be legitimately deleted and then reintroduced. Fundamentally this logic has to be assessed by Octopoes.

> Other than that I disagree that what Celery currently does can be easily done with a threadpool, because we also need to take into account race conditions, resilience against crashes, and scalability. Maybe it can be done with a threadpool, but I don't think we should think about it as something that is easy to do. Also note that a "fast thread pool that can work in parallel" does not exist in Python, if what is meant is executing Python code in parallel, because of the GIL. And it will still take a few years before there is a Python without the GIL that we can use...

While I agree that it is a terrible idea to write anything of this sort in Python, as stated many, many times before, Celery has been a source of frustration throughout the Octopoes project -- beyond my own experience, this is something I also gathered from various developers in the team. Apart from that, I do not see how else we can reduce the overhead in calls and the long delays in execution, query the queue, and manage the queue execution priority by transaction type (as alluded to above). That said, it is my opinion that the GIL concern has limited bearing on what can be done to address the issue... sure, our thread pool will not be truly parallel, but we can definitely make it parallel enough, as we are mostly making calls to XTDB. Alternatively, if we want true parallelism, we can spawn processes or use any normal modern language other than Python that is actually suited for the task (as Celery/Billiard also does). See also https://superfastpython.com/threadpool-python or particularly https://superfastpython.com/threadpool-python/#What_About_the_Global_Interpreter_Lock_GIL.
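
To illustrate the "parallel enough" point: the calls to XTDB are I/O-bound and the GIL is released while the threads wait on the network, so even a plain thread pool overlaps them. A minimal sketch (URL and payloads are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def submit_tx(tx_ops: list) -> int:
    # The GIL is released while waiting on the HTTP response, so these
    # requests overlap even though they run in Python threads.
    response = requests.post("http://localhost:3000/_xtdb/submit-tx", json={"tx-ops": tx_ops})
    return response.status_code


with ThreadPoolExecutor(max_workers=8) as pool:
    statuses = list(pool.map(submit_tx, [[]] * 8))
```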

Thanks.