Open originalsouth opened 3 days ago
If I understand it correctly the problem is that an affirmation and deletion can happen at the same time and the affirmation can overwrite the deletion. Deleting is probably only part of the problem, because as far as I can see this can also happen if the affirmation and an update conflict, because an affirmation saves the whole object and potentially overwrites the data of the update.
This is a pretty common problem with databases and concurrency, and the usual solution is to use transactions to make sure the saved data is consistent. With XTDB we can do that using match in v1 or ASSERT in v2. The match/assert should guard against an earlier or concurrent transaction making conflicting changes. This should prevent saving an affirmation for an already deleted object.
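To make the match idea concrete, here is a minimal sketch of the semantics only. It uses a hypothetical in-memory `TinyStore`, not the XTDB client or HTTP API, and the OOI id and field names are made up for illustration. The write is applied only if the stored document still equals what the writer last saw.

```python
from typing import Optional


class TinyStore:
    """Hypothetical in-memory stand-in for a store with match-guarded writes."""

    def __init__(self) -> None:
        self.docs: dict[str, dict] = {}

    def put(self, eid: str, doc: dict) -> None:
        self.docs[eid] = dict(doc)

    def delete(self, eid: str) -> None:
        self.docs.pop(eid, None)

    def match_then_put(self, eid: str, expected: Optional[dict], new_doc: dict) -> bool:
        """Write new_doc only if the current document equals expected.

        Returns False and writes nothing when the precondition fails.
        """
        if self.docs.get(eid) != expected:
            return False
        self.docs[eid] = dict(new_doc)
        return True


store = TinyStore()
ooi = {"id": "Hostname|example.com", "name": "example.com"}  # made-up OOI
store.put(ooi["id"], ooi)

# The affirmation read the OOI earlier and intends to re-save it as-is.
seen_by_affirmation = dict(ooi)

# A deletion lands before the affirmation is written.
store.delete(ooi["id"])

# The guarded affirmation is rejected: the stored state no longer matches.
affirmation_applied = store.match_then_put(ooi["id"], seen_by_affirmation, seen_by_affirmation)
```

In this sketch the rejected affirmation leaves the OOI deleted. Whether the same guard is sufficient when the deletion is written for an earlier valid time is a separate question.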
Other than that I disagree that what celery currently does can be easily done with a threadpool, because we also need to take into account race conditions, resilience against crashes and scalability. Maybe it can be done with a threadpool, but I don't think we should think about it as something that is easy to do. Also note that a "fast thread pool that can work parallel" does not exist with Python if what is meant is executing Python code in parallel because of the GIL. And it will still take a few years before there is a Python without GIL that we can use...
Thanks @dekkers for your comment and concerns.
> If I understand it correctly the problem is that an affirmation and deletion can happen at the same time and the affirmation can overwrite the deletion. Deleting is probably only part of the problem, because as far as I can see this can also happen if the affirmation and an update conflict, because an affirmation saves the whole object and potentially overwrites the data of the update.
"Same time" could be somewhat misleading; the point is more that causality, as in the order of transactions, is not preserved by the mix of various mechanisms launched by Octopoes. Indeed, an affirmation resaves the whole OOI.
> This is a pretty common problem with databases and concurrency, and the usual solution is to use transactions to make sure the saved data is consistent. With XTDB we can do that using match in v1 or ASSERT in v2. The match/assert should guard against an earlier or concurrent transaction making conflicting changes. This should prevent saving an affirmation for an already deleted object.
I am aware of the various "atomic" methods one can apply to prevent data from being modified in parallel. I do not see, however, how this solves our problem. Note that in this case the OOI is retroactively deleted in the past (from the future, if that makes sense). It is more a problem of logic within Octopoes than of putting a simple lock on a transaction, because an object can be legitimately deleted and then reintroduced. Fundamentally, this logic has to be assessed by Octopoes.
> Other than that I disagree that what celery currently does can be easily done with a threadpool, because we also need to take into account race conditions, resilience against crashes and scalability. Maybe it can be done with a threadpool, but I don't think we should think about it as something that is easy to do. Also note that a "fast thread pool that can work parallel" does not exist with Python if what is meant is executing Python code in parallel because of the GIL. And it will still take a few years before there is a Python without GIL that we can use...
I agree that it is a terrible idea to write anything of this sort in Python, as stated many times before. Still, Celery has been a source of frustration throughout the Octopoes project; besides my own experience, this is something I also gathered from various developers in the team. Apart from that, I do not see how we can, with Celery, reduce the overhead in calls and the long delays in execution, query the queue, and manage the queue execution priority by transaction type (as alluded to above). That said, in my opinion the GIL only somewhat limits what can be done to address the issue: sure, our thread pool will not be truly parallel, but we can definitely make it parallel enough, as we are making calls to XTDB. Alternatively, if we want true parallelism, we can spawn processes (as Celery/Billiard also does) or use any normal modern language other than Python that is actually suited for the task. See also https://superfastpython.com/threadpool-python or particularly https://superfastpython.com/threadpool-python/#What_About_the_Global_Interpreter_Lock_GIL.
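A small sketch of the "parallel enough" point, assuming the work is dominated by waiting on XTDB over the network. The `fake_xtdb_call` function is a hypothetical stand-in that only sleeps; like blocking I/O, `time.sleep` releases the GIL, so the waits overlap across threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_xtdb_call(i: int) -> int:
    """Hypothetical stand-in for a blocking XTDB request."""
    time.sleep(0.2)  # simulated network wait; the GIL is released while sleeping
    return i


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Four calls run on four threads, so their waits overlap.
    results = list(pool.map(fake_xtdb_call, range(4)))
elapsed = time.perf_counter() - start
```

Run sequentially, the four calls would wait about 0.8 s in total; with the pool the waits overlap. This says nothing about CPU-bound Python code, where the GIL concern still applies.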
Thanks.
Data inconsistencies in Octopoes: a proposal to retire celery and validate the model continuously.
Describe the bug
Since VisualOctopoesStudio, several bugs regarding Octopoes' data model have come to light.
A subset of these bugs (#3498, #3564, and #3577) addressed the various mechanisms by which dangling self-proving OOIs can occur, like: [figure not preserved] causing all kinds of bugs, like #3205.
With fixes for these bugs merged we still sporadically obtain such a self-proving OOI on the current main:
With its history:
And the Origin's history (as there is only one transaction we show the Origin here implicitly):
(Note that an XTDB transaction can contain multiple entities.)
In the history of the OOI there is something odd, namely that there is a 9-second lag between its validTime and the txTime. This is caused by several factors at play:
What seems to be happening graphically is:
where the timing of the deletion event and the affirmation is such that, after the deletion is queued (for a given validTime), the OOI is affirmed (and by the affirmation implicitly recreated), and only after that is the deletion executed (for that previously mentioned validTime).
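The sequence above can be replayed with a toy bitemporal timeline. This is a sketch under assumptions: `BitemporalEntity` is a hypothetical model in which each write holds from its valid time until the next later valid-time entry, and the timestamps are invented to mirror the 9-second lag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    valid_time: int
    tx_time: int
    doc: Optional[dict]  # None marks a deletion


class BitemporalEntity:
    """Hypothetical toy timeline: each entry holds until the next valid-time entry."""

    def __init__(self) -> None:
        self.versions: list[Version] = []

    def write(self, valid_time: int, tx_time: int, doc: Optional[dict]) -> None:
        self.versions.append(Version(valid_time, tx_time, doc))

    def as_of(self, valid_time: int) -> Optional[dict]:
        visible = [v for v in self.versions if v.valid_time <= valid_time]
        if not visible:
            return None
        # Latest valid time wins; the later transaction breaks ties.
        return max(visible, key=lambda v: (v.valid_time, v.tx_time)).doc


ooi = BitemporalEntity()
doc = {"name": "example.com"}

ooi.write(valid_time=0, tx_time=0, doc=doc)     # OOI created
# A deletion is queued for valid_time=10 but not executed yet.
ooi.write(valid_time=12, tx_time=12, doc=doc)   # affirmation resaves the OOI
ooi.write(valid_time=10, tx_time=19, doc=None)  # queued deletion finally runs

deleted_at_11 = ooi.as_of(11) is None  # the deletion is visible at t=11
alive_at_20 = ooi.as_of(20)            # but the affirmation wins from t=12 on
```

In this model the OOI is gone between t=10 and t=12 and back afterwards, even though the deletion was the last transaction to be written.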
Proposed resolution(s)
Retire Celery
The event manager in Octopoes uses Celery as a worker thread pool. Celery has been a source of issues within Octopoes; see for instance #2171, where the upstream Celery/Billiard issue remains untouched (https://github.com/celery/billiard/issues/399). While Celery has nice features, it seems overkill for our case and a source of delay, in this case accumulating up to 9s. In order to mitigate the behavior, we would like to have a fast thread pool that can work in parallel but does not change the order of creation and deletion events on a similar "inference-spacetimeline", as this violates causality. In addition, we would like Octopoes to be able to query the event queue, so it can block or reject certain findings based on issued deletion events. As far as we know, Celery has no trivial way to query the queue as such. This can all be easily done with a custom thread pool implementation managed by Octopoes, retiring Celery, and thus we propose to do so.
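As a rough illustration of the two requested properties (order preserved per OOI, and a queue that can be queried), here is a minimal sketch. `OrderedEventPool` and its methods are hypothetical names, not existing Octopoes code, and the sketch ignores crash resilience, retries, and persistence.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class OrderedEventPool:
    """Hypothetical sketch: events for one key run in submission order;
    events for different keys may run on different threads."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._queues: dict[str, deque] = {}
        self._active: set[str] = set()

    def submit(self, key: str, kind: str, action: Callable[[], object]) -> None:
        with self._lock:
            self._queues.setdefault(key, deque()).append((kind, action))
            if key not in self._active:  # at most one drainer per key
                self._active.add(key)
                self._pool.submit(self._drain, key)

    def pending(self, key: str) -> list[str]:
        """Kinds of events queued or currently running for this key."""
        with self._lock:
            return [kind for kind, _ in self._queues.get(key, ())]

    def _drain(self, key: str) -> None:
        while True:
            with self._lock:
                queue = self._queues.get(key)
                if not queue:
                    self._active.discard(key)
                    self._queues.pop(key, None)
                    return
                _, action = queue[0]  # peek, so pending() still shows it
            try:
                action()
            except Exception:
                pass  # a real implementation would log and maybe retry
            with self._lock:
                queue.popleft()

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


log: list[str] = []
gate = threading.Event()
pool = OrderedEventPool()
key = "Hostname|example.com"  # made-up OOI reference

pool.submit(key, "delete", lambda: (gate.wait(timeout=5), log.append("delete")))
pool.submit(key, "affirm", lambda: log.append("affirm"))
queued = pool.pending(key)  # the affirmation can see the pending deletion
gate.set()
pool.shutdown()
```

Here the affirmation cannot overtake the deletion for the same key, and a caller can inspect `pending(key)` before deciding to accept or reject an event. Ordering by submission is only one possible reading of "preserving causality"; it does not by itself address writes made for an earlier validTime.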
Validate the model continuously
Similar to a filesystem, we ideally never have any errors, but if errors occur we would like to have the tools to detect them and possibly fix them. Currently we have neither in Octopoes. We propose to implement a low-priority thread that validates the current Octopoes state for (logical) inconsistencies; once they are found, a user can opt to have them fixed automatically where possible, or fix/mitigate the error themselves. Such a tool within Octopoes will make OpenKAT both more reliable and more transparent; additionally, it is an excellent way for an OpenKAT system administrator to file well-documented issues should such errors occur.
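One such consistency check could look roughly like the sketch below. The `Origin` shape, the origin type labels, and the rule itself (an OOI is dangling when nothing other than affirmations backs it) are assumptions made for illustration, not the actual Octopoes model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    origin_type: str  # assumed labels: "declaration", "observation", "affirmation", ...
    source: str
    result: str


def find_affirmation_only(oois: set[str], origins: list[Origin]) -> set[str]:
    """Return OOIs that no origin other than an affirmation yields.

    Assumed rule: such OOIs are dangling. OOIs with no origin at all
    are reported as well.
    """
    backed = {o.result for o in origins if o.origin_type != "affirmation"}
    return {ooi for ooi in oois if ooi not in backed}


oois = {"Hostname|a.example", "Hostname|b.example"}  # made-up references
origins = [
    Origin("declaration", "Hostname|a.example", "Hostname|a.example"),
    Origin("affirmation", "Hostname|a.example", "Hostname|a.example"),
    Origin("affirmation", "Hostname|b.example", "Hostname|b.example"),
]

dangling = find_affirmation_only(oois, origins)
```

A background validator could run checks of this kind periodically and report the offending OOIs together with their histories, which is the information an administrator would need for a well-documented issue.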
OpenKAT version
main