neos / neos-development-collection

The unified repository containing the Neos core packages, used for Neos development.
https://www.neos.io/
GNU General Public License v3.0
260 stars 221 forks source link

Rework CatchUp mechanism #4746

Open bwaidelich opened 10 months ago

bwaidelich commented 10 months ago

The CatchUp mechanism was overhauled with #4289 but the current implementation still has some issues:

By default we use Scripts::executeCommandAsync() (using the SubprocessProjectionCatchUpTrigger) to trigger catch ups. This can lead to many sub requests if there are a lot of commands (e.g. multiple editors creating content simultaneously).

Apart from the eminent performance issue this can bring, those sub requests might be killed if the parent request ends (or due to process limits).

Update: After going back and forth I now suggest:

Considerations

Current architecture

sequenceDiagram
    actor Client
    Client->>CR: handle(cmd)
    activate CR
    CR->>EventPersister: publishEvents()
    activate EventPersister
    EventPersister->>EventStore: commit()
    activate EventStore
    EventStore-->>EventPersister: CommitResult
    deactivate EventStore
    loop
      EventPersister->>CatchUpTrigger: triggerCatchUp
      activate CatchUpTrigger
      rect rgba(200, 200, 200, .1)
        Note right of CatchUpTrigger: separate process
        CatchUpTrigger--)SubProcessCatchupCommandController: catchupCommand()
        activate SubProcessCatchupCommandController
        deactivate CatchUpTrigger
        SubProcessCatchupCommandController->>CR Registry: get()
        CR Registry-->>SubProcessCatchupCommandController: CR
        participant CR2 as CR
        SubProcessCatchupCommandController->>CR2: catchUpProjection()
        deactivate SubProcessCatchupCommandController
      end
    end
    deactivate EventPersister
    EventPersister-->>CR: CommandResult
    CR-->>Client: CommandResult
    deactivate CR
sequenceDiagram
    actor Client
    Client->>CR: catchUpProjection()
    activate CR
    CR->>EventStore: load()
    activate EventStore
    EventStore-->>CR: EventStream
    deactivate EventStore
    CR->>CheckpointStorage: acquireLock()
    activate CheckpointStorage
    CheckpointStorage-->>CR: SequenceNumber
    loop
      CR->>Projection: apply()
      CR->>CheckpointStorage: updateAndReleaseLock()
      CR->>CheckpointStorage: acquireLock()
      CheckpointStorage-->>CR: SequenceNumber
    end
    deactivate CheckpointStorage

Suggested architecture

sequenceDiagram
    actor Client
    Client->>CR: handle(cmd)
    activate CR
    CR->>EventPersister: publishEvents()
    activate EventPersister
    EventPersister->>EventStore: commit()
    activate EventStore
    EventStore-->>EventPersister: CommitResult
    deactivate EventStore
    EventPersister->>EventStore: load()
    activate EventStore
    EventStore-->>EventPersister: EventStream
    deactivate EventStore
    Note over EventPersister: acquire catch up lock
    loop
      EventPersister->>Projection: apply()
    end
    Note over EventPersister: release catch up lock
    deactivate EventPersister

The original ideas I wrote down in November 2023 when creating this issue:

Options

Switch to synchronous

It might be an option to switch to a synchronous catch-up mechanism but that has the great risk that it might lead to code that relies on that immediate consistent behavior – and thus breaks when scaled up to be used asynchronously.

=> Potential optimization for local dev and smaller projects, but dangerous and probably no general solution

Fork process

Instead of creating a fully fledged sub request, we could use PCNTL functions to fork the current process (along the lines of spatie/fork). Apparently this allows the scripts to stay alive even if the parent request was killed – but it works only on the CLI so we would still need at least one coordinating sub request.

=> Potential (optional!) performance improvement, but not a solution on its own

Expose locking state of CheckpointStorage and prevent needless catch-ups

The CheckpointStorageInterface provides means to acquire an exclusive lock in order to prevent events from being applied multiple times to the same subscriber. If we could expose the locking state, we could skip catch-ups that are already being processed.

Some considerations

How to expose lock state?

To ensure that events are only applied once, we use an exclusive write lock on the *_checkpoints tables (see DoctrineCheckpointStorage. AFAIK it is no possible to determine the lock state though without acquiring it..

Maybe we can greatly simplify the locking mechanism by using a kind of 2-phase-commit like Laravel does (see DatabaseLock for example).

"Queue" CatchUps

It won't suffice to skip a catch-up if it's already running because the events might have been read already. Instead the lock should consist of three states:

  • not acquired (= no catch-up in progress)
    • if this is the state, the catch-up should be invoked
  • acquired (= catch-up in progress)
    • if this is the state, the catch-up should be queued and invoked as soon as the lock is released
  • acquired & queued (= catch-up in progress, and queued for re-run)
    • if this is the state, we can safely skip the catch-up for the projection in question because it will be caught up anyways

Compare checkpoint sequence number

In some cases we could compare CommitResult::highestCommittedSequenceNumber with CheckpointStorageInterface::getHighestAppliedSequenceNumber() to skip catch-ups if the projection is already up-to-date. But this has to be handled with care because other commits might have been skipped under the assumption, that the catch-up is still queued!

Related: #4388

bwaidelich commented 10 months ago

A sequence diagram of the current flow for

ContentRepository::handle():

ESCR_catchup_01

and ContentRepository::catchUpProjection() that is triggered in a separate process:

ESCR_catchup_02

robertlemke commented 10 months ago

Thanks for the write-up! I didn't think your suggestion through yet, but wanted to leave one thought already. Using the fork-approach might be tricky, as I assume that most hosting providers will now have the pctnl extension enabled. So, if that's a hard requirements, people will have trouble using Neos in at a vanilla hosting provider.

kitsunet commented 10 months ago

Thanks for the write-up! I didn't think your suggestion through yet, but wanted to leave one thought already. Using the fork-approach might be tricky, as I assume that most hosting providers will now have the pctnl extension enabled. So, if that's a hard requirements, people will have trouble using Neos in at a vanilla hosting provider.

Funnily we do have it :D But yes, this is a separate issue and in the end doesn't really fix much for now, so we can probably think about it later.

bwaidelich commented 6 months ago

From todays Weekly: