prooph / pdo-event-store

PDO implementation of ProophEventStore http://getprooph.org
BSD 3-Clause "New" or "Revised" License
111 stars 56 forks source link

Calling stop() can cause projections to loose stream positions #233

Open fritz-gerneth opened 3 years ago

fritz-gerneth commented 3 years ago

I have observed this issue over some time, but had a hard time reproducing it. In rare instances during deployments, projections could loose their stream positions and start from the beginning. This can happen in instances where a pcntl stop handler invokes a projection's stop() method before the projection has loaded the stream positions.

The problematic section is in the run() method, before the actual main loop:

    // Initial State is ProjectionStatus::IDLE() 
   // $this->streamPositions is an empty array

        if (! $this->projectionExists()) {
            $this->createProjection();
        }

        $this->acquireLock();

        if (! $this->readModel->isInitialized()) {
            $this->readModel->init();
        }

        $this->prepareStreamPositions();
        $this->load();                            // << Only here $this->streamPositions will be initialized

Calling stop() from the signal handler before load() loaded the stream positions, causes a call to persist((). The persist call saves the uninitialized stream positions (an empty array). This issue becomes more frequent / likely the longer it takes the projection to load the stream positions. Putting a sleep() call after the acquireLock() call in the example above makes it rather easy to reproduce (we have replaced the locking mechanism with metadata locks on a 60 second timeout, hence more likely for us to run into this issue). The current workaround for us seems to be to introduce a new flag $this->streamPositionsLoaded and have an early return in persist() if positions have not been loaded yet.

    private function persist(): void
    {
        if (!$this->streamPositionsLoaded) {
            return;
        }

        $this->readModel->persist();
 // ... 

We are running this in testing at the moment to see if this actually fixes the issue. Any other input / feedback / solutions are welcome though.

prolic commented 3 years ago

Sounds like a valid solution. Can you submit a pull request please?

fritz-gerneth commented 3 years ago

Will so once I have had this in production for a while to verify.

Sent from Ninehttp://www.9folders.com/


Von: Sascha-Oliver Prolic @.***> Gesendet: Dienstag, 28. September 2021 23:39 An: prooph/pdo-event-store Cc: Fritz Gerneth; Author Betreff: Re: [prooph/pdo-event-store] Calling stop() can cause projections to loose stream positions (#233)

Sounds like a valid solution. Can you submit a pull request please?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/prooph/pdo-event-store/issues/233#issuecomment-929645471, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAJ4DCYOEESZW4PXEWUVQUDUEIY2BANCNFSM5E6AXKBQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

fritz-gerneth commented 3 years ago

In an additional thought, I think it is also possible to delete/reset/.. the projection before having acquired the lock for it. This can happen if the locking projection (1) is sleeping, status is set to RESETTING, and another instance of the projection (2) is started. Before we acquire the lock in projection2, we reset the projection and set the state to IDLE. When projection 1 wakes up again it checks for state, and does continues to run without knowing it had been resetted.

Projection 1          CLI            Projection 2
  RUNNING
     |
   (sleep)           RESET
     |                                Start Projection
     |                                   |
     |                                Check remote status
     |                                   |
     |                                 Reset, set remote status to IDLE
     |                                   |
     |                              Try to acquire lock (and fail)
     |                                   |
    (wakeup)
     |
 Check remote status (IDLE)
     | 
  Continue to run with previous 
   stream positions

The problematic section is this block here, where we do writing operations before we acquired a lock. I suggest to remove this block without replacement, and move the same block in the inner loop at the top of the loop (so resets/deletes after projection start happen before we start applying events).