Closed by marcospereira 4 years ago
I think the problem is that `CoordinatedShutdown` closes the `ProjectionWorker` after the thread pool for Slick tasks has been closed. Shutting down the `ProjectionWorker` too late allows the `queryByTag` stream to keep polling the DB (even if it was closed already).
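The suspected failure mode can be reproduced in miniature with plain JDK executors. This is only a sketch: `dbPool` is a hypothetical stand-in for the pool that runs Slick tasks, and `query` stands in for one `queryByTag` poll. Neither is Slick's or akka-persistence-jdbc's actual code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

final class LatePollSketch {

    // Returns true if a poll submitted after the pool was closed is rejected.
    static boolean pollAfterClose() {
        // Hypothetical stand-in for the thread pool that runs database tasks.
        ExecutorService dbPool = Executors.newFixedThreadPool(1);
        // Hypothetical stand-in for a single queryByTag poll.
        Callable<String> query = () -> "rows";

        dbPool.shutdown(); // the pool is closed first ...
        try {
            dbPool.submit(query); // ... and the still-running poller submits afterwards
            return false;
        } catch (RejectedExecutionException e) {
            return true; // a closed pool refuses new work
        }
    }

    public static void main(String[] args) {
        System.out.println("late poll rejected: " + pollAfterClose());
    }
}
```

The point of the sketch is the ordering: whichever component keeps submitting work has to stop before the pool it submits to is closed.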
What I'm seeing is that the `queryByTag` stream in the `ProjectionWorker` doesn't seem to be killed until the `cluster-exiting` phase, where cluster singletons are killed. I think the `ProjectionWorker` simply gets a `Stopped$` message and that's not handled in a nice enough way (?).
~The quick solution is to register a `CoordinatedShutdown` task to kill the stream early so that when the `cluster-exiting` phase arrives all the Slick backlog has completed. But I'm not 100% sure that's correct either. I suspect this issue is not only affecting shutdown but also any attempt to stop the `ProjectionWorker` during normal use, in which case killing the stream early is not enough.~
EDIT: I need to investigate a bit more and read the logs and the stack trace again. I think I got this backwards.
Hum, so the `cluster-exiting` phase happens before `actor-system-terminate`, which is where the dispatchers are terminated as well (unless we are talking about a custom internal dispatcher and not one configured/provided by Akka): https://doc.akka.io/docs/akka/snapshot/coordinated-shutdown.html
I wonder if for tests we are misconfiguring `terminate-actor-system` and `run-by-actor-system-terminate`. 🤔
> unless we are talking about a custom internal dispatcher and not one configured/provided by Akka
Yes, we are:
And:
Where do we shut down Slick? Does it still depend on `ApplicationLifecycle`? If so, in which phase of CS is `ApplicationLifecycle.stop` running? Stopping my (out of curiosity) investigation for now.
> Where do we shut down Slick? Does it still depend on `ApplicationLifecycle`? If so, in which phase of CS is `ApplicationLifecycle.stop` running? Stopping my (out of curiosity) investigation for now.
`ApplicationLifecycle` stop hooks are invoked in phase `service-stop`.
I'm not sure question 1 ("Where do we shut down Slick") is the right one though. The offending bit is a `queryByTag` implemented by `akka-persistence-jdbc` (aka APJDBC), and I must double check whether APJDBC uses the `DatabaseDef` provided by Lagom's `SlickDbProvider`.
I just realised there are two `Database`s in play (no pun intended) on a `ReadSideActor`, and it could be only one of them causing issues:
- the one used for `queryByTag`
- the one used for the `OffsetStore` and user tables

I think there are some issues in Lagom:
- Closing the database in `ApplicationLifecycle` stop hooks (which run on the `service-stop` phase of `CoordinatedShutdown`) is wrong. Being part of the infrastructure, anything related to accessing the database should be closed later.
- When no `DatabaseDef` is provided and `slick.db` settings are used to create one instead, the instance is not closed.

I think both are legit issues.
After fixing those, I think we should also add a shutdown task for every projection worker so that it stops during the `service-stop` phase of `CoordinatedShutdown`. The reasoning is that a projection worker is producing traffic across the service (by means of polling the DB).
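To make the intended ordering concrete, here is a toy model of a phased shutdown in plain Java. It is not Akka's `CoordinatedShutdown` API: the class `PhasedShutdownSketch` and its `addTask`/`run` methods are hypothetical, and closing the database in `actor-system-terminate` is an assumption made only for this illustration (the thread only says it should happen later than `service-stop`).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

final class PhasedShutdownSketch {

    // Phases run in the order they were declared; tasks run in registration order.
    private final Map<String, List<Runnable>> phases = new LinkedHashMap<>();

    PhasedShutdownSketch(String... phaseNames) {
        for (String name : phaseNames) {
            phases.put(name, new ArrayList<>());
        }
    }

    void addTask(String phase, Runnable task) {
        List<Runnable> tasks = phases.get(phase);
        if (tasks == null) {
            throw new IllegalArgumentException("unknown phase: " + phase);
        }
        tasks.add(task);
    }

    void run() {
        phases.values().forEach(tasks -> tasks.forEach(Runnable::run));
    }

    // Registers one task per concern and returns the order in which things happened.
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        AtomicBoolean workerPolling = new AtomicBoolean(true);
        PhasedShutdownSketch shutdown =
            new PhasedShutdownSketch("service-stop", "cluster-exiting", "actor-system-terminate");

        // Registered first on purpose: the phase, not the registration order, decides when it runs.
        shutdown.addTask("actor-system-terminate", () ->
            events.add(workerPolling.get() ? "db closed while worker still polling" : "db closed"));
        shutdown.addTask("cluster-exiting", () -> events.add("cluster singletons stopped"));
        shutdown.addTask("service-stop", () -> {
            workerPolling.set(false); // the projection worker stops polling here
            events.add("projection worker stopped");
        });

        shutdown.run();
        return events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

If the worker's stop task were registered in `cluster-exiting` and the database were closed in `service-stop`, the same model would report the database closing while the worker is still polling, which matches the failure described earlier in the thread.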
So putting it all together:
- `service-stop` finishes all ongoing work, including the projection streams (or even the worker itself)
- `cluster-stop` shuts down anything related to the cluster (including `ShardRegion`, sharded entities, etc.)

Oh, great. So after some cleanup for Slick shutdown, the error now is:
2019-11-06 14:50:31,084 DEBUG akka.actor.CoordinatedShutdown - Performing phase [before-cluster-shutdown] with [0] tasks
2019-11-06 14:50:31,085 DEBUG akka.actor.CoordinatedShutdown - Performing phase [cluster-sharding-shutdown-region] with [2] tasks.
2019-11-06 14:50:31,085 DEBUG akka.actor.CoordinatedShutdown - Performing task [region-shutdown] in CoordinatedShutdown phase [cluster-sharding-shutdown-region]
2019-11-06 14:50:31,107 DEBUG akka.cluster.sharding.ShardRegion - ShoppingCartReportProcessor: Starting graceful shutdown of region and all its shards
2019-11-06 14:50:31,108 WARN akka.stream.scaladsl.RestartWithBackoffSource - Restarting graph due to failure. stack_trace:
java.util.concurrent.CompletionException: java.lang.IllegalStateException: EntityManagerFactory is closed
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
java.util.concurrent.CompletionException: java.lang.IllegalStateException: EntityManagerFactory is closed
🙄🤦🏼‍♂️
For example:
https://travis-ci.com/lagom/lagom-samples/jobs/251610682#L1512-L1586
This happens for the test using `TestServer` with `defaultSetup().withJdbc()`: