This is a benign issue; a client-side retry works as a workaround.
When a client (belonging to a JS-enabled account) connects to the meta leader of a 3-member cluster
and immediately creates a stream, the following race condition
between jsStreamCreateRequest and processStreamAssignment is possible.
Connecting directly to the meta leader increases the odds of the race, but is not required in a cluster of >=3 members.
Each follower learns about the account via the route code.
Because this happens through the route, the accounts are set up as expired and incomplete to avoid a deadlock.
This deadlock avoidance applies at least to the nats-resolver.
This situation is the most likely trigger, but anything that leaves the account marked as expired will do.
Because the account is marked as expired, the stream create message (received by all members of the cluster)
invokes jsStreamCreateRequest, which triggers the actual setup of the account (updateAccountClaimsWithRefresh is invoked).
Because the account is re-locked (unlocked and locked again) in this function (as well as in its callers), the replicated stream assignment can be processed concurrently.
Sometimes processStreamAssignment then finds a partially set up account where JetStream is not enabled, and returns an error.
The traces below show, for one particular account:
goroutine 13532 registering the account
goroutine 13446 in jsStreamCreateRequest starting to update the account in updateAccountClaimsWithRefresh
goroutine 13594 in processStreamAssignment finding an incomplete account (claimJWT is empty) and generating an error in lookupStream
goroutine 13446 in jsStreamCreateRequest finishing the account update in updateAccountClaimsWithRefresh
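The interleaving above can be modelled in a few lines of Go. This is only a sketch: the `account` type, its fields, and the `update`/`lookupStream` methods are hypothetical stand-ins, not the real nats-server code, and the `inGap` hook forces the unlucky ordering that in the server happens only occasionally.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// account is a hypothetical stand-in for the server's account type.
type account struct {
	mu       sync.Mutex
	claimJWT string
	js       bool
}

var errNotEnabled = errors.New("jetstream not enabled for account")

// update mimics an account refresh that drops the lock part-way through
// and takes it again before finishing. inGap runs while the lock is released.
func (a *account) update(jwt string, inGap func()) {
	a.mu.Lock()
	// First phase of the refresh; the account is still incomplete here.
	a.mu.Unlock()

	inGap() // window in which other goroutines can observe the account

	a.mu.Lock()
	a.claimJWT = jwt
	a.js = true
	a.mu.Unlock()
}

// lookupStream fails while the account is incomplete (empty claimJWT).
func (a *account) lookupStream() error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.claimJWT == "" || !a.js {
		return errNotEnabled
	}
	return nil
}

func main() {
	acc := &account{} // registered via the route: expired and incomplete

	var during error
	acc.update("jwt", func() {
		// A replicated stream assignment processed on another goroutine.
		done := make(chan struct{})
		go func() {
			during = acc.lookupStream()
			close(done)
		}()
		<-done
	})

	fmt.Println("during update:", during)
	fmt.Println("after update:", acc.lookupStream())
}
```

The lookup inside the gap fails, while the same lookup after the update finishes succeeds, which matches the order of the four goroutine events listed above.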
In my local unit test runs this happened rarely (2 in 100).
The test fails because the stream cannot be created, as JetStream is not enabled in the account.
I altered the test to retry immediately, and then it always passes.
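A client-side retry along these lines could look roughly like the sketch below. It is not the actual test change: `retryCreate` and `errJSNotEnabled` are hypothetical names, and the `create` callback stands in for whatever call the client uses to add the stream.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errJSNotEnabled is a hypothetical stand-in for the "JetStream not enabled
// for account" error a client sees when it hits the race.
var errJSNotEnabled = errors.New("jetstream not enabled for account")

// retryCreate calls create up to attempts times, sleeping wait between tries.
// Only errJSNotEnabled is retried; any other error is returned immediately.
// It returns the number of attempts made and the last error.
func retryCreate(attempts int, wait time.Duration, create func() error) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = create(); err == nil {
			return i, nil
		}
		if !errors.Is(err, errJSNotEnabled) {
			return i, err
		}
		if i < attempts {
			time.Sleep(wait)
		}
	}
	return attempts, err
}

func main() {
	calls := 0
	// Simulated stream creation: fails once, as in the race, then succeeds.
	create := func() error {
		calls++
		if calls == 1 {
			return errJSNotEnabled
		}
		return nil
	}

	n, err := retryCreate(3, 10*time.Millisecond, create)
	fmt.Println(n, err) // prints "2 <nil>"
}
```

Retrying only on the one specific error keeps genuine failures (bad config, permissions) visible instead of masking them behind repeated attempts.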
Hence this is not a severe issue, but one to fix at some point.
I imagine a specific workaround could be: if not meta leader, avoid getRequestInfo and the resulting account lookup inside jsStreamCreateRequest.
This would probably have to be applied everywhere this pattern occurs.
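The shape of that workaround might be sketched as follows. Everything here is hypothetical (the `node` type, `lookupAccount`, `handleStreamCreate`); it only illustrates checking leadership before the account lookup, and assumes followers have no other need for the lookup.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a cluster member; none of these
// names are the real nats-server API.
type node struct {
	metaLeader bool
	lookups    int // counts account lookups, i.e. potential account refreshes
}

// lookupAccount stands in for the lookup that can trigger the account refresh.
func (n *node) lookupAccount(name string) string {
	n.lookups++
	return name
}

// handleStreamCreate checks leadership before touching the account, so a
// follower never triggers the refresh. It reports whether the request was handled.
func (n *node) handleStreamCreate(accName string) bool {
	if !n.metaLeader {
		return false // follower: skip the lookup entirely
	}
	_ = n.lookupAccount(accName)
	return true
}

func main() {
	follower := &node{}
	leader := &node{metaLeader: true}

	fmt.Println(follower.handleStreamCreate("A"), follower.lookups) // prints "false 0"
	fmt.Println(leader.handleStreamCreate("A"), leader.lookups)     // prints "true 1"
}
```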
Furthermore, I noticed that in a few locations the error codes returned are incorrect:
NewJSNotEnabledForAccountError() is used where NewJSNotEnabledError() would be the correct choice
(see code below).
Traces from unit test where this was discovered:
Debugging code and location of wrong error.