microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License
3.03k stars 401 forks source link

Could not rely on StateManagerChanged notification to build in memory projection of Reliable Collection #669

Open pmgfrederico opened 5 years ago

pmgfrederico commented 5 years ago

Dear sirs,

Is the Reliable Service notifications documentation up to date with regards to the features' expected behavior?

The use case I am after is about having a simple in memory repository that is projected from an existing reliable collection. I was expecting to upgrade the application and that with nodes coming down and up again Rebuild notifications would allow me to hydrate the repository seamlessly. The thing is that on the Local Cluster debugging with AutoUpgrade I more or less get the expected behavior but not on a remote production cluster (only being signaled the NotifyStateManagerChangedAction.Rebuild but never the NotifyStateManagerChangedAction.Add where all bindings take place).

Other findings are, Rebuild notifications don't ever have any ReliableStates sent with their NotifyStateManagerChangedEventArgs. Even when the RebuildNotificationAsyncCallback is called upon will the ReliableStates ever be shared via the notification's arguments. In fact it seems to trigger the NotifyStateManagerChangedAction.Add action and then run through the DictionaryChanged notifications as if it was the first seed of the data. This is very hard to follow, when the API suggests a behavior that doesn't come true.

My question is, can I rely on Notifications for the aforementioned use case? Any ideas why the StateManager isn't behaving as expected?

Any help will be most appreciated.

Regards, PF

Microsoft.ServiceFabric.Services.* 3.2.162 Fabric Runtime 6.2.283.9494

masnider commented 5 years ago

You can rely and what you have described is the expected behavior. There's a basic sample here but I don't know that it is complete for the scenarios you're trying to cover. Can you share some snippets of your code showing how you're setting up the notifications and then where the inforamtion you're getting back isn't correct?

pmgfrederico commented 5 years ago

Hi.

Thanks for taking the time to investigate.

I've created a gist with the implementation I'm using and it feels pretty much the same as the one you have shared on the aforementioned sample.

Just like in the sample I am subscribing the StateManagerChanged event on the StatefulService ctor and in development, as I previously mentioned, I kinda get the expected behavior. The one scenario that puzzles me is that when deploying to production (where the ReliabeDictionary I'm sourcing this projection from already exists and can't be wiped out) the StateManager won't run through the proper Dictionary Notifications after Rebuild - which I expect to happen when replicas are upgraded and need to reinstante the service's state.

(The service is running on a single partition deployment with 3 instances.)

Can I share any other info that might help address the problem at hand?

Could there be a thread I could pursue like intercepting/logging life-cycle events either from Service or ReliableState that are pre-requisites for the Notifications to be properly triggered?

Regards, PF

zuhairp commented 5 years ago

You'll need to handle state manager rebuild as well, at which point you can add rebuild handlers for existing dictionaries. The documentation only briefly mentions it, and it's missing from the sample as well, so we'll need to address that documentation gap.

Anyways, you'll add this to your StateManager_StateManagerChanged method:

case NotifyStateManagerChangedAction.Rebuild:
    {
        var operation = e as NotifyStateManagerRebuildEventArgs;
        var reliableStatesEnumerator = operation.ReliableStates.GetAsyncEnumerator();
        while (await reliableStatesEnumerator.MoveNextAsync(CancellationToken.None))
        {
            IReliableState reliableState = reliableStatesEnumerator.Current;
            // TODO: Add dictionary rebuild handler to reliableState as necessary
        }

        break;
    }

Please let us know if this helps. Thanks!

pmgfrederico commented 5 years ago

Hi @zuhairp

Thanks for diving in.

I had similar code to the one you are suggesting (sadly the gist no longer represents that). That is the case I have described in the first comment, the ReliableStates won't ever return any state. Also await(ing) on a void EventHandler...!?

Am I misreading something here?

zuhairp commented 5 years ago

Depending on the state of the system, you may get some or none of the ReliableStates in the rebuild notification, and the rest as individual SingleEntityChanged events. Just because it comes as an Add event does not mean that it is a newly created ReliableState, all the data you added to the state should still be there. Is that what you are seeing?

pmgfrederico commented 5 years ago

Sadly that's not what I am seeing.

When I upgrade the application on the Cluster the in-memory repository never gets re-hydrated despite the ReliableDictionary having its state consistent. (I got this working by removing the app on a Development environment but as you can imagine I can't rely on that on a Production environment so I am also inclined to some state peculiarity different between environments but I have no idea if I can somehow audit the system to try drawing any conclusions).

With regards to your previous statement I tried to locally trace StateManagerChanged events by bringing nodes up and down on the 5 node cluster and what I have witnessed is that even when a Primary Replica sees its StateManager Rebuild event triggered (e.g. restarting a node to force role change or via powershell role change command) I can't enumerate across any of the ReliableStates that I am expecting to find.

Basically the solution works when I see a Rebuild followed by an Add action, but it doesn't when I see only a Rebuild action and nothing else (thus no Dictionary notifications are subscribed). I am tracing this behavior using application insights on a time span neighborhood of deploying an application upgrade on a remote cluster.

(Almost giving up on this approach towards a persisted secondary index like collection, but I'll honor your suggestion with a final upgrade to validate if I'm missing any scenario when Debugging)

With regards to my async void comment (I understand that event handlers are basically the only scenario where that is tolerated, nevertheless if gives me the heebie-jeebies) is there any downside/unexpected behavior (besides the obvious) of blocking over the enumerator?

zuhairp commented 5 years ago

Sorry for the delay in my response, I have been away. I would like to get to the bottom of this. Would you be able to open a support ticket through Azure portal for your production cluster? That way I can take a look through our SF traces to see what can be going on.

juho-hanhimaki commented 5 years ago

Would be great to get complete sample of using notifications (including rebuild and all possible other events that need to be handled). Documentation and samples are not great.