Support for Reliable Collection metrics

microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.

https://docs.microsoft.com/en-us/azure/service-fabric/

MIT License


Support for Reliable Collection metrics #683

Open MedAnd opened 6 years ago

MedAnd commented 6 years ago

This feature would be greatly beneficial for capacity planning, operations, performance and scalability. Please add support for Reliable Collection metrics.

For example operational teams would greatly benefit by knowing:

Count of keys in a Reliable Collection (per partition / total)
Memory and disk usage (per partition / total)
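
To illustrate the "per partition / total" split, here is a minimal Python sketch that rolls per-partition samples up into collection-wide totals. The `PartitionMetrics` type, the `roll_up` helper, the partition IDs and all numbers are made up for illustration; none of them are Service Fabric APIs or real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionMetrics:
    """Hypothetical per-partition sample; not a Service Fabric type."""
    partition_id: str
    key_count: int
    memory_bytes: int
    disk_bytes: int


def roll_up(samples):
    """Sum per-partition samples into collection-wide totals."""
    return {
        "key_count": sum(s.key_count for s in samples),
        "memory_bytes": sum(s.memory_bytes for s in samples),
        "disk_bytes": sum(s.disk_bytes for s in samples),
    }


# Made-up numbers for illustration only.
samples = [
    PartitionMetrics("partition-0", 1200, 4_000_000, 9_000_000),
    PartitionMetrics("partition-1", 800, 3_000_000, 6_000_000),
]
totals = roll_up(samples)
```

A dashboard could show each `PartitionMetrics` row alongside `totals`, which is the kind of view the two bullets above are asking for.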

This feature will support among other use cases:

Operational dashboards, monitoring & alerting
Custom scalability based on these metrics
Capacity planning

etc

Adding support via the Service Fabric REST APIs would mean PowerShell, Service Fabric Mesh and Service Fabric Explorer could also take advantage of this functionality.

Prototype reference work from @vturecek : metric-reliable-collections from 2016 ☺️

preethasubbarayalu commented 6 years ago

Hi MedAnd, these numbers should be available in our next release through perf counters. ETA: the next couple of weeks.

Thanks Preetha

MedAnd commented 6 years ago

Hi @preethasubbarayalu, great news, and I really appreciate the team's turnaround! I have a few questions; should I ask here or via email etc? Thx.

preethasubbarayalu commented 6 years ago

Hi @MedAnd, either is fine. If it requires investigation and drilling into our traces, filing a support incident will be faster.

Thanks Preetha

MedAnd commented 5 years ago

Hi @preethasubbarayalu, I was just checking the 6.4 release notes and cannot see any mention of the above. Could you please confirm this has been implemented and documented? Thx.

preethasubbarayalu commented 5 years ago

The link below lists the counters added in 6.4:

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-diagnostics#tstore-performance-counters

MedAnd commented 5 years ago

@preethasubbarayalu @masnider - when scaling a cluster down (for example from 6 nodes to 5), if the replica set size for stateful services is kept at 5 instead of also being reduced to 3, there seems to be a correlated, drastic increase in the volume of metrics emitted. Not sure if this is a bug?

Support for node level rolled-up (aggregated) metrics

JohnNilsson commented 4 years ago

Are these metrics still available when running .NET Core service hosts?