microsoft / garnet

Garnet is a remote cache-store from Microsoft Research that offers strong performance (throughput and latency), scalability, storage, recovery, cluster sharding, key migration, and replication features. Garnet can work with existing Redis clients.
https://microsoft.github.io/garnet/
MIT License
10.44k stars 530 forks source link

Integrate OpenTelemetry to enhance observability #379

Open springmin opened 6 months ago

springmin commented 6 months ago

Feature request type

enhancement

Is your feature request related to a problem? Please describe

integrate OpenTelemetry or other trance component to enhance observability。

Describe the solution you'd like

integrate OpenTelemetry or other trance component to enhance observability。

Describe alternatives you've considered

No response

Additional context

No response

PaulusParssinen commented 6 months ago

I've been thinking about this for a while (and implementing the server using Generic Host in general).

By using the new System.Diagnostics.Metrics APIs (Counter<T>, UpDownCounter<T>, Histogram<T> etc. ..) would allow us to delete HdrHistogram fork altogether from the repo and simplify our metrics/monitoring logic a lot. The places where we pull the metrics inside the server (LATENCY etc?) we should be able to use simple MeterListener to aggregate them). I'm not sure if Histogram<T> covers everything that HdrHistogram offers and someone who knows statistics better than me should weigh in on that (https://github.com/dotnet/runtime/issues/63650 seems to be useful addition).

We should follow the existing OTel semantic conventions for Redis but of course modify/extend it if needed. Also configurations knob whether to record db.statement or not, as I believe there is currently no redaction capabilities in Garnet for potentially sensitive parts of the command.

Measuring anything should become a lot easier, but we should follow how ASP.NET & System.Net guards their metrics logic to keep the overhead next to none IF the metrics are disabled (Instrument.Enabled). Should the metrics codepaths should be entirely eliminated when Garnet is built with <MetricsSupport>false</MetricsSupport> given that would also affect the behavior of the RESP commands that rely on it, if they are implemented using MeterListener?

We should also measure if any extra contention (IIRC currently Garnet's counters are done by atomic Interlocked.Increments) for normal use-cases. We should keep eye on how big difference there is in RPS with, for example, dotnet-counters hooked.

Some existing tracking issues for S.D.Metrics performance concerns https://github.com/dotnet/aspnetcore/issues/50412, https://github.com/dotnet/runtime/issues/71563 to keep eye on too.

(I've been thinking about this for quite bit because I'm planning to write Crank infrastructure & scenarios to bench Garnet, someday.)

badrishc commented 6 months ago

It would be a great idea to support OpenTelemetry on the server side, and contributions in this space are highly encouraged! Not only is this generally useful, but when used with .NET Aspire it will make the metrics automatically available. Doing this seems to basically require some glue code to feed data from our current Metrics API endpoint to OpenTelemetry:

https://github.com/microsoft/garnet/blob/8856dc3990fb0863141cb902bbf64c13202d5f85/libs/host/GarnetServer.cs#L55

The potential migration to System.Diagnostics.Metrics is orthogonal though. It needs to be considered with care:

We should also measure if any extra contention (IIRC currently Garnet's counters are done by atomic Interlocked.Increments)

HdrHistogram is very efficient, and was better than the Metrics API last we checked. To avoid contention, we keep per-session metrics so that they can be updated without requiring atomics such as Interlocked.Increment. In the critical path, it is just a simple add on the session: https://github.com/microsoft/garnet/blob/8856dc3990fb0863141cb902bbf64c13202d5f85/metrics/HdrHistogram/LongHistogram.cs#L153-L154

Then we aggregate periodically (or on-demand) across all active and inactive sessions to perform the metrics computation. HdrHistogram is really efficient on the record metrics path and does the expensive percentile computations off the critical path.

Meir017 commented 3 months ago

the project is being compiled to both net6 & net8, the new metrics API's exist starting with net8

badrishc commented 3 months ago

We are removing support for .NET 6 here: https://github.com/microsoft/garnet/pull/580

Hope to see a contribution in this space!

rektide commented 2 months ago

.net6 support now removed, in #582