Adding Instrumentation - Githubissues

Hey,

I'm interested in adding Instrumentation (primarily metrics for now) to track some lock stats. The metrics I'm trying to add at the moment are:

Total number of locks attempted to be acquired.
Current number of held (active) locks.
Total number of lock acquisition failures/timeouts.
How long it takes to acquire locks.
How long locks are held for.

Doing so would require adding System.Diagnostics.DiagnosticSource as a package dependency (which I don't see as being an issue, but let me know otherwise).

I had a crack at adding the instrumentation deep down in DistributedLock.Core as a static class:

Then calling it from DistributedLockHelpers in the TryAcquireAsync method:

However, there's some issues with this, such as metrics being skipped if a lock is manually created outside of a provider. Another thing I couldn't quite figure out would be how to cleanly determine when a lock is disposed at this point, as the CancellationToken on the lock handle isn't a good way to determine when a lock has been disposed.

A nice thing to have would be developer configurable Instrument names, instead of them being hard-coded. MassTransit achieves this doing the following:

An extension method to enable instrumentation.
This extension method allows the developer to pass in a configurator to change the default Instrument names as defined in this class.
The actual tracking of metrics then happens in this static class.

If attempting to do something similar in this library I feel this would need to be exposed in each of the implementation specific packages at the lock constructor level, similar to the connection builders.

Have you got any ideas of how (or if I should) to go about adding this functionality. Also happy to write up docs of how to collect the metrics via OpenTelemetry after I and/or anyone else implements this functionality.

Some relevant docs on Instrumentation: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation

Cheers.

@joshdixon thanks for filing. This seems like a cool idea and something I'd be supportive of adding if we can find a good design. Therefore, I think it makes sense to start by fleshing out the design side. If you're excited about contributing it that would be great!

Some general questions/thoughts I have to kick off the discussion:

Some questions on the proposed metrics
1. For "Total number of lock acquisition failures/timeouts.", should this include TryAcquire operations? What if they have a timeout of 0? What about canceled acquires?
2. For "How long it takes to acquire locks.", is this only for successful acquires?
3. For "How long locks are held for.", what exactly does this measure? For example does it include the time required to release the lock?
If we add this, we should add metrics for all of the synchronization primitives (locks, RW locks, and semaphores). Mostly it is the same measures, but for RW locks we probably want to track the different lock levels separately as well as the upgrade operation.
Do we want names to be sharded by implementation or not? On one hand, if you have an application that uses FileDistributedLock for machine-level locking and SqlDistributedLock for global locking, it would be frustrating to have these outputting to the same metric. On the other hand, if you were to use only SqlDistributedLock and then swap it out for RedisDistributedLock maybe you'd want the same metric names so you could monitor the change. I'm leaning towards implementation-specific.
You mentioned allowing developers to override the names. I'm not terribly excited about this for a few reasons, but I'd be interested to understand the use-case more and whether this sort of thing is "standard" among similar libraries:
1. Seems like this would either require "global" configuration which the library doesn't have today or would be prone to inconsistency if people create locks/providers in different ways across the app
2. This would make it more difficult/impossible to add metrics around global state. For example I could see it being useful to have metrics around connection usage for RDBMS connection multiplexing, which would be global (perhaps that's already instrumented in the RDMBS drivers themselves, though?).
3. Probably we'd need some kind of caching layer to avoid recreating the same counters/histograms over and over again as new locks/providers are instantiated.
Should instrumentation be opt-in, opt-out or always-on? If it is configurable, how do users configure it? Lock/provider constructor options, presumably?
I think the implementation needs to be built into the lock handles themselves. E.g. we could capture a timestamp on acquisition and then log the held duration when the handle is disposed. Potentially we could leverage a wrapper or base class pattern where we layer on a common handle that adds this functionality.

As we hash some of these design questions out, we can also try to build towards a concrete API proposal specifying all the new surface area to be added.

Also happy to write up docs of how to collect the metrics via OpenTelemetry after I and/or anyone else implements this functionality.

This would be great!

@madelson, awesome to have you on board!

I tried to get it working with minimal changes prior to making the issue but it doesn't seem like that'll be feasible whilst keeping the code clean and achieving all my goals. So if we're gonna make more significant changes, then yeah working out the best design is a good path to go down.

Regarding the proposed metrics, you make some good points:
- For total number of failures, we could either separate this into two separate metrics for TryAcquire vs Acquire, or we could use a tag on any recorded metric to be able to distinguish the two. I'm leaning towards a separate metric as a failure for Acquire would be more alarming than a failure for TryAcquire.
- For the duration to acquire a lock, I would include failed attempts in here too as I feel you would use the metric as a general overview to determine latency to the lock provider. Again would be good to include whether it failed to acquire in a tag.
- For measuring how long the locks are held for, I would imagine this would be how long the developer's code is holding the lock, plus the length of time it takes to release the lock (as I imagine nothing else could acquire this lock until that process has completed). Therefore, its measuring how long another potential process would have to wait to be able to acquire the same lock. Which would be good to have observability over as a developer so they can try to minimise it.
Yeah would be good to keep it consistent across the board. I don't have much experience using the other kind of primitives though.
I feel like a tag would work best to track what implementation is being used. There's a finite number of possibilities so there's no issue there, and I feel like having instruments per implementation would lead to less maintainable code and make it harder to swap providers as you'd need to edit any dashboards in your observability stack to point to the new names.
Customisable names were just on the wishlist. Happy to drop this requirement to reduce the scope of these changes. I've not seen any other library but MassTransit implement this. The ASP.NET instrumentation doesn't allow you to change the instrument names as far as I'm aware.
It can't be always-on. You don't want to be forcing a slight performance hit on users who won't be using it. Perhaps some things could be always-on such as the timestamp recording, as other areas could make use of this in future. However the actual updating of instruments should either be opt-in, or opt-out. Most of the time its opt-in. I'm not sure where this would be configured though, since as you said there's currently no global state. If you wanted to opt-in, it'd be annoying to have to configure it every time you create a lock. But I guess you should be using a shared provider if you want to use the same settings across all your locks.
A shared base class or wrapper would definitely make sense for this. Could leverage that the handles implement IDisposable and hook into that for some of the metrics. Would just need to make sure its consistently recorded at the same point in time across all implementations, be it before the lock is released, or after.

I mentioned tags a lot, to summarise I think it'd be good to have the following tags:

lock_name: lock name
implementation_name: the name of the implementation used to attempt to acquire the lock
cancelled: whether the acquire process was called
failed: whether the acquire process failed
timeout_duration: the timeout duration specified for the acquire process

I don't have much experience using the other kind of primitives though.

Don't worry there is not a lot of difference; if we can solve for locks we can easily extend to all the primitive types.

The ASP.NET instrumentation doesn't allow you to change the instrument names as far as I'm aware.

Feels like a good enough reason not to do it here either.

I'm not sure where this would be configured though, since as you said there's currently no global state.

I can see at least 3 options:

Add this in a consistent way to the Options builder for each implementation/provider (we can use an internal interface to force consistency)
Use an AppContext switch (global state). I wouldn't go with this unless there is a lot of prior art to suggest it is an idiom.
Make this a fully-external wrapping layer, for example:
```
var provider = new InstrumentedDistributedSynchronizationProvider(new SqlDistributedSynchronizationProvider(...));
```
The latter would be easiest to implement; we wouldn't have to make changes to the individual implementations at all. OTOH, this forces you to work solely through the interfaces rather than coding against specific implementations which is an issue if we add implementation-specific methods in the future. This would also make a bit more difficult to easily turn instrumentation on/off via a configuration setting. Thoughts? Are there other examples we can take from popular libraries?

Could leverage that the handles implement IDisposable and hook into that for some of the metrics.

Yes. You might have seen that most of the implementations follow an "inner handle" pattern where the public handle wraps another handle object. The ones that don't could easily be changed to follow this pattern. Then, the instrumentation handle could just be another (optional) internal wrapping layer so you have PublicHandle(InstrumentationHandle(InnerHandle)).

I mentioned tags a lot

Agreed. I hadn't appreciated how tags make for a really nice solution to this problem. Some thoughts on the ones you've laid out:

I'd suggest Name instead of lock_name (using PascalCase since that's what the .NET examples do) so that we can share this tag with rw lock and semaphore
Instead of ImplementationName maybe Type and it could store something like AzureBlobLeaseDistributedLock?
failed feels kind of unclear; what if we instead had Result and it could be one of Success, Timeout, Canceled, or Faulted (hit an exception)? Should we differentiate between TryAcquire timing out and TryAcquireAsync timing out either within this tag or with another tag? So long as you can distinguish timeouts from other exit conditions I'm not sure it matters.
timeout_duration: Are you imagining a numeric value here? If so, I feel like we should have units (e.g. timeoutMillis). A concern I have is that in some usage patterns this could have a huge range of values (e.g. someone is computing the timeout dynamically). Would that make the data difficult to use? I still don't have a great sense of how these tags affect aggregation.
Semaphores should additionally have a MaxCount tag
RW locks should additionally have a Level tag that can be one of Read, UpgradeableRead, Upgrade, or Write.

madelson / DistributedLock

Adding Instrumentation #137