jstedfast / MailKit

A cross-platform .NET library for IMAP, POP3, and SMTP.
http://www.mimekit.net
MIT License
6.14k stars 815 forks source link

Add Activity/OpenTelemetry support #1499

Closed pengweiqhca closed 3 months ago

pengweiqhca commented 1 year ago

Is it possible to wrap the calling server around the Activity? So that, I can get elapsed time per calling.

jstedfast commented 1 year ago

You could subclass any class in MailKit to add whatever functionality you want.

I don't want to take a dependency on OpenTelemetry because most of my users won't want it.

EugeneKrapivin commented 8 months ago

@jstedfast it doesn't require any external dependency. In-order to implement tracing and metrics in dotnet you use constructs from dotnet, namely System.Diagnostics.Metrics and System.Diagnostics.ActivitySource.

jstedfast commented 8 months ago

@EugeneKrapivin right, but I'm talking about not wanting to add the OpenTelemetry NuGet dependency.

Maybe I'm just misunderstanding the feature request?

EugeneKrapivin commented 8 months ago

@jstedfast To get the library instrumented (tracing, metrics, logs) you don't have to use any external dependencies as the maintainer of this library. The users of your library, if they chose to export those traces, logs and metrics will use OpenTelemetry libraries.

I'm working on a small POC, and actually using MailKit (that's how I wound up here), so I have a handy example:

In a class called SmtpSenderService I have:

public SmtpSenderService(ILogger<SmtpSenderService> logger, IMeterFactory meterFactory)
{
    _logger = logger;

    _meterFactory = meterFactory;
    _meter = _meterFactory.Create("SMTPSender"); // remember this part

    _createClientHistogram = _meter.CreateHistogram<double>("smtp_create_client", "ms");
    _connectClientHistogram = _meter.CreateHistogram<double>("smtp_connect_client", "ms");
    _sendMessageHistogram = _meter.CreateHistogram<double>("smtp_send", "ms");
    _exceptionsCounter = _meter.CreateCounter<long>("smtp_exception", "exception");
}
private static ActivitySource source = new("SmptSenderService", "1.0.0"); // remember this part

ILogger, IMeterFactory, Meter and ActivitySource are part of

using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;

namespaces, which require no external dependencies. In some SendMessage method I have the following code:

if (_client is null)
{                
    var startCreate = Stopwatch.GetTimestamp();

    _client ??= new SmtpClient();

    var startConnect = Stopwatch.GetTimestamp();

    using (var _ = source.StartActivity("Smtp.Connect", ActivityKind.Client))
        await _client.ConnectAsync(config.ServerUrl, int.Parse(config.ServerPort), false, cts.Token);

    using (var _ = source.StartActivity("Smtp.Authenticate", ActivityKind.Client))
        await _client.AuthenticateAsync(config.Username, config.Password, cts.Token);

    var elapsedConnect = Stopwatch.GetElapsedTime(startConnect);
    var elapsedCreate = Stopwatch.GetElapsedTime(startCreate);

    _createClientHistogram.Record(elapsedCreate.TotalMilliseconds);
    _connectClientHistogram.Record(elapsedConnect.TotalMilliseconds, new KeyValuePair<string, object?>("flow", "create"));
}

and

using var sendActivity = source.StartActivity("Smtp.Send", ActivityKind.Client);
try
{
    var r = await _client.SendAsync(message, cts.Token);
    var total = Stopwatch.GetElapsedTime(start);

    sendActivity?.AddTag("status", "success");
    sendActivity?.AddEvent(new ActivityEvent("email sent"));
    _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "success"));
}
catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.RecipientNotAccepted)
{
    _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
    var total = Stopwatch.GetElapsedTime(start);

    _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
    _exceptionsCounter.Add(1,
            new KeyValuePair<string, object?>("exception", ex.GetType().Name),
            new KeyValuePair<string, object?>("flow", "send"));

    sendActivity?.AddTag("status", "failed");
    sendActivity?.AddTag("exception", ex.GetType().Name);
    sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

    return new(false, $"failed with {ex.ErrorCode}");
}
// the catch clauses continue, just to make sure I catch all the exceptions possible and log them properly

As I've said before, none of this code requires any external dependencies. Having this code in MailKit could really help with observability and monitoring for mailkit users.

In a different place in my service (the host part) I have my dependencies on OpenTelemetry:

builder.Services
    .AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing.AddSource("SmptSenderService");
    })
    .WithMetrics(metrics =>
    {
        metrics
            .AddMeter("SMTPSender");
    });

these lines make sure that when traces and metrics are exported, my metrics and traces are also exported. Once exported I have my metrics: image

and some fancy traces: image

and because logs are also enriched when exported: image

Having MailKit instrumentation out of the box would be really helpful in reducing boilerplate code for monitoring the system with traces and quality metrics.

jstedfast commented 8 months ago

Okay, how do you propose adding this to MailKit?

EugeneKrapivin commented 8 months ago

Well I'd suggest:

This is probably a huge ask, and I think it should be opened for discussion to see if it's worth the investment (aka will users of the library use it). For the purposes of my POC it was worth it because it gave me the data I needed. However, it wasn't a clean insert :) . There was a lot of copy-paste engineering on my part to cover all exceptions and operations (Create, Connect, NoOp, Send, Disconnect).

As a trend the dotnet community pushes toward adding instrumentation to libraries (especially IO bound libraries) so that distributed tracing (which is priceless in debugging microservices plane) could be done through OpenTelemetry which is the defacto standard for observability in virtually all languages.

just as an example, my PoC uses: ASP.net minimal APIs, KafkaFlow, Orleans, Azure Tables (for orleans storage) and MailKit. Except MailKit all the others have OOTB Tracing instrumentation, passing the tracing context between ASP.net, kafkaflow, orleans and AzureTables SDK was effortless. All I had to do is add their names to the OpenTelemetry configuration. After wrapping some MailKit calls in activities:

image

which is accompanied by a correlated log:

{
    "body": "failed sending an email with error SenderNotAccepted",
    "traceid": "b62dd4170b6c0b0da30a6dce0c0fcdf5",
    "spanid": "b37a86f71fc2c09f",
    "severity": "Error",
    "flags": 1,
    "attributes": {
        "ParentId": "e5e6b9e3c5dce273",
        "SpanId": "b37a86f71fc2c09f",
        "TraceId": "b62dd4170b6c0b0da30a6dce0c0fcdf5",
        "errorCode": "SenderNotAccepted",
        "exception.message": "Invalid sender user681@mailhog.local",
        "exception.stacktrace": "MailKit.Net.Smtp.SmtpCommandException: Invalid sender user681@mailhog.local\r\n   at MailKit.Net.Smtp.SmtpClient.SendAsync(FormatOptions options, MimeMessage message, MailboxAddress sender, IList`1 recipients, CancellationToken cancellationToken, ITransferProgress progress)\r\n   at MailKit.Net.Smtp.SmtpClient.SendAsync(FormatOptions options, MimeMessage message, MailboxAddress sender, IList`1 recipients, CancellationToken cancellationToken, ITransferProgress progress)\r\n   at Processor.Grains.SmtpSenderService.SendMessage(SmtpConfig config, EmailRequest emailRequest) in C:\\Users\\nocgod\\source\\repos\\EmailServicePoc\\Processor\\Processor.Grains\\SmtpSenderService.cs:line 138",
        "exception.type": "SmtpCommandException",
        "{OriginalFormat}": "failed sending an email with error {errorCode}"
    },
    "resources": {
        "service.instance.id": "243d4bd2-79e4-4ed1-a308-4b8d37bc7665",
        "service.name": "processor.host",
        "telemetry.sdk.language": "dotnet",
        "telemetry.sdk.name": "opentelemetry",
        "telemetry.sdk.version": "1.7.0"
    },
    "instrumentation_scope": {
        "name": "Processor.Grains.SmtpSenderService"
    }
}
jstedfast commented 8 months ago

Well I'd suggest:

  • Metrics - wrapping IO operations in timers (histograms of TimeSpan in my code examples)
    • adding exception/error counters with appropriate tagging

Right, but how should MailKit provide this info to the app?

  • Tracing - wrapping IO operations with Activities and Activity Events.

Again, how should MailKit provide this info? via events? via some new IClientMetrics interface that can be set on any of the clients?

I'm not sure what would make the most sense.

Adding logs using standard logging APIs in dotnet, which ultimately get collected and exported (processes that should not concern you as a library maintainer)

Currently, MailKit does not depend on Microsoft.Extensions.Logging or Microsoft.Extensions.DependencyInjection, so this would be awkward.

I've been thinking about moving to DependencyInjection as an alternative to MimeKit.ParserOptions.Register() and MimeKit.Cryptography.CryptographyContext.Register() (especially with the request to make MimeKit depend on MimeKitLite rather than being more-or-less API duplicates of each other), but so far I haven't gotten around to that or even figured out exactly how I want to do that...

What do those other libraries you are using do? It sounds like they just assume you'll be logging telemetry/metrics/etc and only really leave up the configuration to the developers, but MailKit is in a bit of a different situation because it's not really designed specifically for ASP.NET apps where you can pretty much count on everyone wanting that kind of telemetry logging.

EugeneKrapivin commented 8 months ago

Right, but how should MailKit provide this info to the app?

it should not concern the library. As long as you use System.Diagnostics.Metrics.Meter you and document properly the name of the meter, all that I need as an app developer is the name of the meter. Same goes for tracing, as long as you use activities and properly document the source name, that's enough. The following is code from my PoC:

builder.Services
    .AddOpenTelemetry()
    .WithMetrics(metrics =>
    {
        metrics
            .AddRuntimeInstrumentation()
            .AddMeter("Microsoft.Orleans")
            .AddBuiltInMeters();
    })
    .WithTracing(tracing =>
    {
        if (builder.Environment.IsDevelopment())
        {
            // We want to view all traces in development
            tracing.SetSampler(new AlwaysOnSampler());
        }

        tracing
            .AddSource("Azure.*")
            .AddSource("Microsoft.Orleans.Runtime")
            .AddSource("Microsoft.Orleans.Application")
            .AddSource(KafkaFlowInstrumentation.ActivitySourceName)
            .AddAspNetCoreInstrumentation()
            .AddGrpcClientInstrumentation()
            .AddHttpClientInstrumentation(o => {
                o.FilterHttpRequestMessage = (_) => Activity.Current?.Parent?.Source?.Name != "Azure.Core.Http";
            });
    });

Notice that Microsoft Orleans is added as a string (which they provide in their documentation) while KafkaFlow opted for releasing a little nuget with a public static readonly string to avoid magic strings. This setup code IS dependent on OpenTelemetry, however it should not concern you as a library developer.

Currently, MailKit does not depend on Microsoft.Extensions.Logging or Microsoft.Extensions.DependencyInjection, so this would be awkward.

you could expose a logging adapter and a logging abstraction like KafkaFlow does.

I've been thinking about moving to DependencyInjection as an alternative to MimeKit.ParserOptions.Register() and MimeKit.Cryptography.CryptographyContext.Register() (especially with the request to make MimeKit depend on MimeKitLite rather than being more-or-less API duplicates of each other), but so far I haven't gotten around to that or even figured out exactly how I want to do that...

I guess this will require a bigger refactoring that should probably be out-of-scope for the purposes of observability instrumentation.

What do those other libraries you are using do? It sounds like they just assume you'll be logging telemetry/metrics/etc and only really leave up the configuration to the developers, but MailKit is in a bit of a different situation because it's not really designed specifically for ASP.NET apps where you can pretty much count on everyone wanting that kind of telemetry logging.

for tracing and metrics they use standard dotnet APIs:

which are not dependant on ASP.NET specifically, as you can use them in a plain worker service without dependency on the Web SDKs.

The logging is a design decision you could make:

jstedfast commented 8 months ago

@EugeneKrapivin can you help define which metrics (and other info) you want/need? I'll put this in the queue for a future v5.0 release but first I need to know what specific metrics developers care about having.

(Note: it seems like there is growing demand for a unification of MimeKit and MimeKitLite which is going to require me to work on a v5.0 sooner rather than later to split MimeKit into MimeKit.Core and MimeKit.Cryptography - this seems like the perfect opportunity to add in metrics & tracing data).

EugeneKrapivin commented 7 months ago

@jstedfast sorry for the late reply. I'd add:

I'm adding some code from my POC project in which I've added some metrics, traces and logs around some simple usage of the SMTP client. Please do not take this example as best practice =) I'd suggest taking a look at the best practices from msft documentation :)

// note that the SmtpClient is not thread safe (we shouldn't be using reentrant here)
[StatelessWorker]
public class SmtpSenderService : Grain, ISmptSenderService, IDisposable
{
    private bool _disposedValue;
    private SmtpClient _client;
    private readonly ILogger<SmtpSenderService> _logger;

    private readonly IMeterFactory _meterFactory;

    private Meter _meter;
    private Histogram<double> _createClientHistogram;
    private Histogram<double> _connectClientHistogram;
    private Histogram<double> _sendMessageHistogram;
    private Counter<long> _exceptionsCounter;

    public SmtpSenderService(ILogger<SmtpSenderService> logger, IMeterFactory meterFactory)
    {
        _logger = logger;

        _meterFactory = meterFactory;
        _meter = _meterFactory.Create("SMTPSender");

        _createClientHistogram = _meter.CreateHistogram<double>("smtp_create_client", "ms");
        _connectClientHistogram = _meter.CreateHistogram<double>("smtp_connect_client", "ms");
        _sendMessageHistogram = _meter.CreateHistogram<double>("smtp_send", "ms");
        _exceptionsCounter = _meter.CreateCounter<long>("smtp_exception", "exception");
    }
    private static ActivitySource source = new("SmptSenderService", "1.0.0");

    public async Task<SendResult> SendMessage(SmtpConfig config, EmailRequest emailRequest)
    {
        var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        try
        {
            if (_client is null)
            {                
                var startCreate = Stopwatch.GetTimestamp();

                _client ??= new SmtpClient();

                var startConnect = Stopwatch.GetTimestamp();

                using (var _ = source.StartActivity("Smtp.Connect", ActivityKind.Client))
                    await _client.ConnectAsync(config.ServerUrl, int.Parse(config.ServerPort), false, cts.Token);

                using (var _ = source.StartActivity("Smtp.Authenticate", ActivityKind.Client))
                    await _client.AuthenticateAsync(config.Username, config.Password, cts.Token);

                var elapsedConnect = Stopwatch.GetElapsedTime(startConnect);
                var elapsedCreate = Stopwatch.GetElapsedTime(startCreate);

                _createClientHistogram.Record(elapsedCreate.TotalMilliseconds);
                _connectClientHistogram.Record(elapsedConnect.TotalMilliseconds, new KeyValuePair<string, object?>("flow", "create"));
            }

            var reconnect = false;
            try
            {
                using var _ = source.StartActivity("Smtp.NoOp", ActivityKind.Client);
                await _client.NoOpAsync();
            }
            catch (Exception ex)
            {
                reconnect = true;
            }

            if (!_client.IsConnected || reconnect)
            {
                var startConnect = Stopwatch.GetTimestamp();

                using (var _ = source.StartActivity("Smtp.Connect", ActivityKind.Client))
                    await _client.ConnectAsync(config.ServerUrl, int.Parse(config.ServerPort), false, cts.Token);

                using (var _ = source.StartActivity("Smtp.Authenticate", ActivityKind.Client))
                    await _client.AuthenticateAsync(config.Username, config.Password, cts.Token);

                var elapsedConnect = Stopwatch.GetElapsedTime(startConnect);
                _connectClientHistogram.Record(elapsedConnect.TotalMilliseconds, new KeyValuePair<string, object?>("flow", "create"));
            }

            DelayDeactivation(TimeSpan.FromSeconds(30));
        }
        catch (AuthenticationException ex)
        {
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "connect"));

            _logger.LogError(ex, "failed authenticating with smtp server");
        }
        catch (Exception ex) when (ex is OperationCanceledException || ex is TimeoutException)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to a timeout", emailRequest.MessageId);
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "connect"));
            return new(false, "timeout");
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to an exception", emailRequest.MessageId);
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "connect"));
            return new(false, "exception");
        }

        var message = CreateMimeMessage(config, emailRequest);

        var start = Stopwatch.GetTimestamp();
        using var sendActivity = source.StartActivity("Smtp.Send", ActivityKind.Client);
        try
        {
            var r = await _client.SendAsync(message, cts.Token);

            var total = Stopwatch.GetElapsedTime(start);

            sendActivity?.AddTag("status", "success");
            sendActivity?.AddEvent(new ActivityEvent("email sent"));
            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "success"));
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.RecipientNotAccepted)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.SenderNotAccepted)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.MessageNotAccepted)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.UnexpectedStatusCode)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpProtocolException ex)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to smtp protocol exception", emailRequest.MessageId);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);

            return new(false, ex.Message);
        }
        catch (IOException ex)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to an IO exception", emailRequest.MessageId);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds);
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);

            return new(false, ex.Message);
        }
        catch (Exception ex) when (ex is OperationCanceledException || ex is TimeoutException)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to a timeout", emailRequest.MessageId);
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);

            return new(false, "timeout");
        }
        finally
        {
            sendActivity?.Dispose(); // avoid nesting of Disconnect under the Send activity
            using var _ = source.StartActivity("Smtp.Disconnect", ActivityKind.Client);
            await _client.DisconnectAsync(true, cts.Token);
        }

        return new(true, "sent");
    }

    private static MimeMessage CreateMimeMessage(SmtpConfig config, EmailRequest emailRequest)
    {
        var message = new MimeMessage();
        message.From.Add(new MailboxAddress(config.FromName, config.From));
        message.To.Add(new MailboxAddress(emailRequest.RecipientName, emailRequest.Recipient));
        message.Subject = emailRequest.Subject;

        message.Body = new TextPart("plain")
        {
            Text = emailRequest.Body
        };
        return message;
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!_disposedValue)
        {
            if (disposing)
            {
                _logger.LogWarning("disposing of connection");
                _client?.Dispose();
            }
            _disposedValue = true;
        }
    }

    public void Dispose()
    {
        // Do not change this code. Put cleanup code in 'Dispose(bool disposing)' method
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }
}
jstedfast commented 7 months ago

@EugeneKrapivin Thanks for taking the time to respond and share your ideas. I'll definitely be trying to read up on Microsoft's guides on best practices.

prashanth-nani commented 7 months ago

Thanks, @jstedfast, for considering adding Telemetry support in Mailkit v5.0. All other libraries that we use to make external calls, such as HttpClient for HTTP calls, MassTransit for RabbitMQ, Stackexchange.Redis for Redis calls, and MysqlConnector for MySQL, have added support for telemetry, and we have started instrumenting them. It would be great if we could have this added in Mailkit.

jstedfast commented 7 months ago

Ah, cool, I should look at how HttpClient does stuff as that might make be a close match with MailKit's client APIs.

jstedfast commented 6 months ago

Some docs on dotnet's built-in networking (including HttpClient) metrics: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/built-in-metrics-system-net#metric-httpclientopen_connections

Also seems like this Meter stuff is only in net8.0?

prashanth-nani commented 6 months ago

No. We are running .NET 6 and using Meter for producing custom application metrics. What’s new in .NET 8 is the IMeterFactory interface.

jstedfast commented 6 months ago

A few things I'm not clear about:

  1. A Meter has a name and is used to create Counters and Histograms, which also have names. How do those names inter-relate? If I have a Meter with the name mailkit.net.smtp, does that in any way namespace the Counters and Histograms? Or would I want to name them mailkit.net.smtp.client.send.duration?
  2. Should these Meters, Counters, and Histograms be static on SmtpClient? Or should they be per-instance? If they are per-instance, how does that play out if you've got multiple client instances sending messages?
jstedfast commented 6 months ago

I would still like answers to my questions in the previous comment, but these are the metrics that I've come up with so far (mostly focused on SmtpClient at the moment since that's what you guys are focused on):

Socket Metrics

Metric: mailkit.net.socket.connect

Name Instrument Type Unit Description
mailkit.net.socket.connect Histogram ms The number of milliseconds taken for a socket to connect to a remote host.
Attribute Type Description Examples Presence
socket.connect.result string The connection result. succeeded, failed, cancelled Always
network.peer.address string Peer IP address of the socket connection. 10.5.3.2 Always
server.address string The host name that the socket is connecting to. smtp.gmail.com Always
server.port int The port that the socket is connecting to. 465 Always
error.type string The type of exception encountered. System.Net.Sockets.SocketException If an error occurred.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric measures the time it takes to connect a socket to a remote host.

SmtpClient Metrics

Metric: mailkit.net.smtp.client.connect.counter

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.counter Counter {connection} Number of outbound SMTP connections that are currently active.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always

This metric tracks the number of successful connections made by the SmtpClient.

Metric: mailkit.net.smtp.client.connect.failures

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.failures Counter {failure} The number of times a connection failed to be established to an SMTP server.
Attribute Type Description Examples Presence
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric counts the number of failed connection attempts made by the SmtpClient.

Metric: mailkit.net.smtp.client.session.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.session.duration Histogram s The duration of successfully established connections to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric tracks the session duration of each SmtpClient connection and records whether that session ended voluntarily or due to an error.

Metric: mailkit.net.smtp.client.send.messages

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.messages Counter {message} The number of messages successfully sent to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always

This metric tracks the number of messages successfully sent by an SmtpClient.

Metric: mailkit.net.smtp.client.send.failures

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.failures Counter {failure} The number of times that sending a message to an SMTP server failed.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric tracks the number of times an SmtpClient encountered an error while trying to send a message.

Metric: mailkit.net.smtp.client.send.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.duration Histogram ms The amount of time it takes to send a message to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric tracks the amount of time that it takes to send a message and records any error details if the message was not sent successfully.

jstedfast commented 6 months ago

First, I'd like to say that I'm not completely happy with the metric names I've come up with, especially the .counter and .failures metric names.

I'd also like to get some feedback on attribute names/values, especially the Type for both smtp.status_code and socket.error.

I started out with socket.error and mapped those SocketError enum values to string because the interger values are worthless to me. That said, the same is probably true for the SmtpStatusCode enum values that I decided to map to int for most of you guys.

Does it make sense for both of these to be int? In hindsight, I'm thinking it does.

The Socket Metrics are because the way I implemented it was to implement it at the MailKit.Net.SocketUtils-level rather than it the higher-level to avoid duplicating as much code. Also if a connect is aborted due to the SmtpClient.Timeout value or the CancellationToken, we don't actually get accurate information if I was to do it at the SmtpClient-level.

Thinking back on my questions from an earlier comment, I'm guessing all of my metrics should be "global" and not "instance", but I'm not 100% positive on that.

Then of course, if they are "global", then does it make sense to add attributes to determine which instance a metric was logged from? e.g. perhaps have a Guid (or int?) clientId?

jstedfast commented 6 months ago

Ok, so reading over https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metrics.md#naming-rules-for-counters-and-updowncounters

I guess I should probably make the following name changes:

(I think technically failures is ok, but maybe faults is better?)

I also wonder if maybe smtp.client.session.duration should just become smtp.client.duration?

EugeneKrapivin commented 6 months ago

instead of using faults use a tag/label status=failed it will be easier to dissect the metrics using tags. if you go back to my earlier example, you'd see how I used tags to add extra data on top of the metric.

mailkit.net.client.send.count {status=[success|failure], protocol=[smtp|pop3]}
mailkit.net.client.send.rate {status=[success|failure], protocol=[smtp|pop3]} 
mailkit.net.client.send_duration {status=[success|failure], protocol=[smtp|pop3], unit=ms}

just an idea.

PS: a bit of math :) a linearly increasing counter could give you a rate using a derivative operation. a rate meter could give you a count using an integral operation. sry for the math tidbits 🤣

jstedfast commented 6 months ago

Okay, so I'll drop the .failures.

That still doesn't answer the most important question which is should these instruments (Counters/Histograms) be per-instance or global?

jstedfast commented 6 months ago

Just to bookmark this for myself, this is the code for metrics reporting in HttpClient:

jstedfast commented 6 months ago

Updated list of metrics for SmtpClient:

SmtpClient Metrics

Metric: mailkit.net.smtp.client.connect.count

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.count Counter {attempt} The number of times a client has attempted to connect to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. System.Net.Sockets.SocketException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the number of times an SmtpClient has attempted to connect to an SMTP server.

Metric: mailkit.net.smtp.client.connect.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.duration Histogram ms The amount of time it takes for the client to connect to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. System.Net.Sockets.SocketException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the amount of time it takes an SmtpClient to connect to an SMTP server.

Metric: mailkit.net.smtp.client.connection.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.connection.duration Histogram s The duration of successfully established connections to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the connection duration of each SmtpClient connection and records any error details if the connection was terminated involuntarily.

Metric: mailkit.net.smtp.client.send.count

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.count Counter {message} The number of messages sent to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the number of messages sent by an SmtpClient and records any error details if the message was not sent successfully.

Metric: mailkit.net.smtp.client.send.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.duration Histogram ms The amount of time it takes to send a message to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the amount of time that it takes an SmtpClient to send a message and records any error details if the message was not sent successfully.

EugeneKrapivin commented 6 months ago

That still doesn't answer the most important question which is should these instruments (Counters/Histograms) be per-instance or global?

I'm sorry missed that question. I think you should share the Meters between instances, and it looks like the documentation says the APIs are thread-safe. I'd share the Meter (or use IMeterFactory which exists in net8 only😞), and create the instruments (counters, meters, histograms) in the constructor of the instance itself. But I'd be really happy to get some community input on this because I'm not sure whether this is the best practice.

The APIs to create instruments and record measurements are thread-safe. In .NET libraries, most instance methods require synchronization when invoked on the same object from multiple threads, but that's not needed in this case.

source: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation

I've seen your table, thats impressive :)

You should be very careful with tags with high cardinality: like IP and hostname. These might cause strain on the metrics database which may lead to degraded performance on the DB. I've seen (and made) mistakes where runaways tags values would cause crappy performance in influxDB, I'd assume the same could happen in Prometheus as well. source: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation#:~:text=Beware%20of%20having%20very%20large%20or%20unbounded%20combinations%20of%20tag%20values%20being%20recorded%20in%20practice.

jstedfast commented 6 months ago

You should be very careful with tags with high cardinality: like IP and hostname. These might cause strain on the metrics database which may lead to degraded performance on the DB.

I saw that in the docs. I'm not sure if it's worth reporting the network.peer.address, but other than the socket.error code, I'm following the HttpClient pattern.

All that said, now that I'm looking more at ActivitySource and was also working on adding metrics for Authenticate(), I'm wondering... will this balloon into insanity? HttpClient literally only has 3 metrics:

public readonly UpDownCounter<long> OpenConnections = meter.CreateUpDownCounter<long>(
            name: "http.client.open_connections",
            unit: "{connection}",
            description: "Number of outbound HTTP connections that are currently active or idle on the client.");

        public readonly Histogram<double> ConnectionDuration = meter.CreateHistogram<double>(
            name: "http.client.connection.duration",
            unit: "s",
            description: "The duration of successfully established outbound HTTP connections.");

        public readonly Histogram<double> RequestsQueueDuration = meter.CreateHistogram<double>(
            name: "http.client.request.time_in_queue",
            unit: "s",
            description: "The amount of time requests spent on a queue waiting for an available connection.");

And then there's the issue of nastiness from #if NET6_0_OR_GREATER everywhere because I still need to support netstandard2.0/2.1 and net4x. Things are gonna get real ugly once I start adding ActivitySource.StartActivity()

jstedfast commented 6 months ago

I could probably kill the connect.count and send.count metrics, right?

We could keep connect.duration and maybe kill the socket.error and smtp.status_code tags and just leave error.type (but maybe rename to exception.type)?

I could add authenticate.duration (something I just started adding) but not bother with authenticate.count.

SmtpClient has very little worth tracing (Send() is pretty much it other than maybe Verify() and Expand() but those are rarely used).

Pop3Client gets more complex with many more "activities" and commands that map to Pop3Client methods, so if every public API gets a metric and an activity, things will explode in code I need to write.

Then ImapClient... yikes, I'll go to my grave being unfinished with adding a metric/activity for each operation that there's a public API for.

prashanth-nani commented 6 months ago

I'd highly recommend keeping the send.count. This would help us in creating dashboards for the number of messages sent.

MassTransit exports messages sent and receive counters https://masstransit.io/documentation/configuration/observability

jstedfast commented 6 months ago

I'm just curious, but is anyone wanting tracing and metrics (and logging?) from MailKit doing anything but SMTP?

EugeneKrapivin commented 6 months ago

I saw that in the docs. I'm not sure if it's worth reporting the network.peer.address, but other than the socket.error code, I'm following the HttpClient pattern.

I think starting with less metrics with good tags is an amazing starting point. Better than adding useless metrics and then trying to get rid of them. As a good rule of thumb I'm always looking for RED metrics because it is really easy to build meaningful alerts on top of them. Another very useful metric would probably be # active connections # of created clients, these might be an indicator of wrong library usage (maybe?).

I'm just curious, but is anyone wanting tracing and metrics (and logging?) from MailKit doing anything but SMTP?

if you assume that mainly SMTP is used, it would be a good start point and then just wait for community feedback/MRs for pop and imap. In any case, logging and traces should be part of the observability suite, so that each call could be traced to a successful/failed email sending and to find logs that caused it :)

jstedfast commented 6 months ago

Let us take a moment to rewind a bit and take a step back and discuss what is actually needed because I think I'm perhaps overengineering this based on how little HttpClient seems to report.

Lets start with something that @EugeneKrapivin said in the beginning as well:

This is probably a huge ask, and I think it should be opened for discussion to see if it's worth the investment (aka will users of the library use it).

Part of the reason I'm asking about SMTP is because I get the distinct feeling that everyone in this discussion thread so far is entirely focused on SMTP and, honestly, that probably makes the most sense because many MailKit users are coming from the old System.Net.Mail.SmtpClient APIs.

Part of me wants to say "just wrap SmtpClient.Connect/Authenticate/Send in your own custom metrics/tracing/logging logic" because I'm not entirely sure that lower-level details are all that useful or interesting to you guys (they would be to me, but I'm not wearing the App Developer or Services Developer hat when I'm working on MailKit).

That said, if 100's of developers are going to be wrapping SmtpClient.Connect/Authenticate/Send/etc in order to collect basic metrics/tracing/logging data, then it probably really does make sense for MailKit to just add it to reduce all this code duplication.

So in that sense, I can see this "enhancement request" as being a "value-add" for MailKit and that's what interests me.

Now on to design.

If I was designing this for myself (as in, if this telemetry data was going to my own personal Azure Data Explorer), it's pretty much a guarantee that the information that I'd want is hugely different from what you guys want/need. At best, what you guys want/need will be a tiny subset of what I'd want. At worst, the data we want might be completely different.

For example, I'm not sure I'd care at all about # of messages sent. Sure, it'd be a fun statistic I could use for bragging rights, but not much else. But you guys seem to care about this metric.

Even Send/SendAsync() duration is not hugely useful for me. It's not completely useless, but on its own, it won't tell me if the time spent sending the message is due to network slowness, server response latency, the number of recipients being huge (e.g. 100's of recipients in the To/Cc/Bcc lists which each require a separate RCPT TO command), or if it's just the size of the message being 400 MB vs 4K.

These details matter (especially for my uses debugging/fixing MailKit issues).

Tracing and Logging would bring context into the picture, but again, what details do you guys care about?

Then again, presumably you guys will also already have some sort of Logging and Tracing infrastructure in your apps/services that will presumably already provide some context that would overlap whatever I add to MailKit.

So my question is: What value can MailKit provide here? Is it some details/context that your apps/services can't easily provide at a higher level? Or is it more to avoid code duplication/"boilerplate"?

I'm guessing it has more to do with reducing "boilerplate"/duplication between services than anything else.

jstedfast commented 6 months ago

I think starting with less metrics with good tags is an amazing starting point. Better than adding useless metrics and then trying to get rid of them. As a good rule of thumb I'm always looking for RED metrics because it is really easy to build meaningful alerts on top of them.

Thanks for the link. I'm skimming over it now (I'll read it more in depth later when I have time).

Another very useful metric would probably be # active connections # of created clients, these might be an indicator of wrong library usage (maybe?).

But is that something that is appropriate for MailKit itself to log? Or something more suitable for application-level code to log?

jstedfast commented 6 months ago

Let me voice a few more thoughts I'm having:

With HttpClient, for example, the metrics they log are resources that they and only they control - stuff that application-level developers can't easily log metrics for.

For example, HttpClient uses a pool of connections that the application developer can be completely unaware of because they have no fine grained control of their creation/deletion/etc. That's all happening under the hood.

For HttpClient, we have the following metrics:

http.client.open_connections

Consumers of the HttpClient APIs really have no ability to log this metric themselves because it's all under the hood. This metric does not count every new HttpClient (...) call. It's a metric that records the number of HttpConnections made by the HttpClients that are being used. These HttpConnections are then kept alive in case they are needed again by any new HttpClient that gets instantiated.

http.client.connection.duration

Again, this is under the hood. This refers to the duration of a time that an "HttpConnection" is kept alive, it's not the amount of time that an HttpClient instance was kept around before being disposed.

http.client.request.time_in_queue

The reason there's a time_in_queue at all is because the HttpClient doesn't send the HttpRequests itself, it adds them to a dispatch queue that one of the "HttpConnection"s will eventually grab and send when it becomes available.

These aren't metrics that you, as an apllication/web-service developer, can log.

They aren't the same as SmtpClient.Connect/Authenticate/Send because an SmtpClient instance isn't a middle-man for a pool of "SmtpConnections" that are running in the background that consume requests from a queue that SmtpClient feeds into.

prashanth-nani commented 6 months ago

Great points! Let me share my specific use case and why I ended up on this issue in the first place.

Within our application, users can set up their SMTP settings for outgoing mail and IMAP for accessing their mailboxes. We faced significant delays in sending emails with a specific configuration used by one user who uses Sparkpost for outgoing emails. While the emails were eventually sent, the delays were abnormal and only occurred with this particular configuration. I initiated a support ticket with Sparkpost, but they requested additional information since they didn't detect any anomalies on their end. Initially unsure about where to investigate, I began exploring whether Mailkit provides any metrics that could help diagnose the issue. Thus, my journey led me to this discussion.

prashanth-nani commented 6 months ago

In solving this issue, it's possible that we only need traces since we're aware of the problem and need to identify the specific troublesome trace. However, having metrics would be beneficial for configuring proactive alerts and receiving notifications to address these issues promptly.

jstedfast commented 6 months ago

Okay, so for a metric like send.duration - would it make sense to include recipient.count and/or message.size?

Or would it be better to have the metric tags themselves be more minimalist and include recipient.count and message.size be included in some other form of telemetry data such as a trace or log?

prashanth-nani commented 6 months ago

I’m not entirely sure about the best course of action here. However, I'm leaning towards having more details in the traces rather than in the metrics. Personally, I'd rather avoid mixing external library logs with my app logs as it seems somewhat unconventional to me. Let’s consider adding metrics only when we feel they are essential for monitoring, as suggested by @EugeneKrapivin. If uncertain, let's refrain from adding it for now and revisit the decision when needed.

EugeneKrapivin commented 6 months ago

I assume we'd like data that will help us monitor and troubleshoot the service faster, preferably before issues on the client side begin to bubble :)

For example, I'm not sure I'd care at all about # of messages sent. Sure, it'd be a fun statistic I could use for bragging rights, but not much else. But you guys seem to care about this metric.

it is an integral of the rate metric, so as long as you report 1 (either count or rate) you could calculate the other simply

These are prometheus queries with grafana variables that I copied form a board we have to monitor number exceptions thrown by a service in production:

process_runtime_dotnet_exceptions_count{dc='$dc', env='$env',name='$service', version=~'$ver'}

image the process_runtime_dotnet_exceptions_count is a dotnet metric reported by the runtime , as you can clearly see, it is a counter, linear counter. with a simple rate function on this series we could get rate of exceptions image

Even Send/SendAsync() duration is not hugely useful for me

metrics like these are actually useful for developers who use MailKit, they could report those metrics and build useful alerts based on this metric. If the duration of a send request is increasing overtime its a cause for alarm, seeing quantiles (p99.9) of the duration also helps find issues, especially spikes that get lost in the standard average/mean aggregation people look at.

Tracing and Logging would bring context into the picture, but again, what details do you guys care about?

it'd be really helpful to get outgoing client requests as activities (trace) because it would allow the user to trace a single request and whether or not the email was sent successfully. that's (distributed traces) actually something I'm using daily while researching production issues.

var start = Stopwatch.GetTimestamp();
using var sendActivity = source.StartActivity("Smtp.Send", ActivityKind.Client);
try
{
    var r = await _client.SendAsync(message, cts.Token);

    var total = Stopwatch.GetElapsedTime(start);

    sendActivity?.AddTag("status", "success");
    sendActivity?.AddEvent(new ActivityEvent("email sent"));
    _sendMessageHistogram.Record(total.TotalMilliseconds new KeyValuePair<string, object?>("status", "success"));
}

Especially since Traces and Logs are correlated using a RequestId parameter, meaning I could trace a request and find all logs written as a side-effect of the call chain.

In general, metrics are only a partial view of the whole observability game which must include logs, traces and metrics :).

I say you could have only 2-3 metrics, with good tags just to get you up and running with RED metrics. Get those logs out the door to the logging tool that the developers use and get those traces with dotnet Activity SDK. It doesn't have all to be at once, it could be gradual. I'd suggest sticking to:

instrument type metric name tags
counter mailkit.net.client.op.count client=[smtp|pop3|imap], status=[success|failed], operation=[send|connect|auth|noop]
histogram mailkit.net.client.op.duration client=[smtp|pop3|imap], status=[success|failed], unit=ms, operation=[send|connect|auth|noop]
histogram mailkit.net.client.send.size {client=[smtp|pop3|imap], status=[success|failed], unit=byte}

of course I do not know the internals of the library so if there are other metrics that you think that might be useful for the users of the library, you should add them. the metrics I suggested would allow a user to:

of course all of those I could gather manually from outside the library by simply wrapping it with lots of code... but that's the added value you've mentioned :)

Regarding traces, just instrument the most used (in your opinion) client and see if requests or MR for other clients arrive =) if you build it, they will come 🤣

Regarding logs, I strongly disagree with @prashanth-nani about library logs in my product logs, I think that once I use the library in my code, it is my job to monitor, read and understand its behavior. This is why all of our logs are gathered into elasticsearch under a single request ID for each and every request (pricy endeavor for sure). these include all the logs that are written by a running process. If there is something I don't want I just filter it in the log4net/logging/logstash/flume forwarders and drop it before it gets stored. logs from MailKit is as important as your log in the service once you use MailKit in production.

@jstedfast I think you should be using the Microsoft logging abstraction, it'd allow the user to use any logger they want (all loggers implement the ILogger abstraction. In addition, those logs could be easily exported to OpenTelemetry collector easily in dotnet. Of course all you have to do is use the standard ILogger logging abstraction, all the exporting logic is not your job, just to be clear.

wow this got fairly long. sorry.

prashanth-nani commented 6 months ago

@EugeneKrapivin, thanks for the detailed explanation! I fully agree with your points.

I stand corrected regarding my previous thoughts on library logs. Including them where ever necessary using ILogger is a great idea as it is configurable in the application.

On instrumentation, I align with your suggestions for the traces and RED metrics.

Additionally, message.size tag (and any other crucial information that could cause delays) in the trace activities, and exceptions as trace events would be valuable to correlate with longer duration traces.

Thanks again for your valuable insights!

EugeneKrapivin commented 6 months ago

Additionally, message.size tag (and any other crucial information that could cause delays) in the trace activities, and exceptions as trace events would be valuable to correlate with longer duration traces.

just be careful with tags with variadic values on metrics - it will explode the backend storage for sure :) it should be OK as an argument in Activity/Event trace

jstedfast commented 6 months ago

@EugeneKrapivin Thanks for your thoughts, as always. I really appreciate the guidance because I'm not well-versed in this stuff.

What you said makes sense and brings the bigger picture into better focus for me.

jstedfast commented 6 months ago

Here's where I'm at with telemetry metrics: https://github.com/jstedfast/MailKit/blob/metrics/Telemetry.md

A lot of this is based on HttpClient and reading up on the OpenTelemetry specifications, conventions, etc.

I've also got Activities that span Connect/Async, Authenticate/Async and Send/Async.

EugeneKrapivin commented 6 months ago

@jstedfast would you mind opening a draft merge/pull request so we could productively leave comments?

jstedfast commented 6 months ago

Done: https://github.com/jstedfast/MailKit/pull/1715