yunapotamus commented 3 years ago

Background

We've encountered several instances where the logUserID and viewID found in delivery RPCs don't match up with any logged client events. This is more common for viewID. This could be caused by the mobile client failing to send some events, but we have no way of knowing if this is the case in prod right now.

Proposal

We can already collect stats about the number of messages sent successfully as well as the number of errors that may have caused failures. This is done via the Xray class. The proposal for gathering more data involves expanding the functionality of Xray to gather these stats for us in production, and then sending those stats back via LogRequest.

Client changes

Currently, Xray has two settings: xrayEnabled: Bool and xrayExpensiveStackTracesEnabled: Bool. This proposal replaces both of those settings with a single enum:

@objc(PROXrayLevel)
public enum XrayLevel: Int {
  // Don't gather any Xray stats data at all.
  case none
  // Gather overall counts for the session for each batch.
  // ie. batches: 40, batches sent successfully: 39, errors: 1
  case batchSummaries
  // Gather stats and logged messages for each call made
  // to the metrics library.
  // Corresponds to the current `xrayEnabled` flag
  case callDetails
  // Gathers stats and logged messages for each call made
  // to the metrics library, as well as stack traces where the
  // calls were made.
  // Corresponds to the current `xrayExpensiveStackTracesEnabled` flag
  case callDetailsAndStackTraces
}

Also, create a new parameter to ClientConfig:

public class ClientConfig {
  public var diagnosticsIncludeBatchSummaries: Bool = false
}

For new integrations, we ship with the default Xray level to batchSummaries and diagnosticsIncludeBatchSummaries = true. If these batch summaries are available, then the logger will automatically populate these summaries in LogRequests sent to the server. The stats sent in these summaries will be cumulative for the session.

Server changes

Introduce a new message on LogRequest called MobileDiagnostics:

message IOSError {
  int32 code = 1;
  string domain = 2;
  string description = 3;
  int32 batch_number = 4;
}

message IOSErrorHistory {
  // Window of latest 10 errors.
  repeated IOSError error_history = 1;
  // Total number of errors encountered.
  int32 total_errors = 2;
}

message MobileDiagnostics {
  // Common fields.
  uint64 platform_id = 1;
  common.UserInfo user_info = 2;
  common.Timing timing = 3;
  common.ClientInfo client_info = 4;
  // Identifier for device sending this message.
  string device_identifier = 5;
  // Version identifier for client app.
  string client_version = 6;
  // Version of Promoted library.
  string promoted_library_version = 7;
  // Number of batch logs attempted.
  int32 batches_attempted = 8;
  // Number of batch logs sent successfully.
  int32 batches_sent_successfully = 9;
  // Number of batch logs that failed to send.
  int32 batches_with_errors = 10;
  // Error history.
  IOSErrorHistory ios_error_history = 11;
  // Ancestor ID history. See #133.
  AncestorIdHistory ancestor_id_history = 12;
}

message LogRequest {
  MobileDiagnostics mobile_diagnostics = n;
}

TODO: How does the server consume these MobileDiagnostics messages? TODO: How much detail is useful for diagnosing problems? Would we want the total number and type of each event logged, for instance?

Heartbeat

The batches_attempted field also functions as an effective "heartbeat", in that any gaps in network connectivity causing dropped batches would also cause a gap in the incremental values of this field. That is, if record[i].batches_attempted - record[i - 1].batches_attempted > 1, then we know that some batch messages failed to send, even if no errors were recorded.

This does not catch the case where the latest batch fails to send, though.

Short term vs long term

We can ship with the Xray level set to batchSummaries on first integration. Once we have confidence in the metrics coming in from a new client integration, we can change this to none.

Long term goal is to make this setting configurable via remote config. If a client starts to exhibit logging anomalies, we can remotely enable this setting for some or all devices for that client.

Sign off

Work begins when sign-off is received from all of the following:

[x] Andrew
[x] Dan
[x] Jin
[x] YW

prm-dan commented 3 years ago

Q - what user/device identifier info do you want to include? We'd want these on SessionStats. Ditto for other common fields (e.g. platform_id, timing).
I'd recommend changing the name away from SessionStats. Session will be a concept in our system. Stats is a subset of `Metrics. It'd be useful to have something that indicates that this is diagnostic, debug, etc.

Here are some other name ideas:

SessionDiagnostics
SessionDiagnosticInfo
SessionDiagnosticSummary
DiagnosticSummary (and other types)
SessionDebugInfo

Q - Do you want this for web too?

prm-dan commented 3 years ago

JFYI - I eventually want to move us to a generic byte array support on LogRequest. I'm fine if that is not done for this change.

yunapotamus commented 3 years ago

Q - what user/device identifier info do you want to include? We'd want these on SessionStats. Ditto for other common fields (e.g. platform_id, timing).

I recommend we use UIDevice.identifierForVendor on iOS. It's a UUID unique to the vendor and device, and preserves privacy.

yunapotamus commented 3 years ago

I'd recommend changing the name away from SessionStats. Session will be a concept in our system. Stats is a subset of `Metrics. It'd be useful to have something that indicates that this is diagnostic, debug, etc.

How about MobileDiagnostics?

yunapotamus commented 3 years ago

Q - Do you want this for web too?

Yes, that makes sense. But web doesn't do batching, right? Would we want separate WebDiagnostics and MobileDiagnostics?

yunapotamus commented 3 years ago

I added client_version and promoted_library_version to MobileDiagnostics. We could move these to a different method if people prefer.

prm-dan commented 3 years ago

Do you also want log_user_id, user_id or any other scopes?

If possible, I'd keep the same starting fields: platform_id, user_info, timing.

For the size comment, don't worry about it. The size will be very small. We won’t be joining this in a streaming way with our other records so it's less of a concern.
This looks good. We can add more later if needed.
We'll eventually want a user activity ping (to see if they are still using the app). We can keep that as a separate feature.

yunapotamus commented 3 years ago

Do you also want log_user_id, user_id or any other scopes?

If possible, I'd keep the same starting fields: platform_id, user_info, timing.

Do you mean on the MobileDiagnostics message? We should be able to infer the scopes from the ancestor ID history. I'll add the other fields.

For the size comment, don't worry about it. The size will be very small. We won’t be joining this in a streaming way with our other records so it's less of a concern.

This looks good. We can add more later if needed.

We'll eventually want a user activity ping (to see if they are still using the app). We can keep that as a separate feature.

Ack. Thanks for the review.

promotedai / ios-metrics-sdk

Create diagnostic tool for SDK #130

Background

Proposal

Client changes

Server changes

Heartbeat

Short term vs long term

Sign off