yunapotamus closed 3 years ago
Q - what user/device identifier info do you want to include? We'd want these on SessionStats. Ditto for other common fields (e.g. platform_id, timing).
I'd recommend changing the name away from `SessionStats`. `Session` will be a concept in our system. `Stats` is a subset of `Metrics`. It'd be useful to have something that indicates that this is diagnostic, debug, etc.
Here are some other name ideas:
Q - what user/device identifier info do you want to include? We'd want these on SessionStats. Ditto for other common fields (e.g. platform_id, timing).
I recommend we use `UIDevice.identifierForVendor` on iOS. It's a UUID unique to the vendor and device, and preserves privacy.
I'd recommend changing the name away from `SessionStats`. `Session` will be a concept in our system. `Stats` is a subset of `Metrics`. It'd be useful to have something that indicates that this is diagnostic, debug, etc.
How about `MobileDiagnostics`?
Q - Do you want this for web too?
Yes, that makes sense. But web doesn't do batching, right? Would we want separate `WebDiagnostics` and `MobileDiagnostics`?
I added `client_version` and `promoted_library_version` to `MobileDiagnostics`. We could move these to a different method if people prefer.
If possible, I'd keep the same starting fields: platform_id, user_info, timing.
For the size comment, don't worry about it. The size will be very small. We won’t be joining this in a streaming way with our other records so it's less of a concern.
This looks good. We can add more later if needed.
We'll eventually want a user activity ping (to see if they are still using the app). We can keep that as a separate feature.
Do you also want log_user_id, user_id or any other scopes?
If possible, I'd keep the same starting fields: platform_id, user_info, timing.
Do you mean on the `MobileDiagnostics` message? We should be able to infer the scopes from the ancestor ID history. I'll add the other fields.
- For the size comment, don't worry about it. The size will be very small. We won’t be joining this in a streaming way with our other records so it's less of a concern.
- This looks good. We can add more later if needed.
- We'll eventually want a user activity ping (to see if they are still using the app). We can keep that as a separate feature.
Ack. Thanks for the review.
Background
We've encountered several instances where the `logUserID` and `viewID` found in delivery RPCs don't match up with any logged client events. This is more common for `viewID`. This could be caused by the mobile client failing to send some events, but we have no way of knowing if this is the case in prod right now.
Proposal
We can already collect stats about the number of messages sent successfully as well as the number of errors that may have caused failures. This is done via the `Xray` class. The proposal for gathering more data involves expanding the functionality of `Xray` to gather these stats for us in production, and then sending those stats back via `LogRequest`.
Client changes
Currently, `Xray` has two settings: `xrayEnabled: Bool` and `xrayExpensiveStackTracesEnabled: Bool`. This proposal replaces both of those settings with a single enum:
Also, create a new parameter to `ClientConfig`:
For new integrations, we ship with the default Xray level set to `batchSummaries` and `diagnosticsIncludeBatchSummaries = true`. If these batch summaries are available, then the logger will automatically populate them in the `LogRequest`s sent to the server. The stats sent in these summaries will be cumulative for the session.
Server changes
Introduce a new message on `LogRequest` called `MobileDiagnostics`:
TODO: How does the server consume these `MobileDiagnostics` messages?
TODO: How much detail is useful for diagnosing problems? Would we want the total number and type of each event logged, for instance?
Heartbeat
The `batches_attempted` field also functions as an effective "heartbeat", in that any gaps in network connectivity causing dropped batches would also cause a gap in the incremental values of this field. That is, if `record[i].batches_attempted - record[i - 1].batches_attempted > 1`, then we know that some batch messages failed to send, even if no errors were recorded.
This does not catch the case where the latest batch fails to send, though.
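The gap check described above could be sketched as follows. This is only an illustration: the `DiagnosticsRecord` struct and the `heartbeatGaps(in:)` function are hypothetical names, and the sketch assumes the records belong to a single session and are already ordered by arrival. Only the `batches_attempted` field comes from the proposal.

```swift
// Hypothetical, simplified view of one received diagnostics record.
// Only `batches_attempted` comes from the proposal; the struct is an assumption.
struct DiagnosticsRecord {
  let batchesAttempted: Int
}

/// Returns every index `i` where the cumulative counter jumped by more than
/// one between record[i - 1] and record[i], meaning at least one batch sent
/// in between never reached the server.
func heartbeatGaps(in records: [DiagnosticsRecord]) -> [Int] {
  // With zero or one record there is nothing to compare.
  guard records.count > 1 else { return [] }
  var gaps: [Int] = []
  for i in 1..<records.count {
    let delta = records[i].batchesAttempted - records[i - 1].batchesAttempted
    if delta > 1 {
      gaps.append(i)
    }
  }
  return gaps
}

// Batches 4 and 5 never arrived, so the jump shows up at index 3.
let received = [1, 2, 3, 6, 7].map { DiagnosticsRecord(batchesAttempted: $0) }
let gaps = heartbeatGaps(in: received)
print(gaps)  // [3]
```

As noted, a check like this only sees losses that are followed by a later successful batch, so a dropped final batch stays invisible.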
Short term vs long term
We can ship with the Xray level set to `batchSummaries` on first integration. Once we have confidence in the metrics coming in from a new client integration, we can change this to `none`.
The long-term goal is to make this setting configurable via remote config. If a client starts to exhibit logging anomalies, we can remotely enable this setting for some or all devices for that client.
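A minimal sketch of how the level and a remote override might fit together is below. It is not the library's actual API: the type name `XrayLevel`, the property name `xrayLevel`, the `apply(remoteValues:)` method, and the remote config keys are all assumptions made for illustration. Only the `none` and `batchSummaries` levels, `ClientConfig`, and `diagnosticsIncludeBatchSummaries` are named in this proposal, and the real enum may have more cases.

```swift
// Sketch only. Apart from `none`, `batchSummaries`, `ClientConfig`, and
// `diagnosticsIncludeBatchSummaries`, every name here is hypothetical.
enum XrayLevel: String {
  case none
  case batchSummaries
}

struct ClientConfig {
  // Defaults for a new integration, as described in the proposal.
  var xrayLevel: XrayLevel = .batchSummaries
  var diagnosticsIncludeBatchSummaries: Bool = true

  /// Applies remotely fetched key/value overrides.
  /// Missing keys and values that fail to parse leave the setting unchanged.
  mutating func apply(remoteValues: [String: String]) {
    if let raw = remoteValues["xray_level"],
       let level = XrayLevel(rawValue: raw) {
      xrayLevel = level
    }
    if let raw = remoteValues["diagnostics_include_batch_summaries"],
       let flag = Bool(raw) {
      diagnosticsIncludeBatchSummaries = flag
    }
  }
}

// Once a client's metrics are trusted, a remote value turns diagnostics off.
var config = ClientConfig()
config.apply(remoteValues: ["xray_level": "none"])
```

Ignoring unparseable values keeps a bad remote config from silently changing diagnostics on devices, which seems like the safer default for a setting meant to be flipped only when anomalies appear.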
Sign off
Work begins when sign-off is received from all of the following: