Open pjenvey opened 4 years ago
Looking at Overall Sync Failure Name, there appears to be similar differences between mysql_nserror
vs spanner_nserror
(disabling mysql/spanner_success
makes this clearer).
Also similarities w/ spanner_httperror
vs mysql_httperror
.
I believe nserror appears to be a generic class of errors, and nserror
here indicates a DNS lookup failurehttperror
looks like any kind of error response.
I'll see if I can dig up the error codes. @pjenvey do you have a hard timeline for this or just sooner rather than later. I should be able to take a peek by EOW
@irrationalagent Just sooner rather than later, thanks!
I had a quick play with this. 'nserror' codes are errors from the desktop platform - in most cases for sync, these error codes are actually network errors - errors where we can't even start a http request (and as phil says, httperror
typically means a http error response, as opposed to a failure to actually make the request at all.)
Finding these code in the tree is tricky, but http://james-ross.co.uk/mozilla/misc/nserror is good enough. However, the "most popular" ones are:
2152398878 = NS_ERROR_UNKNOWN_HOST 2152398861 = NS_ERROR_CONNECTION_REFUSED 2152398862 = NS_ERROR_NET_TIMEOUT 2147500036 = NS_ERROR_ABORT
I forked that query, then changed the sql to look only at nserror and I see NS_ERROR_UNKNOWN_HOST - I tried to publish it and it warned me I was overwriting something else - which surprised me as I forked it - but I was worried I was overwriting something bad, and long story short, I got confused :) So I think https://sql.telemetry.mozilla.org/queries/70710/source#177932 isn't going to work for you, so I pasted the SQL below.
Long story short, spanner has those 4 error codes in higher numbers which accounts for most of the different. I've no idea why that would be the case though.
My hacks to the sql - only look at nserror and put the code before the node type
WITH
d as (
SELECT
timestamp_trunc(submission_timestamp,hour) as hour,
case when s.failure_reason is null then concat('success', '_', case when payload.sync_node_type is null then 'unknown' else payload.sync_node_type end)
else concat(s.failure_reason.code, '_', case when payload.sync_node_type is null then 'unknown' else payload.sync_node_type end) end as node_type_sync_failure_reason,
count(*) as syncs_with_name,
count(distinct payload.uid) as users_with_name
from `moz-fx-data-shared-prod.telemetry_live.sync_v4`
CROSS JOIN UNNEST(payload.syncs) s
where s.failure_reason.name = "nserror"
AND date(submission_timestamp) >= date(2020,1,9)
AND normalized_channel in ('nightly', 'beta', 'release')
AND safe_cast(substr(application.version,1,2) as int64) >= 73
group by 1,2 order by 1,2
),
totals as (
SELECT
timestamp_trunc(submission_timestamp,hour) as hour,
case when payload.sync_node_type is null then 'unknown' else payload.sync_node_type end as sync_node_type,
count(*) as total_syncs,
count(distinct payload.uid) as total_users
from `moz-fx-data-shared-prod.telemetry_live.sync_v4`
CROSS JOIN UNNEST(payload.syncs) s
where date(submission_timestamp) >= date(2020,1,9)
AND normalized_channel in ('nightly', 'beta', 'release')
AND safe_cast(substr(application.version,1,2) as int64) >= 73
group by 1,2
)
SELECT
d.*,
t.total_users,
t.total_syncs,
t.sync_node_type,
d.syncs_with_name / t.total_syncs as proportion_of_syncs_with_failure_reason,
d.users_with_name / t.total_users as proportion_of_users_with_failure_reason
FROM d
INNER JOIN totals t
ON d.hour = t.hour AND split(d.node_type_sync_failure_reason, '_')[OFFSET(1)] = t.sync_node_type
WHERE t.sync_node_type != 'unknown'
order by 1,2
Per @davismtl,
Telemetry shows Spanner nodes having a slightly higher ratio of Number of Failed Syncs / Number of Sync. The difference has slightly increased over the past month.