mozilla-services / services-engineering

Services engineering core repo - Used for issues/docs/etc that don't obviously belong in another repo.
2 stars 1 forks source link

Investigate Spanner failure rates in telemetry #51

Open pjenvey opened 4 years ago

pjenvey commented 4 years ago

Per @davismtl,

Telemetry shows Spanner nodes having a slightly higher ratio of Number of Failed Syncs / Number of Sync. The difference has slightly increased over the past month.

pjenvey commented 4 years ago

Looking at Overall Sync Failure Name, there appears to be similar differences between mysql_nserror vs spanner_nserror (disabling mysql/spanner_success makes this clearer).

Also similarities w/ spanner_httperror vs mysql_httperror.

I believe nserror here indicates a DNS lookup failure nserror appears to be a generic class of errors, and httperror looks like any kind of error response.

irrationalagent commented 4 years ago

I'll see if I can dig up the error codes. @pjenvey do you have a hard timeline for this or just sooner rather than later. I should be able to take a peek by EOW

pjenvey commented 4 years ago

@irrationalagent Just sooner rather than later, thanks!

mhammond commented 4 years ago

I had a quick play with this. 'nserror' codes are errors from the desktop platform - in most cases for sync, these error codes are actually network errors - errors where we can't even start a http request (and as phil says, httperror typically means a http error response, as opposed to a failure to actually make the request at all.)

Finding these code in the tree is tricky, but http://james-ross.co.uk/mozilla/misc/nserror is good enough. However, the "most popular" ones are:

2152398878 = NS_ERROR_UNKNOWN_HOST 2152398861 = NS_ERROR_CONNECTION_REFUSED 2152398862 = NS_ERROR_NET_TIMEOUT 2147500036 = NS_ERROR_ABORT

I forked that query, then changed the sql to look only at nserror and I see NS_ERROR_UNKNOWN_HOST - I tried to publish it and it warned me I was overwriting something else - which surprised me as I forked it - but I was worried I was overwriting something bad, and long story short, I got confused :) So I think https://sql.telemetry.mozilla.org/queries/70710/source#177932 isn't going to work for you, so I pasted the SQL below.

Long story short, spanner has those 4 error codes in higher numbers which accounts for most of the different. I've no idea why that would be the case though.

My hacks to the sql - only look at nserror and put the code before the node type

WITH 
    d as (
        SELECT
            timestamp_trunc(submission_timestamp,hour) as hour,
            case when s.failure_reason is null then concat('success', '_', case when payload.sync_node_type is null then 'unknown' else payload.sync_node_type end)
                else concat(s.failure_reason.code, '_', case when payload.sync_node_type is null then 'unknown' else payload.sync_node_type end) end as node_type_sync_failure_reason,
            count(*) as syncs_with_name,
            count(distinct payload.uid) as users_with_name
        from `moz-fx-data-shared-prod.telemetry_live.sync_v4`
        CROSS JOIN UNNEST(payload.syncs) s
        where s.failure_reason.name = "nserror"
        AND date(submission_timestamp) >= date(2020,1,9)
        AND normalized_channel in ('nightly', 'beta', 'release')
        AND safe_cast(substr(application.version,1,2) as int64) >= 73 
        group by 1,2 order by 1,2
    ),

    totals as (
       SELECT
            timestamp_trunc(submission_timestamp,hour) as hour,
            case when payload.sync_node_type is null then 'unknown' else payload.sync_node_type end as sync_node_type,
            count(*) as total_syncs,
            count(distinct payload.uid) as total_users
        from `moz-fx-data-shared-prod.telemetry_live.sync_v4`
        CROSS JOIN UNNEST(payload.syncs) s
        where date(submission_timestamp) >= date(2020,1,9)
        AND normalized_channel in ('nightly', 'beta', 'release')
        AND safe_cast(substr(application.version,1,2) as int64) >= 73 
        group by 1,2

    )

SELECT
    d.*,
    t.total_users,
    t.total_syncs,
    t.sync_node_type,
    d.syncs_with_name / t.total_syncs as proportion_of_syncs_with_failure_reason,
    d.users_with_name / t.total_users as proportion_of_users_with_failure_reason

FROM d 
INNER JOIN totals t
ON d.hour = t.hour AND split(d.node_type_sync_failure_reason, '_')[OFFSET(1)] = t.sync_node_type
WHERE t.sync_node_type != 'unknown'
order by 1,2