Open sean-rose opened 2 months ago
The assert-telemetry-version
test is currently failing as a side effect of four legacy telemetry ping schemas not having the version
property marked as required:
Should those schemas be changed so the version
property is required?
Should those schemas be changed so the
version
property is required?
Can you check payload_bytes_error
to see how many pings are being dropped because of this and which pings/fields are most frequent? The changes make sense to me but it would be useful to understand the impact
Can you check
payload_bytes_error
to see how many pings are being dropped because of this and which pings/fields are most frequent? The changes make sense to me but it would be useful to understand the impact
Beyond the incident with the distro
property (which you already addressed manually via #810) there doesn't appear to be any significant number of pings dropped recently due to null values that would be accepted with this PR's changes deployed. I checked the cases with >1k dropped pings due to null values in the last month, and all of the associated properties are marked as required and thus wouldn't have their validation changed.
So this is mostly just guarding against future incidents like what happened with the distro
property.
For reference, I used the following query:
select document_namespace, document_type, document_version, error_message, count(*) as error_count
from `moz-fx-data-shared-prod.monitoring.payload_bytes_error_all`
where
date(submission_timestamp) >= current_date() - 31
and error_message like '%found: Null'
and error_message not like '%/distro:%'
group by all
order by error_count desc
Many telemetry properties aren't configured to be required, and if the property is omitted entirely in submitted telemetry then the associated column in BigQuery ends up with a null value. However, if such non-required properties aren't specifically configured to allow null values and telemetry is sent with an explicit null value for the property then validation will fail with an "unexpected type" error and the entire record will be rejected, despite the fact that from the BigQuery perspective it's functionally equivalent to omitting the non-required property entirely. IMO this behavior is detrimental (e.g. DENG-3588).
In order to mitigate the above problem, this PR updates the schema build process to try to automatically identify non-required properties and adjust their configuration to allow null values.
Checklist for reviewer:
./.github/push-to-trigger-integration <username>:<branchname>
For glean changes:
templates/include/glean/CHANGELOG.md
For modifications to schemas in restricted namespaces (see
CODEOWNERS
):