mozilla-services / mozilla-pipeline-schemas

Schemas for Mozilla's data ingestion pipeline and data lake outputs
https://protosaur.dev/mps-deploys/
Other
46 stars 91 forks source link

Update build process to add "null" as an allowed type for non-required properties #811

Open sean-rose opened 2 months ago

sean-rose commented 2 months ago

Many telemetry properties aren't configured to be required, and if the property is omitted entirely in submitted telemetry then the associated column in BigQuery ends up with a null value. However, if such non-required properties aren't specifically configured to allow null values and telemetry is sent with an explicit null value for the property then validation will fail with an "unexpected type" error and the entire record will be rejected, despite the fact that from the BigQuery perspective it's functionally equivalent to omitting the non-required property entirely. IMO this behavior is detrimental (e.g. DENG-3588).

In order to mitigate the above problem, this PR updates the schema build process to try to automatically identify non-required properties and adjust their configuration to allow null values.


Checklist for reviewer:

For glean changes:

For modifications to schemas in restricted namespaces (see CODEOWNERS):

sean-rose commented 2 months ago

The assert-telemetry-version test is currently failing as a side effect of four legacy telemetry ping schemas not having the version property marked as required:

Should those schemas be changed so the version property is required?

sean-rose commented 2 months ago

Should those schemas be changed so the version property is required?

812

BenWu commented 2 months ago

Can you check payload_bytes_error to see how many pings are being dropped because of this and which pings/fields are most frequent? The changes make sense to me but it would be useful to understand the impact

sean-rose commented 2 months ago

Can you check payload_bytes_error to see how many pings are being dropped because of this and which pings/fields are most frequent? The changes make sense to me but it would be useful to understand the impact

Beyond the incident with the distro property (which you already addressed manually via #810) there doesn't appear to be any significant number of pings dropped recently due to null values that would be accepted with this PR's changes deployed. I checked the cases with >1k dropped pings due to null values in the last month, and all of the associated properties are marked as required and thus wouldn't have their validation changed.

So this is mostly just guarding against future incidents like what happened with the distro property.

For reference, I used the following query:

select document_namespace, document_type, document_version, error_message, count(*) as error_count
from `moz-fx-data-shared-prod.monitoring.payload_bytes_error_all`
where
  date(submission_timestamp) >= current_date() - 31
  and error_message like '%found: Null'
  and error_message not like '%/distro:%'
group by all
order by error_count desc