mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform
https://mozilla.github.io/gcp-ingestion/
Mozilla Public License 2.0

Discuss: decoder transformations for special cases #640

Open whd opened 5 years ago

whd commented 5 years ago

This was discussed briefly in our Monday meeting with the outcome being to file an issue to hammer out the details.

There are a few classes of special cases where it is perhaps desirable to transform input data in some way beyond what we already do. The ones discussed have been:

  1. The data the client sends is sometimes mismatched in type and we would like to coerce that data to a single canonical type

This is specifically to accommodate bad clients or clients that we can't fix, since getting clients to submit schema-conformant types is obviously preferred.

Examples include:

  a. pre-account ping scalars (maybe not an issue in GCP due to probe-scraper infrastructure)
  b. Sync pings
  c. DSMO schema evolution (implemented late enough on GCP that we started with the changed type)

  2. The data the client sends is of a type that is similar to, but not the same as, the type supported by our infrastructure

See #633 for an example and relevant discussion.

Some potential resolutions:

  1. Create some sort of schema annotation for fields that should be coerced
  2. Make the default decoder behavior be to coerce fields to the type in schema where possible
  3. Do nothing ingestion-side; instead promote views and UDFs as the method for querying these kinds of data. We will likely want to automate the tooling for generating such views, so an annotation mechanism may still be beneficial. See the discussion in #633 etc. for more on this approach. Fields with a type mismatch should remain in additionalProperties or will otherwise be considered decoder errors.
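To make the second resolution concrete, here is a minimal Python sketch of "coerce to the schema type where possible". It is not the gcp-ingestion decoder: the flat `schema` mapping, the type names, and the functions `coerce_value` and `decode` are all hypothetical and exist only for illustration. Fields that cannot be coerced, or that the schema does not know about, are returned separately, which loosely mirrors leaving them in additionalProperties.

```python
from typing import Any, Dict, Tuple

# Hypothetical schema format for illustration: field name -> canonical type name.
Schema = Dict[str, str]


def coerce_value(value: Any, target: str) -> Tuple[bool, Any]:
    """Try to coerce value to the target type; return (ok, result)."""
    if target == "string":
        if isinstance(value, str):
            return True, value
        if isinstance(value, bool):
            return True, "true" if value else "false"
        if isinstance(value, (int, float)):
            return True, str(value)
        return False, value
    if target == "integer":
        # bool is a subclass of int in Python, so check it first.
        if isinstance(value, bool):
            return True, int(value)
        if isinstance(value, int):
            return True, value
        if isinstance(value, float) and value.is_integer():
            return True, int(value)
        if isinstance(value, str):
            try:
                return True, int(value)
            except ValueError:
                return False, value
        return False, value
    if target == "boolean":
        if isinstance(value, bool):
            return True, value
        if isinstance(value, int) and value in (0, 1):
            return True, bool(value)
        if isinstance(value, str) and value.lower() in ("true", "false"):
            return True, value.lower() == "true"
        return False, value
    # Unknown target type: do not guess.
    return False, value


def decode(payload: Dict[str, Any], schema: Schema) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a payload into coerced schema fields and leftover fields."""
    row: Dict[str, Any] = {}
    additional: Dict[str, Any] = {}
    for name, value in payload.items():
        target = schema.get(name)
        ok, coerced = coerce_value(value, target) if target else (False, value)
        if ok:
            row[name] = coerced
        else:
            # Keep the original, uncoerced value for later inspection.
            additional[name] = value
    return row, additional
```

With a schema of `{"count": "integer", "enabled": "boolean"}`, the payload `{"count": "42", "enabled": 1, "extra": [1]}` would yield the row `{"count": 42, "enabled": True}` and leave `{"extra": [1]}` in the leftover fields. An annotation-based variant (the first resolution) could apply the same coercion only to fields explicitly marked in the schema.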

mreid-moz commented 5 years ago

Another example: incompatible GC data

fbertsch commented 5 years ago

There is also expected future support for variable BQ types. Generally I'm not a fan of this approach because it makes querying more difficult, but the fact of the matter is that if the data comes in multiple types then we'd have to deal with that at query time no matter what.

Even if we use UDFs/Views, wouldn't we still need the coercion in the decoder, since the improper schemas would be invalid?