whd opened this issue 3 weeks ago
It seems correct that this ping passed validation and did not go to the error stream, because the generic JSON schemas used there do not contain per-probe fields.

In the sink, one of these fields should have gone into `additional_properties`, but for some reason this is not happening. It should probably be `wr_framebuild_time`, since `wr` is the category of this metric. We do some magic to replace dots with underscores in various places in the pipeline, and that might be involved here.
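The dot-to-underscore replacement mentioned above can be sketched as follows (a minimal illustration, not the pipeline's actual code; the function name is assumed):

```python
# Minimal sketch of the dot-to-underscore field-name normalization applied
# in various places in the pipeline (illustrative, not the actual code).
def normalize_field_name(name: str) -> str:
    """BigQuery column names cannot contain dots, so dots become underscores."""
    return name.replace(".", "_")

# A Glean metric identifier is "<category>.<name>"; after normalization the
# category/name separator is indistinguishable from an underscore in the name.
assert normalize_field_name("wr.framebuild_time") == "wr_framebuild_time"
```

This is why `wr.framebuild_time` and `wr_framebuild_time` can end up targeting the same column.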
I managed to run a unit test in https://github.com/mozilla/gcp-ingestion/compare/sink_poison?expand=1 with the stripped-down payload from above that fails with the same error as in production. With a bit of debugging this should let us figure out what's going on. I'll continue looking into this next week.
I have a minimal unit test reproducing this issue in https://github.com/mozilla/gcp-ingestion/pull/2610/commits - the first commit adds some new test inputs (passing) to validate how the sink has been working so far, and the second adds a single input equivalent to the ping in question here (failing).
The reason is that we normalize field names for BQ and transform the JSON into a form compatible with the BQ schema in place, while iterating over fields.

This happens only for map types, and only when the "dot version" of the field precedes the "underscore version", because the map structure is the only one that looks different in BQ schemas vs. JSON schemas. When we encounter the "dot version" of the field, it is normalized, its value is transformed for BQ compatibility, and the result is inserted under the "underscore version" key in the processed JSON. Later, when we process the "underscore version", we hit `ClassCastException: ArrayNode -> ObjectNode`.
In the current implementation, if any other non-map field has non-normalized equivalent(s) in the payload, the last non-normalized value ends up in the output message. Ideally we should make maps work the same way. Although if the implementation turns out to be too complicated, I don't think we need to keep this behavior the same with respect to ordering: glean_parser does not allow dots in metric names, so such pings are erroneous submissions.
/cc @BenWu
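The ordering-dependent collision described above can be sketched in Python (a hypothetical illustration with made-up names and a simplified map reshaping, not the sink's actual code):

```python
# Hypothetical sketch (not the actual gcp-ingestion sink code) of the failure
# mode: map keys are normalized and their values transformed for BQ in place,
# while iterating over the fields of the same JSON object.
def transform_for_bq(payload: dict) -> dict:
    for key in list(payload.keys()):
        value = payload.pop(key)
        normalized = key.replace(".", "_")
        if isinstance(value, dict):
            # Map types are reshaped for BQ: {"k": v} -> [{"key": "k", "value": v}]
            value = [{"key": k, "value": v} for k, v in value.items()]
        # If the "dot version" precedes the "underscore version", this insert
        # clobbers the still-unprocessed raw map stored under the normalized key.
        payload[normalized] = value
    return payload

# When iteration later reaches "wr_framebuild_time", its value is already an
# array (an ArrayNode in the Java sink), not an object -- hence the
# ClassCastException: ArrayNode -> ObjectNode.
ping = {"wr.framebuild_time": {"bucket": 1}, "wr_framebuild_time": {"bucket": 2}}
assert transform_for_bq(ping) == {"wr_framebuild_time": [{"key": "bucket", "value": 1}]}
```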
> That's because glean_parser does not allow dots in the metric name, so such pings are erroneous submissions.
@akkomar and I are in agreement that this ping should be treated as an error so I'm dropping it from the production pipeline.
I might be misunderstanding what you're saying, but isn't `wr.framebuild_time` the correct field in this case, not `wr_framebuild_time`? The metric name doesn't allow `.`, but in the JSON payload it should show up with the `.` as the separator between the category and name.

In any case, I agree that it should be updated to handle this.
> I might be misunderstanding what you're saying, but isn't `wr.framebuild_time` the correct field in this case, not `wr_framebuild_time`? The metric name doesn't allow `.`, but in the JSON payload it should show up with the `.` as the separator between the category and name.
I added a comment in https://github.com/mozilla/gcp-ingestion/pull/2610/files/749b2819368d6797c4171c42588c8ec865548b0d#r1651448014 which I hope adds some details.
Yes, `wr.framebuild_time` is the correct field in this case. However, I think whatever we do here, we should avoid encoding assumptions and rules about metric names in the sink.

/cc @akkomar
Here's a stripped-down ping demonstrating the failure. Basically, it appears that this ping passes schema validation, treating `wr.framebuild_time` and `wr_framebuild_time` as both acceptable, but then chokes when reconciling the two at a later step. If you remove one or the other, the ping gets routed correctly. This should probably be treated as an error, since a client should not be sending both of these.
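(The actual stripped-down attachment is not reproduced here; a hypothetical minimal shape with made-up values, showing both spellings side by side, would look something like this:)

```json
{
  "metrics": {
    "timing_distribution": {
      "wr.framebuild_time": { "values": { "0": 1 } },
      "wr_framebuild_time": { "values": { "0": 1 } }
    }
  }
}
```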