mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform
https://mozilla.github.io/gcp-ingestion/
Mozilla Public License 2.0
75 stars 31 forks source link

Unrouteable message due to normalization of keys #2609

Open whd opened 3 weeks ago

whd commented 3 weeks ago

/CC @akkomar

Here's a stripped down ping demonstrating the failure. Basically, it appears that this ping passes schema validation treating wr.framebuild_time and wr_framebuild_time as both acceptable but then chokes when reconciling the two at a later step. If you remove one or the other this ping gets routed correctly.

This should probably be treated as an error since a client should not be sending one or the other of these.

{
    "ping_info": {
        "seq": 0,
        "start_time": "2024-06-20T11:50+03:00",
        "end_time": "2024-06-21T10:41+03:00",
        "reason": "overdue"
    },
    "client_info": {
        "telemetry_sdk_build": "52.7.0",
        "app_display_version": "115.12.0",
        "app_channel": "esr",
        "app_build": "20240603145132",
        "architecture": "x86",
        "os": "Windows",
        "os_version": "6.1",
        "first_run_date": "2024-06-18+03:00",
        "build_date": "1970-01-01T00:00:00+00:00",
        "client_id": "e7ee6fd7-ec98-451e-b75e-f8443b46297a",
        "windows_build_number": 7600
    },
    "metrics": {
        "timing_distribution": {
            "wr.framebuild_time": {
                "values": {
                    "0": 0
                },
                "sum": 0
            },
            "wr_framebuild_time": {
                "values": {
                    "0": 0
                },
                "sum": 0
            }
        }
    }
}
extendedStackTrace: "java.lang.ClassCastException: class com.fasterxml.jackson.databind.node.ArrayNode cannot be cast to class com.fasterxml.jackson.databind.node.ObjectNode (com.fasterxml.jackson.databind.node.ArrayNode and com.fasterxml.jackson.databind.node.ObjectNode are in unnamed module of loader 'app')
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.processField(PubsubMessageToObjectNode.java:439) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.transformForBqSchema(PubsubMessageToObjectNode.java:398) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.processField(PubsubMessageToObjectNode.java:458) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.transformForBqSchema(PubsubMessageToObjectNode.java:398) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.processField(PubsubMessageToObjectNode.java:458) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.transformForBqSchema(PubsubMessageToObjectNode.java:398) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.processField(PubsubMessageToObjectNode.java:458) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.transformForBqSchema(PubsubMessageToObjectNode.java:398) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.core.transform.PubsubMessageToObjectNode$Payload.apply(PubsubMessageToObjectNode.java:344) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.io.Gcs$Write$Ndjson$LazyEncodedInput.encode(Gcs.java:67) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.io.Gcs$Write$LazyEncodedInput.get(Gcs.java:116) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.io.Gcs$Write$Batch.getByteSize(Gcs.java:158) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.io.Gcs$Write$Batch.getByteSize(Gcs.java:122) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.util.BatchWrite$Batch.add(BatchWrite.java:210) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.util.BatchWrite.lambda$apply$3(BatchWrite.java:157) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1932) ~[?:?]
    at com.mozilla.telemetry.ingestion.sink.util.BatchWrite.apply(BatchWrite.java:155) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.util.BatchWrite.apply(BatchWrite.java:26) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.config.SinkConfig$Output.apply(SinkConfig.java:162) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.mozilla.telemetry.ingestion.sink.config.SinkConfig$Output.apply(SinkConfig.java:136) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) ~[?:?]
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) ~[?:?]
    at com.mozilla.telemetry.ingestion.sink.io.Pubsub.lambda$getConsumer$2(Pubsub.java:37) ~[ingestion-sink-0.1-SNAPSHOT.jar:?]
    at com.google.cloud.pubsub.v1.MessageDispatcher$3.run(MessageDispatcher.java:441) ~[google-cloud-pubsub-1.124.1.jar:1.124.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
"
localizedMessage: "class com.fasterxml.jackson.databind.node.ArrayNode cannot be cast to class com.fasterxml.jackson.databind.node.ObjectNode (com.fasterxml.jackson.databind.node.ArrayNode and com.fasterxml.jackson.databind.node.ObjectNode are in unnamed module of loader 'app')"
akkomar commented 3 weeks ago

It seems correct that this ping passed validation and did not go into error stream, because generic JSON schemas used there do not contain per-probe fields.

In the sink one of these fields should have gone to additional_properties, but for some reason this is not happening. It probably should be wr_framebuild_time as wr is category of this metric. We are doing some magic to replace dots with underscores in various places of the pipeline and this might be involved here.

I managed to run a unit test in https://github.com/mozilla/gcp-ingestion/compare/sink_poison?expand=1 with the stripped down payload from above that fails with the same error as in production. With a bit of debugging this should let us figure out what's going on. I'll continue looking into this next week.

akkomar commented 3 weeks ago

I have a minimum unit test reproducing this issue in https://github.com/mozilla/gcp-ingestion/pull/2610/commits - first commit adds some new test inputs (passing) to validate how has the sink been working so far, second one adds a single input equivalent to the ping in question here (failing).

The reason is that we normalize field names for BQ and transform JSON to a form compatible with BQ schema in place while iterating over fields.

This is happening only for map types and only when the "dot version" of the field precedes the "underscore version". This is because map structure is the only one that looks different in BQ schemas vs JSON schema. When we encounter the "dot version" of the field it is normalized, its value is transformed for BQ compatibility and inserted into "underscore version" in the processed JSON. Later on when we process the "underscore version" we hit ClassCastException: ArrayNode -> ObjectNode.

In the current implementation, if any other non-map field has a non-normalized equivalent(s) in the payload, the last non-normalized value ends up in the output message. We should ideally make maps work the same. Although if implementation turns out to be too complicated I think we do not need to keep this behavior the same with respect to ordering. That's because glean_parser does not allow dots in the metric name, so such pings are erroneous submissions.

/cc @BenWu

whd commented 3 weeks ago

That's because glean_parser does not allow dots in the metric name, so such pings are erroneous submissions.

@akkomar and I are in agreement that this ping should be treated as an error so I'm dropping it from the production pipeline.

BenWu commented 3 weeks ago

I might be misunderstanding what yo're saying, but isn't wr.framebuild_time the correct field in this case, not wr_framebuild_time? The metric name doesn't allow . but in the json payload it should show up with the . as the separator between the category and name.

In any case, I agree that the it should be updated to handle this

akkomar commented 3 weeks ago

I might be misunderstanding what yo're saying, but isn't wr.framebuild_time the correct field in this case, not wr_framebuild_time? The metric name doesn't allow . but in the json payload it should show up with the . as the separator between the category and name.

I added a comment in https://github.com/mozilla/gcp-ingestion/pull/2610/files/749b2819368d6797c4171c42588c8ec865548b0d#r1651448014 which I hope adds some details.

Yes, wr.framebuild_time is the correct field in this case. However I think whatever we do here we should avoid encoding assumptions and rules to metric names in the sink.