mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform
https://mozilla.github.io/gcp-ingestion/
Mozilla Public License 2.0

Null pointer exception with raw streams republisher #964

Closed: whd closed this issue 4 years ago

whd commented 4 years ago

Example error GCS path: gs://moz-fx-data-prod-data/telemetry-raw_republisher/error/2019-10-31/18/error-2019-10-31T18-30-00.000Z-2019-10-31T18-40-00.000Z-0-00039-of-00060.ndjson.gz

Stack trace:

java.net.URI$Parser.parse(URI.java:3042)
java.net.URI.<init>(URI.java:588)
java.net.URI.create(URI.java:850)
org.apache.beam.sdk.options.ValueProvider$NestedValueProvider.get(ValueProvider.java:129)
com.mozilla.telemetry.decoder.Deduplicate$MarkAsSeen.processElement(Deduplicate.java:220)
com.mozilla.telemetry.decoder.Deduplicate$MarkAsSeen.processElement(Deduplicate.java:189)
com.mozilla.telemetry.transforms.MapElementsWithErrors$DoFnWithErrors.processElementOrError(MapElementsWithErrors.java:85)
com.mozilla.telemetry.transforms.MapElementsWithErrors$DoFnWithErrors$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183)
org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:335)
org.apache....

This was mentioned in https://github.com/mozilla/gcp-ingestion/issues/945 but appears to affect more data than previously thought, i.e. O(TB/day), perhaps all data. Data is making it to stage, and this looks to be a failure while attempting to deduplicate, a step that should be skipped for this job since redisUri isn't provided. Here are the template build and runtime configurations in case this is a template issue. The result is a large volume of error output in GCS that we have been safely ignoring but that is costing us some money to store.

jklukas commented 4 years ago

The structure of the republisher job is that it reads input, then branches to various destinations. For the raw republisher, the two destinations are Deduplicate.MarkAsSeen and the random sampler. It looks like every message is raising an exception in Deduplicate.MarkAsSeen, but this has no effect on the other path, so the republisher is still functioning fine.
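To illustrate why a failure in one branch leaves the other untouched, here is a minimal plain-Java sketch. It does not use Beam or any code from this repository; `BranchIsolationSketch`, `BranchResult`, and `runBranch` are hypothetical names, and the try/catch only approximates how a per-element error output behaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class BranchIsolationSketch {
  /** Hypothetical holder for one branch's successful outputs and captured failures. */
  static final class BranchResult<T> {
    final List<T> outputs = new ArrayList<>();
    final List<String> errors = new ArrayList<>();
  }

  /** Applies fn to every message, diverting exceptions to the branch's own error list. */
  static <T> BranchResult<T> runBranch(List<String> messages, Function<String, T> fn) {
    BranchResult<T> result = new BranchResult<>();
    for (String message : messages) {
      try {
        result.outputs.add(fn.apply(message));
      } catch (RuntimeException e) {
        // The failure is recorded for this branch only; other branches never see it.
        result.errors.add(message + ": " + e);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a", "b", "c");
    // Stand-in for a step that throws on every message.
    BranchResult<String> failing = runBranch(input, m -> {
      throw new NullPointerException();
    });
    // Stand-in for the sampler path, which reads the same input independently.
    BranchResult<String> passing = runBranch(input, m -> m.toUpperCase());
    System.out.println(failing.errors.size() + " errors, " + passing.outputs.size() + " outputs");
  }
}
```

In this sketch every input message produces one error record in the failing branch and one normal output in the other, which matches the symptom described above: a working republisher plus a steady stream of error records.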

In reading through the code again, it seems like we should be correctly ignoring these messages if the URI is null. So I need to think more about that.
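For reference, the top frames of the trace are consistent with `URI.create` being called on a null string, since `java.net.URI.create(null)` throws a `NullPointerException`. The sketch below shows that behavior and one possible guard. It is an assumption about the failure mode, not the actual `Deduplicate` code: `NullSafeUriSketch`, `parseOptional`, and `markAsSeen` are hypothetical names, and the real fix may belong elsewhere (for example, in the function passed to `NestedValueProvider`).

```java
import java.net.URI;
import java.util.Optional;

public class NullSafeUriSketch {
  /** Returns true if URI.create(null) throws NullPointerException. */
  static boolean createThrowsOnNull() {
    try {
      URI.create(null);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  /** Hypothetical guard: yields an empty Optional when no URI string is configured. */
  static Optional<URI> parseOptional(String raw) {
    return Optional.ofNullable(raw).map(URI::create);
  }

  /** Hypothetical stand-in for the mark-as-seen step; it skips work when the URI is absent. */
  static String markAsSeen(String rawUri, String messageId) {
    return parseOptional(rawUri)
        .map(uri -> "mark " + messageId + " at " + uri.getHost())
        .orElse("skip " + messageId);
  }

  public static void main(String[] args) {
    System.out.println(createThrowsOnNull());
    System.out.println(markAsSeen(null, "id-1"));
    System.out.println(markAsSeen("redis://localhost:6379", "id-1"));
  }
}
```

The point of the sketch is ordering: a null check placed after the URI has already been parsed never runs, because the parse itself throws first. Checking the raw string before calling `URI.create` avoids that.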