mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform
https://mozilla.github.io/gcp-ingestion/

Decide on AET ecosystem_anon_id in header vs. payload #1256

Closed jklukas closed 4 years ago

jklukas commented 4 years ago

The draft docs for AET support in the pipeline (https://github.com/mozilla/gcp-ingestion/pull/1109) propose sending ecosystem_anon_id as a header rather than within the payload, but @chutten and I have recently started up a conversation about whether this is appropriate. This issue is intended to explore the possible approaches and lead to a decision.

cc @whd

jklukas commented 4 years ago

Possible Approaches

Given that the interface for clients to send data to the pipeline is a POST request to an http endpoint, and that we must be able to model that request as a Pub/Sub message, I see essentially three ways to send ecosystem_anon_id:

  1. Embed an ecosystem_anon_id field in the JSON payload
  2. Send the value as a header
  3. Include the value as a query param in the URI
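
For concreteness, here is a rough sketch of what each of those three options could look like on the wire. The endpoint path, header name, and payload contents below are illustrative placeholders, not decided interfaces:

    import requests

    # Hypothetical values for illustration only.
    ENDPOINT = "https://incoming.telemetry.mozilla.org/submit/telemetry/some-doc-id/account-ecosystem/Firefox/1"
    ANON_ID = "eyJhbGciOiJSU0EtT0FFUCIsImVuYyI6IkEyNTZHQ00ifQ..."  # JOSE-encrypted ecosystem_anon_id
    payload = {"payload": {"duration": 42}}

    # 1. Embed the value as a field in the JSON payload.
    requests.post(ENDPOINT, json={**payload, "ecosystem_anon_id": ANON_ID})

    # 2. Send the value as an HTTP header (header name is a placeholder).
    requests.post(ENDPOINT, json=payload, headers={"X-Ecosystem-Anon-Id": ANON_ID})

    # 3. Include the value as a query param in the URI.
    requests.post(ENDPOINT, json=payload, params={"ecosystem_anon_id": ANON_ID})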

The main benefit of using a header is that it provides separation from the payload. We would be able to read the header, decrypt its value, and throw away the original material without ever having to touch the payload. This simplifies error handling, as we don't need to worry about the case where the JSON payload is unparseable; we can defer that problem to the Decoder.

Another concern is about accidentally passing an unprocessed ecosystem_anon_id value to further stages of the pipeline and having it hit BigQuery storage where it could be seen by analytics users, thus breaking the contract of AET. Our pipeline design naturally ignores unexpected HTTP headers whereas it attempts to preserve unexpected payload fields by storing leftover JSON in additional_properties. If a client sends an ecosystem_anon_id value with a typo in the field name, for example, we could easily end up passing that all the way through the pipeline without properly decrypting it.
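
As a loose illustration of that risk (not the pipeline's actual code, just a sketch of the additional_properties behavior described above), a field name that doesn't match the schema is carried along as leftover JSON rather than being decrypted:

    import json

    # Hypothetical set of schema-known top-level field names for an ecosystem ping.
    KNOWN_FIELDS = {"ecosystem_user_id", "payload"}

    def split_unknown_fields(doc):
        """Roughly mimic how unknown top-level fields end up in additional_properties."""
        known = {k: v for k, v in doc.items() if k in KNOWN_FIELDS}
        additional = {k: v for k, v in doc.items() if k not in KNOWN_FIELDS}
        if additional:
            known["additional_properties"] = json.dumps(additional)
        return known

    # A typo in the field name means no stage decrypts it; the raw JOSE value
    # would ride along toward BigQuery as leftover JSON.
    doc = {"payload": {"duration": 42}, "ecosystem_anonid": "eyJhbGciOi..."}
    print(split_unknown_fields(doc)["additional_properties"])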

Length limitations

When I proposed the header-based design, I was focused on the primary case of ecosystem pings that include a single ecosystem_anon_id. It could also be extended to the case of anon_id change events where we would need to send a new ecosystem_anon_id along with a previous_ecosystem_anon_id value as a separate header. That design doesn't consider the case of deletion requests, though, where we've discussed that we might need to send a whole array of previous values. I think at that point, the header-based approach breaks down and we would have to codify the values inside the json payload.

Likewise, I think we can rule out query params due to length limitations. Sending a single ecosystem_anon_id value (up to 800 characters) in the URI would probably be fine, but again this becomes untenable for the case where we need to support an array of previous values for deletion request pings; with request lines commonly capped at around 8 KB by servers and proxies, even a modest list of previous values would risk hitting that limit.

Layer violation concerns

@chutten brought up the concern that headers are logically meant to describe details of the transport layer. Sending data in headers that is ultimately meant to be processed and merged into the payload violates that expected layering.

Schema management concerns

If we put ecosystem_anon_id values into the payload, we end up with a different expected schema at the EcosystemDecryptor stage vs. at the Decoder and BigQuery stages. The input to EcosystemDecryptor should have fields named like ecosystem_anon_id that have JOSE payload strings as values. The output of EcosystemDecryptor should have fields named like ecosystem_user_id with decrypted UUID-like values.
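
A minimal sketch of the two shapes in question (all values here are made up):

    # Input to EcosystemDecryptor: the field holds a compact JOSE string.
    pre_decryption = {
        "ecosystem_anon_id": "eyJhbGciOiJFQ0RILUVTIiwiZW5jIjoiQTI1NkdDTSJ9....",
    }

    # Output of EcosystemDecryptor: the field is renamed and holds the decrypted,
    # UUID-like value that the Decoder and BigQuery stages expect.
    post_decryption = {
        "ecosystem_user_id": "2f1a67d2-8c1b-4d6e-9c5a-4f3b2a1d0e9f",
    }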

To support this, we could maintain separate schemas for validating the input payload and the output payload. The EcosystemDecryptor could change the assigned document_type to switch between them.

Or, we could build the schemas assuming the output fields (ecosystem_user_id), and then leave it to code within EcosystemDecryptor to check for the pre-decryption field names.

jklukas commented 4 years ago

After writing the above, I think I've convinced myself that putting the encrypted values directly into the payload is the only sufficiently flexible option. It also seems like the least surprising design.

The downside is that it adds complexity to the pipeline implementation. At this point, I want to make sure to discuss these concerns with @whd for the operations perspective and then I'll try to sort out a proposal for all the details about how we'll lay out schemas and how we'll provide safeguards to prevent accidentally passing on encrypted values to further stages of the pipeline.

chutten commented 4 years ago

From Firefox Desktop's perspective, anything outside of the payload isn't saved between ping assembly and ping sending. Headers, host, and path are all determined at ping sending time, which can be delayed arbitrarily long after ping assembly (and may involve a round trip through the disk).

This is not insurmountable (we could extract the id from inside the payload at ping sending time), though it would be expensive to add to Pingsender, since it presently treats all payloads as opaque. (We could choose not to use Pingsender for the relevant ping types.)

From the Glean SDK's perspective a similar limitation presently exists, though the plan is to lift that limitation for other reasons; the Glean SDK is more flexible in this regard.

All that being said, a solution that lets us keep the ids inside the payloads they identify will be the tidiest from the client perspective (Firefox Telemetry and Glean SDK both).

rfk commented 4 years ago

That design doesn't consider the case of deletion requests, though, where we've discussed that we might need to send a whole array of previous values. I think at that point, the header-based approach breaks down and we would have to codify the values inside the json payload

I wonder if the "send a whole array of previous values" approach may not be necessary if we maintain the list of historical ids inside bigquery, which we might find ourselves doing anyway in order to cope with changes in the id over time. The details of how we solve that "canonicalize to a single id" problem might affect the shape of the "delete all data for this user" problem.

whd commented 4 years ago

At this point all parties are pretty much convinced we should store the encrypted identifier(s) in the payload, and my concerns are about the implementation of that version of the story.

We discussed earlier today some advantages and disadvantages of splitting these pings out to their own pipeline family at the edge. After some consideration, I think we should investigate the approach of annotating AET pings with pipeline-supplied headers at ingestion time, and using that annotation as an implementation detail in downstream processing (prior art: X-Pipeline-Proxy). Additional considerations we're balancing are:

(a) Trying to minimize the number of pubsub topics involved

In an ideal world the semantics are much nicer, as we can simply add another layer of pubsub for this rerouting. This is prohibitively expensive when dealing with large topics such as structured-raw or telemetry-raw (n.b. this is not an issue if we split these out to their own pipeline family at the edge, since that topic will be small).

(b) Maintaining parity as far as possible with Pioneer

I'd like the configuration/code for performing decryption to be substantially the same in both stacks.

The idea is to add a generalized header such as X-Pipeline-Do-Not-Store (somewhat terrible name given the acronym but illustrative) based on URI at the nginx layer, something akin to:

    set $aet "";
    if ($uri ~* "^.*/aet/.*$") {
        set $aet 1;
    }
    proxy_set_header X-Pipeline-Do-Not-Store $aet;

There are a couple alternate flows/topologies that we could construct based on this header. Here's one I'm presently inclined toward:

In this manner we process all AET pings twice in decoders and sinks (both pre- and post-AET decryption), but the encrypted identifier is never written to BigQuery. The "decryption stage" that does the AET decrypt implements the bare minimum decoding required to decrypt, and has different error-processing semantics based on the presence or absence of the D-N-S header.
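
As a sketch of what those differing error-processing semantics could look like in the decryption stage (the attribute name and helpers below are hypothetical, not a committed interface):

    class DecryptionError(Exception):
        pass

    def decrypt_fields(payload: dict) -> dict:
        """Stand-in for the real JOSE/KMS decryption; details elided."""
        if "ecosystem_anon_id" not in payload:
            raise DecryptionError("no encrypted field present")
        out = dict(payload)
        out["ecosystem_user_id"] = "<decrypted value>"
        del out["ecosystem_anon_id"]
        return out

    def handle_message(attributes: dict, payload: dict) -> dict:
        """Error semantics differ based on the Do-Not-Store marker set at the edge."""
        marked_do_not_store = attributes.get("x_pipeline_do_not_store") == "1"
        try:
            return decrypt_fields(payload)
        except DecryptionError:
            if marked_do_not_store:
                # AET traffic: never forward a still-encrypted payload toward storage.
                raise
            # Non-AET traffic: pass through untouched for normal decoding.
            return payload

    # Example: an unmarked, non-AET payload falls through to normal decoding.
    print(handle_message({"x_pipeline_do_not_store": ""}, {"payload": {}}))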

JSON schemas could contain encryption metadata like so:

    meta: {
      encryption: {
        encrypted_field: unencrypted_field  // d-n-s information could also be redundantly encoded here
      }
    },

    unencrypted_field: <regular_json_schema>

The "decryption stage" is simply a decoder that reads the encryption metadata (and job config for KMS particulars) and replaces all encrypted fields with their replacement fields. This can be used as a preliminary stage within the pioneer decoder logic (i.e. there is no separate decryption job for pioneer), or as a separate decoder job as outlined for AET. The standalone job would need to output in raw format for further reprocessing and final schema validation/geoip/etc. by the general decoder.
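
A rough sketch of that metadata-driven replacement, following the metadata shape illustrated above (the decryption call is stubbed out, since the KMS particulars come from job config):

    import json

    def decrypt_jose(value: str) -> str:
        """Stand-in for the real JOSE/KMS decryption; not the pipeline's actual code."""
        return "<decrypted:%s>" % value[:8]

    def apply_encryption_metadata(schema: dict, doc: dict) -> dict:
        """Replace each encrypted field named in the schema's meta.encryption map
        with its decrypted replacement field."""
        mapping = schema.get("meta", {}).get("encryption", {})
        out = dict(doc)
        for encrypted_field, replacement_field in mapping.items():
            if encrypted_field in out:
                out[replacement_field] = decrypt_jose(out.pop(encrypted_field))
        return out

    schema = {"meta": {"encryption": {"ecosystem_anon_id": "ecosystem_user_id"}}}
    doc = {"ecosystem_anon_id": "eyJhbGciOi...", "payload": {"duration": 42}}
    print(json.dumps(apply_encryption_metadata(schema, doc)))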

The above method of schema management would mean the standard decoder and MSG would continue to work without changes, since the input data to the decoder will be post-decryption or summarily dropped for containing a D-N-S header.

Regardless of whether we adopt the above topology wholesale, I think we should consider the mechanism of supplying custom headers as an alternate path forward here.

jklukas commented 4 years ago

There is a dependency here on URI structure

Much of the complexity that @whd proposes could be avoided if we can commit to all AET pings being submitted under a common URI prefix /submit/account-ecosystem. I am discussing with client folks right now how feasible it would be to support that restriction in Glean and in desktop. If we can swing that restriction, then I think we can avoid much of the complexity here and provision a proper separate pipeline family for AET, which reduces this to a discussion of what to do with the payload-bytes-raw sink in that pipeline family.

jklukas commented 4 years ago

Potential solution: add an AET decryptor job per pipeline family

I just zoomed with @whd and we may have agreed on a way forward here.

We have two major "pipeline families", one for telemetry (anything submitted under /submit/telemetry) and one for structured ingestion. For each of these two pipeline families, we will run a separate AETDecryptor Dataflow job.

At the ingestion-edge service, we will add routing rules:

/submit/telemetry//account-ecosystem/ -> telemetry-aet PubSub topic
/submit/account-ecosystem/ -> structured-aet PubSub topic

These topics will each feed into an instance of AETDecryptor. That job will:

  • Replace the ecosystem_anon_id with an ecosystem_user_id field in the payload

From there, the ping is processed like any other and flows through subsequent stages of the pipeline.

Note that in the AETDecryptor, pretty much our only recourse is to throw out the ping if it doesn't look the way we expect, so there is a risk of data loss there.

Ideally, the AETDecryptor job would be given a different service account from the rest of the pipeline such that we can limit access to KMS to just that stage of the pipeline.

whd commented 4 years ago

At the ingestion-edge service, we will add routing rules:

/submit/telemetry//account-ecosystem/rest:path -> telemetry-aet PubSub topic
/submit/account-ecosystem/rest:path -> structured-aet PubSub topic

I was under the (quite possibly incorrect) impression that structured AET pings would also be sent to each individual namespace (e.g. Glean-style, org-mozilla-firefox would send its own account-ecosystem ping to /submit/org-mozilla-firefox/account-ecosystem/...). The goal of mapping these pings to a per-pipeline-family configured topic is the same in either case.

These topics will each feed into an instance of AETDecryptor. That job will:

  • Replace the ecosystem_anon_id with an ecosystem_user_id field in the payload

I am still partial to specifying this replacement generically via schema metadata (and using the same mechanism for Pioneer). We could have the metadata include the fields that are encrypted and their replacement fields, as well as a notion of sensitivity to guide error semantics. Effectively, MPS AET schema metadata would map ecosystem_anon_id to ecosystem_user_id and indicate that the job should throw out data if decryption fails. The AETDecryptor is then basically a pre-Decoder (also applicable to Pioneer) that uses schema metadata as its configuration. The standard Decoder ignores this metadata and processes pings using the standard schema (as would all downstream configuration, e.g. MSG).
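
Extending the earlier metadata sketch with that notion of sensitivity (field names here are placeholders, not a settled format):

    # Hypothetical schema metadata: encrypted field -> replacement field, plus
    # error semantics to apply when decryption fails.
    encryption_metadata = {
        "ecosystem_anon_id": {
            "replacement": "ecosystem_user_id",
            "on_decryption_failure": "drop",  # AET: throw out the ping entirely
        }
    }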

If this mechanism is generic enough we could consider replacing -aet with something less specific, but I am not recommending this at present.

Ideally, the AETDecryptor job would be given a different service account from the rest of the pipeline such that we can limit access to KMS to just that stage of the pipeline.

Each job type currently has its own service account, so this will be the case here as well. It may make sense to have per-pipeline-family service accounts at some point, but that won't matter for this case since they will be accessing the same KMS.

jklukas commented 4 years ago

We have made the decision here to include the values in the payload rather than in headers. I'm closing this out and I've tried to capture the other design concerns here in new draft docs: https://github.com/mozilla/gcp-ingestion/pull/1264