privacysandbox / protected-auction-key-value-service

Protected Auction Key/Value Service
Apache License 2.0

Composability: Key Value/Ad Retrieval Service and Pluggable Storage/Query Engines #50

Open thegreatfatzby opened 6 months ago

thegreatfatzby commented 6 months ago

So something I've been noodling on for a while is that we (and I include ASAPI in "we") are subtly coupling behavior and implementation/technology in some places. This is completely understandable given the phase of development we're in, the level of general understanding across the industry, the newness (to this industry at least) of integrating DP into inputs/outputs to functions (in the most general sense), etc. Arguably it's a good/necessary thing to get off the ground...but I would imagine we'd all like to decouple complicated/hard things like storage/query engines from complicated things like the "privacy preservation" layer if we could.

So I'd like to get thoughts on a particular issue I see coming soon, which is the ability to store and retrieve data in more nuanced ways while still enforcing privacy.

In our experience at App-Xandr-Soft, data storage and retrieval is not an easy problem to solve with a single solution. Here are some KV/AR issues we've run into for MSAN, Monetize, and Invest:

  1. Scale/Performance/Cost: Different types of data have very different SLAs/QoS/etc in our platform. For some data the write/read load is great enough that we've paid a premium for Aerospike on SSDs; for others we've been able to get by with a free-but-tuned Postgres on normal'ish hardware; in the middle we use Scylla, I think on normal'ish hardware.
  2. Data Retrieval Patterns: not all of our hot path data is queried on a pure key-value basis. As examples, we do IP range queries, geo-polygon lookups, and Approximate Nearest Neighbors for different cases. In these cases, the storage engine is generally doing interesting data layout to optimize the queries we care about, and the query engine is tuned for that layout.
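
To make the "not pure key-value" point concrete, here is a minimal sketch of an IP range lookup. It is only an illustration: the `IpRangeIndex` class, its sorted in-memory layout, and the sample ranges are assumptions made for this example, not anything from the KV service or from our production systems.

```python
import bisect
import ipaddress
from typing import List, Optional, Tuple


class IpRangeIndex:
    """Hypothetical index over sorted, non-overlapping IP ranges."""

    def __init__(self, ranges: List[Tuple[str, str, str]]) -> None:
        # Convert each (start, end, value) row to integers and sort by start,
        # so a lookup becomes a binary search rather than a hash probe.
        rows = sorted(
            (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), value)
            for lo, hi, value in ranges
        )
        self._starts = [row[0] for row in rows]
        self._rows = rows

    def lookup(self, ip: str) -> Optional[str]:
        needle = int(ipaddress.ip_address(ip))
        # Find the last range whose start is <= the address.
        i = bisect.bisect_right(self._starts, needle) - 1
        if i < 0:
            return None
        _, hi, value = self._rows[i]
        # The address matches only if it also falls before the range end.
        return value if needle <= hi else None


index = IpRangeIndex([
    ("10.0.0.0", "10.0.0.255", "segment-a"),
    ("192.168.1.0", "192.168.1.127", "segment-b"),
])
print(index.lookup("10.0.0.42"))      # inside the first range
print(index.lookup("192.168.1.200"))  # past the end of the second range
```

A plain key-value store cannot answer this without enumerating every address, which is why the data layout and the query engine end up being designed together.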

I suspect in the long run it will be very difficult to support ad-tech functionality if we don't support this type of scaling and retrieval.

At a gross level of simplification, in simple app development this seems like something we'd solve with an adapter pattern, so I'm interested in discussing pluggable storage/query backend engines for the KV/AR service. Some strawmen I'll put out for open fire:

  1. Keep the data and logic within this project but allow for pluggable storage and query engines. In this case a company that really cares about DiskANN, Geo Polygon based lookups, etc, could write/use/buy a storage/query plugin that implements the PrivateKVARBackendInterface and then query that from their UDFs for KVs or creatives for further bidding.
  2. Keep the query plugin idea, but allow for that query plugin to hit Managed Service Databases in the cloud.
  3. Keep the query plugin idea, but allow for that query plugin to hit other services that are in TEEs.
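
As a rough sketch of the adapter idea behind these strawmen: only the name `PrivateKVARBackendInterface` comes from the text above. The `lookup` method, both backend classes, and `run_udf` are hypothetical names invented for illustration, and this is not the service's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PrivateKVARBackendInterface(ABC):
    """Hypothetical read-only contract a storage/query plugin would implement."""

    @abstractmethod
    def lookup(self, keys: List[str]) -> Dict[str, str]:
        """Return the entries matching the given keys."""


class InMemoryKVBackend(PrivateKVARBackendInterface):
    """Exact-match key-value lookups over an in-process dict."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = dict(data)

    def lookup(self, keys: List[str]) -> Dict[str, str]:
        return {k: self._data[k] for k in keys if k in self._data}


class PrefixScanBackend(PrivateKVARBackendInterface):
    """Stand-in for a non-key-value engine: each key is treated as a prefix."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._items = sorted(data.items())

    def lookup(self, keys: List[str]) -> Dict[str, str]:
        return {
            k: v
            for prefix in keys
            for k, v in self._items
            if k.startswith(prefix)
        }


def run_udf(backend: PrivateKVARBackendInterface, keys: List[str]) -> Dict[str, str]:
    # The UDF depends only on the interface, so the engine can be swapped
    # without touching the calling code.
    return backend.lookup(keys)


data = {"geo:us:ny": "creative-1", "geo:us:ca": "creative-2", "geo:fr": "creative-3"}
print(run_udf(InMemoryKVBackend(data), ["geo:fr", "geo:de"]))
print(run_udf(PrefixScanBackend(data), ["geo:us"]))
```

The interface here is deliberately read-only; whether a real plugin contract could stay that narrow is part of what is up for discussion.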

(1) has the clear downside of needing to implement both storage and query adapters and to store data on the same servers, which will be operationally challenging, although it would seem the safest step from a privacy perspective.

(2) would have some of the same issues as discussed with general composability here, and I can currently only wave my hands at a general idea of a managed service database in which the cloud provided interface ensures certain guarantees around observability of queries and data, although I'd guess it's not strictly impossible.

(3), same issue as (2) but harder I'd think.

So, obvious issues to discuss, but I think very worth discussing. ¡Vamos!

peiwenhu commented 5 months ago

Hi @thegreatfatzby, allowing the TEE to query untrusted systems on a per-request level would ship information from within the TEE to the outside. So any dependency of a TEE also has to be hosted within TEEs. It seems to me only (3) satisfies that. Or maybe I'm missing something?

thegreatfatzby commented 5 months ago

I think (1) would satisfy it as well, since in that scenario I'm hypothesizing that the storage and retrieval would still stay on the KV-AR server.

peiwenhu commented 5 months ago

I see. If it stays within the TEE, it's essentially part of the server - since it has access to the same level of information as the core server. So the plug-in would be more in the sense of allowing external contributions to the codebase rather than allowing additional functionality without the same level of scrutiny that the core server code would experience.

External contributions on the advanced features make sense though.

thegreatfatzby commented 5 months ago

@peiwenhu for (2) above, the managed service option, I wanted to dig into this more...are the issues here that the operator would potentially have observability into:

  1. Any switches between the TEE and the service
  2. The service itself?

And then they could in theory write out?

For writing, I would think we could allow only reads in the drivers made available in the TEE. Observability in switches seems like a general concern, not one specific to this situation, so understood on that. In theory, could a database vendor choose to make an attested version of their software that could run in confidential computing environments, maybe with less functionality than normal as needed?

peiwenhu commented 5 months ago

Yes, (1) and (2) are the main concerns. (Writing versus reading doesn't really matter at first, since both effectively ship information out of the TEE as long as there's some observability. When there's no observability, yes, writing would become a problem.)

In theory, if a DB vendor can make a version that runs in confidential computing environments and is recognized and attested by the same or an equivalent standard as the TEE server itself, my personal opinion is that it is possible for the TEE server to use it. (More discussion is needed to form an official position from Privacy Sandbox.)