Executive summary

We must prove that our cloud infrastructure matches the Docker images, Kubernetes resources and Terraform resources in our open source repositories. Simply asking people to trust us is not an option: they must be certain that we're not spying on them, selling their data, running a mass surveillance programme for the Five Eyes or censoring people.

Another (equally important) reason to do this is to protect the Relaycorp SRE team from powerful adversaries who might secretly try to force us (collectively or individually) to give away certain metadata or censor certain users/services. If every change or external access to the infrastructure is independently verified, such coercion could not go unnoticed, which should remove the incentive to attempt it. (Unfortunately, an attacker unaware of this measure may still target us, but we can mitigate this by prominently advertising that our cloud infrastructure is independently verified.)

We basically have to prove two things: that our cloud infrastructure is exactly what people can find on GitHub, and that we don't have any backdoors.

Why? Relaynet is end-to-end encrypted and doesn't leak PII

Indeed, those two properties make Relaynet apps immune to a wide range of privacy threats you'd tend to find in Internet apps. However, we could theoretically still infer the following:

Who's talking to whom: Each device running Relaynet will have a globally unique address derived from a public key (analogous to Bitcoin addresses), so, theoretically, we -- as the operator of a Relaynet-Internet Gateway -- could infer who's talking to whom because we'd have the address of the two private gateways in any communication. Even after https://github.com/relaynet/specs/issues/27 has been implemented, the operator of the public gateway could still infer who's talking to whom if both peers are served by the same public gateway.
Which service is being used: When interacting with certain centralised services, we could identify the service the user is using, because messages bound for the service's servers may have a URL like https://relaynet.twitter.com. This doesn't apply to decentralised services, or centralised services where the provider chooses to run their server-side app behind a Relaynet-Internet Gateway.

Additionally, we need to log the IP addresses of end users and couriers to make sure our systems are being used fairly and to triage production issues.

Finally, it's likely we'll eventually have to block certain centralised services to comply with UK/US legislation, in which case we'd have to prove that we're only blocking the services listed publicly. (This doesn't apply to decentralised services, which we could never block.)

How? Cloud provenance is not a thing yet

Option A: Google Trillian Logs

In an ideal world, our cloud providers (Terraform Cloud, Kubernetes, GCP, Mongo Atlas and Cloudflare) would use a tool like Google Trillian to log provisioning, deprovisioning and access events. This would allow us to broadcast logs so that anyone anywhere can verify the integrity of our cloud infrastructure.
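To make the idea of a verifiable log more concrete, here is a minimal sketch of the Merkle tree hashing that transparency logs are built on. It follows the hashing rules of RFC 6962 (Certificate Transparency) rather than Trillian's actual API, and the event strings are made up for illustration.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    # Leaves are prefixed with 0x00 so a leaf can never be mistaken for an inner node.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Inner nodes are prefixed with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list[bytes]) -> bytes:
    """Return the Merkle tree hash of the log entries, in order."""
    if not entries:
        return hashlib.sha256(b"").digest()
    if len(entries) == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly smaller than len(entries).
    split = 1
    while split * 2 < len(entries):
        split *= 2
    return node_hash(merkle_root(entries[:split]), merkle_root(entries[split:]))


# Hypothetical provisioning/access events, for illustration only.
events = [
    b"provision: gateway deployment",
    b"access: database read by SRE",
    b"deprovision: old node pool",
]
published_root = merkle_root(events)

# Rewriting any past event changes the root, so anyone holding the
# published root would notice that history was altered.
tampered = [events[0], b"access: (removed)", events[2]]
assert merkle_root(tampered) != published_root
```

The point of the sketch is that a single published hash commits the provider to the whole event history, which is what would let anyone check the log without having to trust us.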

We'd essentially be moving the provenance issue up in the chain, and it'd be up to cloud providers to honour their contractual obligations with Relaycorp and comply with applicable legislation. They'd have a lot to lose if they don't.

But this isn't really an option for the foreseeable future.

Option B: Ask a reputable, independent third party to audit our infrastructure in real time

They'd basically get read-only access to the configuration of our cloud resources (but no access to the data inside), as well as their (de)provisioning and access logs. With this level of access, they could operate a system 24/7 to monitor our cloud resources and make sure they match the public Docker images, Kubernetes resources and Terraform resources.
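As a rough sketch of what "making sure they match" could mean in practice, the monitor could fingerprint each live resource configuration and compare it with the one published in the repositories. The resource names and fields below are hypothetical, and a real tool would need to read them from the cloud providers and from GitHub.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Hash a resource configuration in a canonical form (sorted keys)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(published: dict[str, dict], live: dict[str, dict]) -> dict[str, str]:
    """Compare live resources against the published ones, keyed by resource name."""
    drift = {}
    for name in sorted(published.keys() | live.keys()):
        if name not in live:
            drift[name] = "missing from live infrastructure"
        elif name not in published:
            drift[name] = "not declared in the public repositories"
        elif fingerprint(published[name]) != fingerprint(live[name]):
            drift[name] = "configuration differs"
    return drift


# Hypothetical data: what the public repositories declare vs. what is running.
published = {
    "gateway": {"image": "example/gateway@sha256:aaa", "replicas": 3},
}
live = {
    "gateway": {"image": "example/gateway@sha256:bbb", "replicas": 3},
    "debug-shell": {"image": "example/shell@sha256:ccc", "replicas": 1},
}

drift = find_drift(published, live)
for name, reason in drift.items():
    print(f"{name}: {reason}")
```

Comparing canonical hashes rather than raw files means the auditor only needs the configuration, not the data inside the resources.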

I don't think a software tool like this exists yet, so we'll have to build it and make it open source. This tool has to be trivial to deploy, run and upgrade.

This tool should effectively make sure that provisioning and deprovisioning events match changes to cloud resources on GitHub. The tool could also consume access logs so that this independent party is alerted to any direct access to the DB, for example. If we need to access the DB, we should justify that access to them (e.g., to investigate a security vulnerability).
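A minimal sketch of that reconciliation, assuming each (de)provisioning event can be tied to a commit and each access event to a justification we filed with the auditor. The event shape, commit hashes and justification ids are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfraEvent:
    kind: str       # "provision", "deprovision" or "access"
    resource: str
    reference: str  # commit hash for (de)provisioning, justification id for access


def unexplained_events(events, known_commits, justifications):
    """Return the events the independent party should be alerted about."""
    alerts = []
    for event in events:
        if event.kind in ("provision", "deprovision"):
            if event.reference not in known_commits:
                alerts.append(event)
        elif event.kind == "access":
            if event.reference not in justifications:
                alerts.append(event)
        else:
            # Unknown event kinds are treated as suspicious by default.
            alerts.append(event)
    return alerts


# Hypothetical inputs, for illustration only.
known_commits = {"3f2a9c1", "8b07e44"}
justifications = {"J-1"}
events = [
    InfraEvent("provision", "gateway", "3f2a9c1"),
    InfraEvent("provision", "debug-shell", "0000000"),
    InfraEvent("access", "database", "J-1"),
    InfraEvent("access", "database", ""),
]

alerts = unexplained_events(events, known_commits, justifications)
for alert in alerts:
    print(f"ALERT: unexplained {alert.kind} on {alert.resource}")
```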

This would make offsite backups tricky, because we'd need a secret key to decrypt the backups if we need to restore them. One way to address this is by splitting the key, and having their part of the key available on demand in the tool. But this would introduce two additional challenges:

We'll have to do an event similar to a root CA key signing ceremony when generating and splitting the key.
The tool would have to be highly available: if we need to restore a backup, we have to be able to do it almost instantly -- with no advance notice or request. (Of course, they'd still be alerted if we retrieve it and we'd have to justify why we did it.)
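To illustrate the key splitting, here is a minimal sketch of a two-party XOR split, where neither share alone reveals anything about the backup key. This is only one possible scheme and an assumption on my part (Shamir's secret sharing would be the usual choice for more than two parties); the function names are hypothetical.

```python
import secrets


def split_key(key: bytes) -> tuple[bytes, bytes]:
    """Split a key into two shares; both are needed to recover it."""
    share_a = secrets.token_bytes(len(key))
    share_b = bytes(k ^ a for k, a in zip(key, share_a))
    return share_a, share_b


def combine_shares(share_a: bytes, share_b: bytes) -> bytes:
    """Recover the original key from the two shares."""
    if len(share_a) != len(share_b):
        raise ValueError("Shares must have the same length")
    return bytes(a ^ b for a, b in zip(share_a, share_b))


# During the ceremony: generate the backup key and split it.
backup_key = secrets.token_bytes(32)
our_share, auditor_share = split_key(backup_key)

# When restoring a backup: we fetch the auditor's share from the tool
# (which would alert them) and recombine it with ours.
recovered_key = combine_shares(our_share, auditor_share)
assert recovered_key == backup_key
```

With a scheme like this the original key only needs to exist in full during the ceremony and during a restore.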

Option C: Deploy an independent tool that tracks our infrastructure in real time

We'd leverage the tool described in Option B, but we'd deploy it ourselves to a separate GCP project whose audit logs are publicly available.

Publishing audit logs is a bit risky, since they might (occasionally) contain sensitive information or PII about Relaycorp staff, which is why we're not making audit logs publicly available in the GCP projects hosting the services.

This option has no dependencies on third parties, so it seems like the most likely approach to begin with.

Provenance is necessary but not sufficient to gain trust

There are many more things we have to do to gain people's trust, including non-technical measures such as transparency when dealing with law enforcement (I think Signal is an example to follow in this regard).

relaycorp / cloud-gateway

Prove that our cloud infrastructure matches the code in our open source repositories #8