opendatahub-io-contrib / datamesh-platform


Can CI/CD test notebooks that need Trino/AWS access? #27

Open MichaelTiemannOSC opened 1 month ago

MichaelTiemannOSC commented 1 month ago

I don't think there's a way we can make notebooks run in CI/CD and safely protect AWS secrets or JWT tokens. An attacker can easily introspect their environment and steal keys from the service account by creating a pull request and examining outputs. I think therefore we need to be doubly careful that there are no credentials within the service account, or they are sufficiently well hidden that they cannot be extracted by mere mortals. It's too easy to look at environment variables (where keys are currently stashed).

This code is how we do stuff in our own environments:

```python
# Load environment variables from credentials.env
import osc_ingest_trino as osc  # assuming the osc-ingest-tools helper module

osc.load_credentials_dotenv()
```

This loads a file not visible to GitHub (credentials.env) that is chock-full of secrets and injects them into the Python environment. Since the person running the scripts owns and controls their own credentials.env file, that's not technically a security leak. But if the CI/CD account were to have its own credentials.env file, then a user could dump the environment and secrets would leak from CI/CD to that malicious user.
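For context, and as a hedged sketch only: the helper presumably behaves like a plain python-dotenv load, so everything in the file ends up as ordinary environment variables visible to any code in the process. The variable names below are hypothetical placeholders.

```python
# Rough, assumed equivalent of osc.load_credentials_dotenv() using python-dotenv.
# Variable names (AWS_ACCESS_KEY_ID, TRINO_PASSWD, ...) are placeholders only.
import os
from dotenv import load_dotenv

load_dotenv("credentials.env")

# Once loaded, the secrets are ordinary process state:
aws_key = os.environ.get("AWS_ACCESS_KEY_ID")
trino_jwt = os.environ.get("TRINO_PASSWD")
```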

It's one thing to safely pass a secret from a GitHub Actions context to a program you want to run, with the guarantee that nobody can see the secret in GitHub. It's another thing to protect the secret from a malicious program that wants to leak it. How can SUPERSECRET be protected from `raise ValueError("attention hackers: S U P E R S E C R E T = " + os.environ["SUPERSECRET"])`?
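To make the threat concrete, here is a minimal sketch of what a PR-introduced notebook cell could do; it uses only the standard library, so nothing in our stack prevents it once the secrets are sitting in the environment:

```python
# Any code running in the CI/CD job can enumerate the environment and surface
# it in output visible on the pull request (logs, tracebacks, cell output).
import os

leaked = {k: v for k, v in os.environ.items()
          if "SECRET" in k or "TOKEN" in k or "KEY" in k}
raise ValueError(f"attention hackers: {leaked}")
```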

The best way to limit the AWS blast radius would be to use public data that doesn't require signed (credentialed) requests to read, make the buckets requester-pays, and top up the CI/CD account with the trivial costs of data access (costs that would be borne anyway under a provider-pays model). Our test data is public data.
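As a sketch of what that read could look like (bucket and key names are hypothetical placeholders, and this assumes the CI/CD account carries only its own low-privilege credentials, which are billed for the transfer):

```python
# Read a public, requester-pays S3 object: the caller's account is billed for
# the transfer, so no data-owner credentials need to live in CI/CD.
# Bucket and key names are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="example-public-test-data",
    Key="dera/num.parquet",
    RequestPayer="requester",
)
payload = obj["Body"].read()
```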

But there's still the Trino JWT token. And to efficiently write data to Trino we need a writable AWS data bucket where parquet files we write to Hive can be teleported to Iceberg parquet. (It's MIND BLOWING that Trino's import capabilities are so primitive and non-performant: the DERA-DATA ingestion takes 3-4 hours using "normal" Trino ingestion and 1-2 minutes using Hive->Trino, but that's another problem.) So...unless and until we can solve the credentials-leak problem, I don't think we should try to make actual ingestion pipelines work with CI/CD. Rather, we need to write scripts that users can run to validate their expectations (things they need to read are readable; if they are writing, what they wrote is sensible) and whose receipts they can commit to "prove" their commits are in working order.
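One possible shape for such a validation script, sketched with hypothetical table names and environment variables, and assuming the trino Python client with JWT authentication; the "receipt" is just a small JSON file the user can commit alongside their change:

```python
# Sketch of a user-run validation script: check that the tables a change
# depends on are readable, then write a small JSON "receipt" that can be
# committed with the change. Catalog/schema/table names are hypothetical.
import json
import os
from datetime import datetime, timezone

import trino

conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=443,
    http_scheme="https",
    auth=trino.auth.JWTAuthentication(os.environ["TRINO_PASSWD"]),
)
cur = conn.cursor()

receipt = {"checked_at": datetime.now(timezone.utc).isoformat(), "tables": {}}
for table in ["osc_datacommons_dev.dera.num", "osc_datacommons_dev.dera.sub"]:
    cur.execute(f"SELECT count(*) FROM {table}")
    receipt["tables"][table] = {"row_count": cur.fetchone()[0]}

with open("validation_receipt.json", "w") as f:
    json.dump(receipt, f, indent=2)
```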

Taking a page from the pytest book, we can have our CI/CD notebook system ignore all files that don't start with `test_` unless you ask nicely. If we change the GitHub Actions to only test notebooks that match `**/test_*.ipynb`, then the writers of notebooks can decide whether they are writing "must run to the end" notebooks (which are pytest-friendly) or name them something that won't match.
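A minimal sketch of that collection convention, assuming the CI/CD job executes notebooks with nbclient; only the `test_` naming pattern matters here:

```python
# Discover and execute only notebooks that opt in via the test_ prefix,
# mirroring pytest's collection convention.
from pathlib import Path

import nbformat
from nbclient import NotebookClient

for path in sorted(Path(".").glob("**/test_*.ipynb")):
    print(f"Executing {path}")
    nb = nbformat.read(str(path), as_version=4)
    NotebookClient(nb, timeout=600).execute()
```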

But ultimately we need to provide clear guidance as to how we (1) protect secrets in our CI/CD system, and (2) limit the blast radius of inevitable failures.

@ModeSevenIndustrialSolutions