substratusai/runbooks: Finetune LLMs on K8s by using Runbooks
https://www.substratus.ai

How should we inject credentials to download Models and Datasets from third-party sources (e.g., huggingface)? #107

Open brandonjbjelland opened 1 year ago

samos123 commented 1 year ago

I think this is getting more important with Llama 2 requiring your HF credentials to download it. Are you all open to passing the token through params for now? That would be the most straightforward for the image to take in and wouldn't require controller work.
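
A minimal sketch of that interim approach, assuming params surface to the container as PARAMS_*-prefixed environment variables (as described in the next comment); the param name and token value here are illustrative, not confirmed API:

kind: Model
spec:
  params:
    epochs: 3
    hugging_face_hub_token: "hf_..."  # would surface as PARAMS_HUGGING_FACE_HUB_TOKEN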

brandonjbjelland commented 1 year ago

☝️ I think this would work today if the image loader had a step of looking for PARAMS_HUGGING_FACE_HUB_TOKEN and exporting it as HUGGING_FACE_HUB_TOKEN prior to snapshot downloads. It's just not great security.

I didn't flesh out the issue details above, but my preference would be to start building a story where data from Secrets can populate one or more params. Having HF or other tokens in plain text on CR specs isn't a story we can bring to market effectively.

nstogner commented 1 year ago

We should discourage secrets from being passed in a model object. My preference would be to have secrets set at the cluster or namespace level, and brought into params via something like:

kind: Model
spec:
  params:
    epochs: 3
    some_token: "{.secrets.huggingface.hub_token}"
brandonjbjelland commented 1 year ago

> We should discourage secrets from being passed in a model object. My preference would be to have secrets set at the cluster or namespace level, and brought into params via something like:
>
> kind: Model
> spec:
>   params:
>     epochs: 3
>     some_token: "{.secrets.huggingface.hub_token}"

To make sure we're on the same page - in this case, you'd expect that there's an existing Secret named huggingface in the same namespace as the Model, and its data contains a JSON blob with a key of hub_token, or perhaps a more toml/ini-style, newline-separated hub_token=tokenval.
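
Under that first interpretation, the referenced Secret might look like the sketch below (the namespace and token value are assumptions; the name and key come from the example above):

apiVersion: v1
kind: Secret
metadata:
  name: huggingface   # matches the {.secrets.huggingface...} prefix
  namespace: default  # assumed: same namespace as the Model
stringData:
  hub_token: "hf_..."  # resolved by {.secrets.huggingface.hub_token}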

I think this would make a lot less sense, but an alternative interpretation of the above could have a hub_token Secret living in a huggingface namespace, where we'd expect the raw data of the Secret to be set as PARAMS_SOME_TOKEN at runtime (i.e., not expected to contain a k/v data structure). Crossing namespaces here seems unlikely to be what you meant - just verifying the intent.

brandonjbjelland commented 1 year ago

TL;DR - the Secrets Store CSI driver is an option.

Pros: this solution actually protects secrets via IAM and has them live outside the cluster.
Cons: some legwork; doesn't solve multi-cloud Substratus scenarios; grows the complexity of secret input via params.


There would be a good amount of work to orchestrate on the user's behalf, AND it would tend us toward an even more complex representation of env vars (something we're optimizing away from), but the Secrets Store CSI driver is my preferred way to inject secrets into any old workload on GKE: the same technique is portable to Cloud Run, IAM-protected, can persist independent of clusters, and is global instead of cluster-scoped.

This would have our users putting secrets in GCP Secret Manager, Azure Key Vault, or AWS Systems Manager Parameter Store, which can be shuffled into the pod at runtime. A user declares a SecretProviderClass and injects it into their workload via volumes. IIRC you can do something similar to push the data into env vars (it looks similar to injecting env vars from ConfigMaps).
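
For the GKE flavor, a sketch of the two pieces involved (the SecretProviderClass schema is from the Secrets Store CSI driver's GCP provider; the names, project, secret path, and workload are assumptions for illustration):

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: huggingface-token  # hypothetical name
spec:
  provider: gcp
  parameters:
    secrets: |
      - resourceName: "projects/my-project/secrets/hf-token/versions/latest"
        path: "hub_token"
---
# Mounted into the workload's pod via a CSI volume:
apiVersion: v1
kind: Pod
metadata:
  name: model-loader  # hypothetical workload
spec:
  containers:
    - name: loader
      image: example.com/loader:latest  # hypothetical image
      volumeMounts:
        - name: secrets
          mountPath: /var/secrets
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: huggingface-token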

For this to work, the CSI driver would need to be installed, the driver's provider for the current cloud platform would need to be installed on the cluster, and the workload's pod SA would need a role permitting it to read (decrypt) the secret.

I think at the point where we want to support multi-cluster deployments within a single cloud provider, this path starts to make a lot more sense.