tweag / chainsail

Replica Exchange sampling as-a-service
MIT License

Make pods wait for user code containers to be actually ready #491

Open simeoncarstens opened 3 months ago

simeoncarstens commented 3 months ago

This is a work-in-progress attempt to solve #386. The current strategy is to add a probe to the user code containers that succeeds only once the gRPC services for log-prob / gradient evaluation are ready to respond in a timely manner. The probe is currently a startup probe, although a readiness probe is probably the more appropriate choice. Once the probe succeeds, that container is deemed ready / started, and the pod can be considered ready.
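To make the idea of such a probe concrete, here is a minimal sketch of a check that an exec-style probe could run inside a user code container. It is not the code in this PR: the function names, the default port `50051`, and the timeout are assumptions for illustration. It also only verifies that something accepts TCP connections on the port. A probe that really confirms "ready to respond in a timely manner" would have to issue an actual log-prob or gradient request with a deadline.

```python
import socket


def user_code_is_ready(host: str = "127.0.0.1", port: int = 50051,
                       timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to the user code server succeeds in time.

    Hypothetical helper: host, port, and timeout are illustrative defaults,
    not values taken from chainsail.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def probe_exit_code(ready: bool) -> int:
    """Map readiness to a process exit code; exec probes treat 0 as success."""
    return 0 if ready else 1
```

A script wrapping this would end with something like `sys.exit(probe_exit_code(user_code_is_ready()))`, so that a non-zero exit code marks the container as not yet ready.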

One possible pitfall is that the controller pod may also run a user code container that is actually used in the calculation. The controller container should therefore start sending out sampling requests only once all user code containers in the other pods are ready and the user code container in the controller pod itself is ready, too.
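The gating condition described here could be sketched roughly as follows, assuming the controller has one readiness check per user code container, including its own local one. The function `wait_for_all_ready` and the check names are hypothetical and only illustrate the logic. They do not describe how the chainsail controller is implemented.

```python
import time
from typing import Callable, Dict, List


def wait_for_all_ready(checks: Dict[str, Callable[[], bool]],
                       timeout_s: float = 60.0,
                       interval_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Poll until every check passes or the timeout expires.

    Returns the sorted names of checks that are still failing
    (an empty list means everything is ready).
    """
    pending = dict(checks)
    deadline = clock() + timeout_s
    while True:
        # Keep only the checks that still fail; a check that passed once
        # is not polled again in this simple sketch.
        pending = {name: check for name, check in pending.items() if not check()}
        if not pending or clock() >= deadline:
            return sorted(pending)
        sleep(interval_s)
```

With checks for the remote pods and for the controller pod's own user code container passed in together, the controller would begin sampling only when the returned list is empty, and could report the remaining names otherwise.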

I'm not sure whether a startup probe is enough, or whether we need an init container on the controller pod that delays the start of all controller pod containers until all other pods are ready.