job logs: indicate sha1 of running image

tiborsimko commented 2 years ago

Current behaviour

When users rely on non-semantically-versioned environment images such as myenvironment:latest or myenvironment:master and then update the image under the same tag, the cluster nodes won't pull the new version because of the usual IfNotPresent image pull policy.

It can then happen that some cluster nodes have the "old" version of the image while other cluster nodes have the "new" version, leading to seemingly random workflow run failures.

Currently, it is not easy for users to detect these situations, because REANA does not expose in the job logs exactly which image sha1 was used for the job. The cluster administrators can check and rectify this easily by removing the image on the nodes, which forces a re-pull of the image for the next run, for example by running the following one-liner:

$ for node in $(kubectl get nodes --no-headers -l reana.io/system=runtimejobs | awk '{print $1}'); do ssh -q -i ~/.ssh/myaccount.pem -o StrictHostKeyChecking=no "core@$node" 'sudo crictl rmi myenvironment:latest'; done
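Before removing images, an administrator might first want to confirm that the nodes really disagree about the image. The following is only a sketch: it assumes that job pods using the image still exist, that they are single-container pods, and that the container runtime reports a digest in the pod's `status.containerStatuses[].imageID` field. The helper names `list_pod_images` and `digests_for_image` are hypothetical and not part of REANA.

```shell
# Hypothetical helper: print "node<TAB>image<TAB>imageID" for every pod in
# the current namespace. Assumes single-container pods (index 0) and needs
# cluster access.
list_pod_images() {
    kubectl get pods -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.containers[0].image}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
}

# Hypothetical helper: read "node<TAB>image<TAB>imageID" lines on stdin and
# print the distinct image IDs recorded for the image given as $1.
# More than one output line means that nodes ran different image versions.
digests_for_image() {
    awk -F '\t' -v image="$1" '$2 == image && $3 != "" { print $3 }' | sort -u
}

# Usage (requires cluster access):
#   list_pod_images | digests_for_image myenvironment:latest
```

If the last command prints a single line, all pods ran the same image version and the failures probably have another cause.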

However, we can perhaps do something better to help the users.

Expected behaviour

Ideally we should display in the job logs that the job was run using image myenvironment:latest with sha1 of such and such value:

==> Workflow ID: 29f6859f-1389-4266-98f2-41df346cc000
==> Compute backend: Kubernetes
==> Job ID: reana-run-job-261a4396-5ffb-4e9b-953e-3be52a0faa18
==> Docker image: myenvironment:latest (9259e42215ab)

We could perhaps even consider exposing the name of the node where the job runs, which could be useful in forensics, for example when CephFS CSI plugins are down on some nodes.
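As a rough illustration of where these values could come from, here is a sketch using kubectl. It is not how reana-job-controller is implemented: the helper names are hypothetical, the pod is assumed to have a single container, and the runtime is assumed to report an `imageID` ending in a digest (normally of the form `sha256:<hex>`), which the sketch simply truncates to 12 characters to match the log line proposed above.

```shell
# Hypothetical helper: shorten an imageID such as
# "docker.io/library/myenvironment@sha256:9259e42215ab..." to the first
# 12 characters of its digest.
short_digest() {
    digest="${1##*:}"                      # drop everything up to the last colon
    printf '%s\n' "$digest" | cut -c1-12   # keep the first 12 characters
}

# Hypothetical helper: print the proposed log line for image $1 and imageID $2.
image_log_line() {
    printf '==> Docker image: %s (%s)\n' "$1" "$(short_digest "$2")"
}

# Possible sources of the values for the pod named $1 (needs cluster access;
# assumes a single-container pod).
pod_image_id() {
    kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[0].imageID}'
}
pod_node_name() {
    kubectl get pod "$1" -o jsonpath='{.spec.nodeName}'
}

# Usage (requires cluster access):
#   image_log_line myenvironment:latest "$(pod_image_id "$pod")"
```

The node name from `pod_node_name` could be printed as a further `==>` line if exposing it is considered acceptable.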

VMois commented 2 years ago

suggestion: Alternatively, we could change IfNotPresent to Always. k8s will then resolve the image digest (hash) from the registry: if an image with that digest is cached locally, it will use the local image; if it is not cached or the digests differ, it will pull the new image from the registry (docs).

If Always is used, it will probably add some overhead on the k8s nodes, since they have to query the registry to check whether the cached image is the same as the one in the registry (one HTTP request, I guess). I am not sure how much this would affect the pod start-up time.

As for adding the image tag and digest to the logs, I think it is a good idea overall. I am not so sure about exposing the node names, as it could potentially be a security issue (?).

tiborsimko commented 2 years ago

Always will bring some overhead, which may be considerable in the case of multi-GiB-large particle physics images... Hence we opted for IfNotPresent as the default, together with promoting semantic versioning of docker images, which is the best way to ensure reproducibility anyway! The reana-client validate command also checks for the most commonly used latest tag, but it doesn't catch everything. So yes, hopefully we can stay on IfNotPresent... But switching to Always via helm values is always an option.

reanahub / reana-job-controller

job logs: indicate sha1 of running image #349