singularityhub / singularity-hpc

Local filesystem registry for containers (intended for HPC) using Lmod or Environment Modules. Works for users and admins.
https://singularity-hpc.readthedocs.io
Mozilla Public License 2.0

Error on Singularity Pull for NVIDIA TensorFlow Container #672

Open visakhraja opened 5 months ago

visakhraja commented 5 months ago

Installing the nvcr.io/nvidia/tensorflow:24.02-tf2-py3-igpu container with SHPC (Singularity Registry HPC) fails.

Error log:

singularity pull --name /p/home/jusers/sivaprasad1/jureca/easybuild/jurecadc/modules/containers/nvcr.io/nvidia/tensorflow/24.02-tf2-py3-igpu/nvcr.io-nvidia-tensorflow-24.02-tf2-py3-igpu-sha256:3de8a232b25d658d7c5ae34c4fa04d1a9823b0a681636c8864f76d109a9528c9.sif docker://nvcr.io/nvidia/tensorflow@sha256:3de8a232b25d658d7c5ae34c4fa04d1a9823b0a681636c8864f76d109a9528c9
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
FATAL: While making image from oci registry: error fetching image to cache: while building SIF from layers: conveyor failed to get: while fetching image: initializing source oci:/p/home/jusers/sivaprasad1/jureca/.apptainer/cache/blob:c0cd6cdc1f956b77ac8ce780ac33b216cb41449d438966dd51f487a853ee0578: choosing an image from manifest list docker://nvcr.io/nvidia/tensorflow@sha256:3de8a232b25d658d7c5ae34c4fa04d1a9823b0a681636c8864f76d109a9528c9: no image found in manifest list for architecture amd64, variant "", OS linux

Traceback (most recent call last):
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/bin/shpc", line 8, in <module>
    sys.exit(run_shpc())
             ^^^^^^^^^^
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/client/__init__.py", line 556, in run_shpc
    main(args=args, parser=parser, extra=extra, subparser=helper)
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/client/install.py", line 27, in main
    cli.install(
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/main/modules/base.py", line 467, in install
    if not module.container_path:
           ^^^^^^^^^^^^^^^^^^^^^
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/main/modules/module.py", line 146, in container_path
    return self.add_container()
           ^^^^^^^^^^^^^^^^^^^^
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/main/modules/module.py", line 94, in add_container
    self._container_path = self.container.registry_pull(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/main/container/singularity.py", line 258, in registry_pull
    self.pull(container_uri, container_path)
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/main/container/singularity.py", line 334, in pull
    return self._pull_regular(uri, dest)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/shpc/main/container/singularity.py", line 347, in _pull_regular
    for line in lines:
  File "/p/software/jurecadc/stages/2024/software/shpc/0.1.26-GCCcore-12.3.0/lib/python3.11/site-packages/spython/utils/terminal.py", line 148, in stream_command
    raise subprocess.CalledProcessError(return_code, cmd)
subprocess.CalledProcessError: Command '['singularity', 'pull', '--name', '/p/home/jusers/sivaprasad1/jureca/easybuild/jurecadc/modules/containers/nvcr.io/nvidia/tensorflow/24.02-tf2-py3-igpu/nvcr.io-nvidia-tensorflow-24.02-tf2-py3-igpu-sha256:3de8a232b25d658d7c5ae34c4fa04d1a9823b0a681636c8864f76d109a9528c9.sif', 'docker://nvcr.io/nvidia/tensorflow@sha256:3de8a232b25d658d7c5ae34c4fa04d1a9823b0a681636c8864f76d109a9528c9']' returned non-zero exit status 255.

Support @surak

vsoch commented 5 months ago

It’s telling you it doesn’t have an architecture that matches for that digest. Did you read the error message?
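For anyone reading along, here is a rough sketch of what that message means. A multi-arch reference resolves to a manifest list (an OCI image index), and the pull picks the entry whose platform matches the host. The helper names and the sample index below are made up for illustration (this is not shpc or Singularity code), and the digest is deliberately elided:

```python
def platforms(index: dict) -> list:
    """Return the (os, architecture) pairs advertised by a manifest list / OCI index."""
    pairs = []
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform", {})
        pairs.append((platform.get("os", ""), platform.get("architecture", "")))
    return pairs


def supports(index: dict, os_name: str = "linux", arch: str = "amd64") -> bool:
    """True if the index has an entry for the requested OS and architecture."""
    return (os_name, arch) in platforms(index)


# Hypothetical index with a single arm64 entry, mimicking an arm64-only tag.
example_index = {
    "schemaVersion": 2,
    "manifests": [
        {
            "digest": "sha256:<elided>",
            "platform": {"os": "linux", "architecture": "arm64"},
        },
    ],
}

print(platforms(example_index))  # [('linux', 'arm64')]
print(supports(example_index))   # False: no linux/amd64 entry to choose
```

With no linux/amd64 entry in the list, there is nothing for an x86_64 host to select, which is the condition the FATAL line reports.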

surak commented 5 months ago

Hi @vsoch!

The problem is that this tag is the default "latest" for NVIDIA's TensorFlow container. So if one runs shpc install nvcr.io/nvidia/tensorflow on an x86_64 machine, it fails, and I find it hard to believe that no x86_64 image is available for it.

docker: nvcr.io/nvidia/tensorflow
url: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow/tags
maintainer: '@vsoch'
description: TensorFlow is an open-source software library for high-performance numerical
  computation. Its flexible architecture allows easy deployment of computation across
  a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers
  to mobile and edge devices.
latest:
  24.02-tf2-py3-igpu: sha256:3de8a232b25d658d7c5ae34c4fa04d1a9823b0a681636c8864f76d109a9528c9

Checking the NVIDIA website, there is a 24.02-tf2-py3 tag and a 24.02-tf2-py3-igpu tag; the latter is arm64 only.

vsoch commented 5 months ago

There are over 8K containers in the registry, and they are added in an automated fashion; indeed, we don't check which architectures a tag supports. If you'd like to open a PR to the registry to remove this tag and choose a better one, or just select another tag yourself, please feel free.
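To make the missing check concrete, a minimal sketch of an architecture-aware choice of "latest" could look like the following. This is not what the registry automation does today; pick_latest and the tag-to-architecture mapping are hypothetical, and the amd64 entry for 24.02-tf2-py3 is an assumption made only for the example:

```python
from typing import Dict, Optional, Set


def pick_latest(candidates: Dict[str, Set[str]], arch: str = "amd64") -> Optional[str]:
    """Return the highest-sorting tag that advertises `arch`, or None if none does."""
    usable = [tag for tag, archs in candidates.items() if arch in archs]
    return max(usable) if usable else None


# Hypothetical input: tag -> architectures found in its manifest list.
# The arm64-only entry follows the thread; the amd64 entry is assumed.
candidate_tags = {
    "24.02-tf2-py3-igpu": {"arm64"},
    "24.02-tf2-py3": {"amd64"},
}

print(pick_latest(candidate_tags))           # 24.02-tf2-py3
print(pick_latest(candidate_tags, "arm64"))  # 24.02-tf2-py3-igpu
```

Plain string ordering stands in for real version sorting here, so a proper implementation would need a better comparison.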

surak commented 5 months ago

Ah, ok!