google-deepmind / alphafold

Open source code for AlphaFold.

Alphafold stopped using GPU after upgrade to Ubuntu 24.04.1 (noble) #1035

Open hofmank0 opened 1 week ago

hofmank0 commented 1 week ago

I have been running the current version of AlphaFold successfully on a Linux machine with Ubuntu 22.04 and an Nvidia 4090 GPU. After some automatic update (graphics driver, maybe?), it mysteriously became very slow, probably running in CPU mode. I decided to upgrade the machine to 24.04.1 (noble), which was probably a bad idea: it made matters worse. In the end I installed everything from scratch:

- Nvidia graphics driver 560.35.03 (open)
- Docker 27.3.1
- Nvidia container toolkit 1.16.2-1
- the usual stuff that comes with Ubuntu 24.04.1 (Python 3.12.3, gcc 13.2.0)

I followed the instructions, as I did for the previous successful installations. The first problem was with the Dockerfile, covered in https://github.com/google-deepmind/alphafold/issues/945. I 'fixed' it by following the comments of "rosswalker", which also worked for others. This allowed me to build the Docker image. The databases were still installed; I just edited run_docker.py to set the database and output paths.

However, here is the big problem: when running AlphaFold, I got the following error message:

I1024 21:50:40.124427 127699774173312 run_docker.py:260] I1024 19:50:40.123955 136147855569536 xla_bridge.py:863] Unable to initialize backend 'cuda': jaxlib/cuda/versions_helpers.cc:98: operation cuInit(0) failed: Unknown CUDA error 303; cuGetErrorName failed. This probably means that JAX was unable to load the CUDA libraries.

I know nothing about Docker and really don't know what I am doing. This message looks like CUDA is not found (?). Before that line there was another suspicious output, but I am not sure if it is related:

I1024 21:50:38.367850 127699774173312 run_docker.py:260] /bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)

Maybe someone has solved this problem or can share a Dockerfile that works on Ubuntu 24.04.1 with a contemporary (4xxx) Nvidia GPU? Or maybe I should use another version of Docker or CUDA?

MamfTheKramf commented 1 day ago

We (my team and I) had a similar issue. After trying different things, we were pretty sure it was caused by the Docker SDK for Python requirement: the required version is 5.0.0, which is quite old compared to the Docker version you are using.

What probably works (our solution is a lot more complicated, so I would try this first) is just bumping the version of the docker package in requirements.txt and rebuilding. If that already works, great; if it doesn't, our full solution follows below the sketch.
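The simple fix would look roughly like this (a sketch only; the file path and the newer version number are from memory / assumptions, so check your checkout and PyPI for the current pin):

```
# docker/requirements.txt -- host-side requirements used by run_docker.py
docker==7.1.0   # previously pinned to docker==5.0.0
```

After that, reinstall the host requirements (pip3 install -r docker/requirements.txt) and try running AlphaFold again.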

You can find the place where the alphafold container is created in run_docker.py: first a client object is created, and then the equivalent of a docker run command is called. Our guess is that the Docker server API changed slightly, so that everything seems to work at first, but mounting the GPUs into the container fails. (As a quick check, start AlphaFold as you normally would and then run docker exec -it <alphafold-container-id> bash; once inside the container, run nvidia-smi. It will probably throw an error. But when you start the container yourself with docker run --rm --entrypoint bash -it --gpus all <alphafold-image> and then run nvidia-smi, it should work.)
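If you prefer to run the same check from Python via the Docker SDK (the path run_docker.py takes), something like the sketch below should reproduce it. The image tag alphafold is an assumption; use whatever tag you built. If the SDK/daemon mismatch is the problem, this is the call that fails to attach the GPUs and nvidia-smi will error out.

```python
# Sketch: ask the Docker SDK to start the AlphaFold image with GPUs attached
# and run nvidia-smi instead of the normal entrypoint.
import docker

client = docker.from_env()
device_requests = [
    docker.types.DeviceRequest(driver='nvidia', capabilities=[['gpu']])
]
output = client.containers.run(
    'alphafold',              # assumed image tag; adjust to your own
    entrypoint='nvidia-smi',  # override the normal AlphaFold entrypoint
    device_requests=device_requests,
    remove=True,
)
print(output.decode())
```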

The solution is to construct the corresponding docker run command and run it by calling subprocess.run. The code below should work (I don't have access to our actual code right now and don't have a machine to test it on); it's a modification of the code in run_docker.py mentioned above:

  command_args.extend([
      f'--output_dir={output_target_path}',
      f'--max_template_date={FLAGS.max_template_date}',
      f'--db_preset={FLAGS.db_preset}',
      f'--model_preset={FLAGS.model_preset}',
      f'--benchmark={FLAGS.benchmark}',
      f'--use_precomputed_msas={FLAGS.use_precomputed_msas}',
      f'--num_multimer_predictions_per_model={FLAGS.num_multimer_predictions_per_model}',
      f'--models_to_relax={FLAGS.models_to_relax}',
      f'--use_gpu_relax={use_gpu_relax}',
      '--logtostderr',
  ])

  # --- new code starts here ---

  cmd_parts = [
      'docker', 'run',
      '--rm',                         # equivalent of remove=True
      '--detach',                     # equivalent of detach=True
      f'--user={FLAGS.docker_user}',  # equivalent of user=FLAGS.docker_user
      # Set the environment variables. Each flag must either use the '=' form
      # or be split into separate list elements, since every element of
      # cmd_parts becomes one argv entry for subprocess.run.
      f'--env=NVIDIA_VISIBLE_DEVICES={FLAGS.gpu_devices}',
      '--env=TF_FORCE_UNIFIED_MEMORY=1',
      '--env=XLA_PYTHON_CLIENT_MEM_FRACTION=4.0',
      '--gpus=all',                   # to use GPUs in the container
  ]
  # Set the volume bindings. The mounts in run_docker.py are docker.types.Mount
  # objects, which subclass dict (see the note below about the key casing).
  for mount in mounts:
    mnt_str = f"{mount['Source']}:{mount['Target']}"
    if mount.get('ReadOnly'):
      mnt_str += ':ro'
    cmd_parts.append(f'--volume={mnt_str}')
  # Specify the docker image.
  cmd_parts.append(FLAGS.docker_image_name)
  # Specify the command args.
  cmd_parts.extend(command_args)

  # Print the command for debugging purposes (if you can't see it in the
  # output, use logging.info instead of logging.debug).
  logging.debug('Run command: %s', ' '.join(cmd_parts))
  import subprocess  # Better placed at the top of run_docker.py.
  # You probably want some error handling here; check=True raises if docker
  # run fails, but you may also want to print result.stderr.
  result = subprocess.run(cmd_parts, capture_output=True, check=True)
  # `docker run --detach` prints the new container id; strip the trailing
  # newline so it can be passed to client.containers.get() below.
  container_id = result.stdout.decode().strip()

  client = docker.from_env()
  container = client.containers.get(container_id)

  # covered by --gpus all argument
  # device_requests = [
  #     docker.types.DeviceRequest(driver='nvidia', capabilities=[['gpu']])
  # ] if FLAGS.use_gpu else None

  # container = client.containers.run(
  #     image=FLAGS.docker_image_name,
  #     command=command_args,
  #     device_requests=device_requests,
  #     remove=True,
  #     detach=True,
  #     mounts=mounts,
  #     user=FLAGS.docker_user,
  #     environment={
  #         'NVIDIA_VISIBLE_DEVICES': FLAGS.gpu_devices,
  #         # The following flags allow us to make predictions on proteins that
  #         # would typically be too long to fit into GPU memory.
  #         'TF_FORCE_UNIFIED_MEMORY': '1',
  #         'XLA_PYTHON_CLIENT_MEM_FRACTION': '4.0',
  #     })

  # --- new code ends here ---

  # Add signal handler to ensure CTRL+C also stops the running container.
  signal.signal(signal.SIGINT,
                lambda unused_sig, unused_frame: container.kill())

Note: The mounts created in run_docker.py are docker.types.Mount objects, which subclass dict; the loop above assumes PascalCase keys ('Source', 'Target', 'ReadOnly'), but I'm not 100% sure right now, so double-check against your version of the Docker SDK if the volume bindings come out wrong.
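One nice side effect of fetching the container via client.containers.get(container_id) is that the rest of run_docker.py should keep working unchanged: container is still a regular Container object, so the SIGINT handler above and (if I remember the script correctly) the log-streaming loop that follows it (for line in container.logs(stream=True): ...) don't need to be touched.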