usegalaxy-au / tools-au

A home for local tool wrappers on Galaxy Australia and a testing ground for changes to existing wrappers
MIT License

Alphafold cannot utilize GPU #11

Closed · neoformit closed this issue 2 years ago

neoformit commented 2 years ago

We are having some issues running Alphafold on our cloud GPU node. Alphafold runs, but in the logs we can see that the GPU is not being recognised, most likely due to incorrect or missing CUDA and/or jaxlib libraries.

We checked back through the logs on the development GPU machine (EU) and the same issue started occurring on that machine last month; we never noticed because the tool run completes on CPU alone:

2022-01-12 02:42:31.954225: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0112 02:42:32.218000 23117042176832 xla_bridge.py:244] Unable to initialize backend 'gpu': FAILED_PRECONDITION: No visible GPU devices.
I0112 02:42:32.219328 23117042176832 xla_bridge.py:244] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
W0112 02:42:32.219516 23117042176832 xla_bridge.py:248] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

We are currently working to resolve this issue.
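
For anyone debugging the same `CUDA_ERROR_UNKNOWN` / "No visible GPU devices" symptom, a minimal diagnostic sketch is below. It assumes Docker with the NVIDIA runtime on the host; `neoformit/alphafold:dev` is a placeholder tag for whichever image is being tested, and it is assumed to ship Python with jax installed.

```bash
# 1. Can containers see the GPU driver at all?
docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi

# 2. Does JAX inside the AlphaFold image pick a GPU backend, or fall back to CPU?
#    (neoformit/alphafold:dev is a placeholder tag)
docker run --rm --gpus all neoformit/alphafold:dev \
    python -c "import jax; print(jax.default_backend(), jax.devices())"
```

If step 1 fails, the problem is on the host (driver or container toolkit); if step 1 works but step 2 reports `cpu`, the CUDA/jaxlib combination inside the image is the likely culprit.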

neoformit commented 2 years ago

@bgruening

bgruening commented 2 years ago

I think you need to have the NVIDIA container toolkit installed on your host. https://github.com/usegalaxy-eu/pulsar-network-docs/blob/71ee9918e690185f9741da710e76f16cbef57f0f/source/topics/gpus.rst

To my understanding, no CUDA is then needed inside the container. Is that correct @gmauro?
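
As a rough sketch of that host-side setup (the linked docs are authoritative; exact package names and repositories depend on the distribution), the idea is to install the NVIDIA container toolkit alongside the NVIDIA driver and restart Docker, so the driver is passed through and the image itself does not need to bundle it:

```bash
# Host-side sketch (Debian/Ubuntu), assuming the NVIDIA driver and the
# NVIDIA container toolkit apt repository are already configured.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Sanity check: the GPU should now be visible from inside a plain CUDA image.
docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi
```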

gmauro commented 2 years ago

@neoformit can you run `docker run --gpus all nvidia/cuda:10.1-base nvidia-smi` on your GPU node?

neoformit commented 2 years ago

@gmauro Yep, the cloud GPU seems to be working fine. Looking back at the tool stderr we also had the same issue on your EU node starting from January. So it seems to be an issue with the container. We have rebuilt the container and that now seems to be working on your EU node - will push updates shortly if it all checks out.

bgruening commented 2 years ago

Please keep in mind that containers are also cached, so if you do not change the name or version we/you might use an older cached container.
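
A sketch of how one might force a refresh when the image reference has not changed (the tags below are illustrative):

```bash
# Docker: re-pull the tag and check whether the digest actually changed.
docker pull neoformit/alphafold:latest
docker images --digests neoformit/alphafold

# Singularity: cached SIF builds are kept on disk and will be reused,
# so clear the cache to force a rebuild from the updated Docker image.
singularity cache clean -f
```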

neoformit commented 2 years ago

We have now resolved this issue locally using a new container build. I will push an update to the Toolshed next week with our recent revisions, and the updated docker image reference.

Running on the Azure GPU cloud took even more configuration and cycling through container versions... we couldn't get it to run under Galaxy's Singularity (though CLI Singularity works fine) and fell back on using a Docker runner, which seems to have fewer issues. We could possibly write some documentation on this if anyone is interested in cloud deployment.

martenson commented 2 years ago

Documentation is always most welcome, thank you @neoformit!

bgruening commented 2 years ago

@neoformit do you have error logs? Have you compared the singularity run command of Galaxy to your local run command that worked?
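
One way to do that comparison (a sketch, not the exact Galaxy invocation): run the image by hand with `--nv`, which bind-mounts the host NVIDIA driver into the container, then check whether the command line Galaxy generates for the job includes the same flag.

```bash
# Manual run with GPU passthrough; if this works but the Galaxy-generated
# singularity command lacks --nv, the container will only ever see the CPU.
singularity exec --nv docker://nvidia/cuda:10.1-base nvidia-smi
```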

neoformit commented 2 years ago

@bgruening there was nothing useful that we could see in the system logs. Alphafold's tool stderr was similar to what I originally posted - it works, but can't access the GPU driver for an "unknown" reason. We did try comparing the CLI and Galaxy run environments and updated a few Singularity environment variables and Slurm options in the job conf but nothing worked.

neoformit commented 2 years ago

We have resolved this issue with a new Docker image with updated CUDA and JAX dependencies. I just updated the Toolshed with a new Docker tag in the tool XML, so there should be no issues with cached containers. I also reinstated the "working dir hack", such that it only copies the working dir if this bug has occurred. @bgruening - this is apparently not isolated to your GPU dev node, as we encountered the issue again while deploying to the Azure cloud. Perhaps it is part of our Pulsar deployment, but we haven't noticed an issue like this in our other tools and can't see any deployment config that might cause it.

bgruening commented 2 years ago

@neoformit can you please file a bug report about this working-dir hack? It's a bug and we should understand it and fix it.

Thanks for all your work.

neoformit commented 2 years ago

> @neoformit can you please file a bug report about this working-dir hack? It's a bug and we should understand it and fix it.
>
> Thanks for all your work.

I can do that, though I feel like this one could be a nightmare to recreate!

bgruening commented 2 years ago

Probably :) I have not seen this in any Pulsar job. So maybe it's something specific to Docker/Pulsar? Have you tried submitting Alphafold to your other non-GPU Pulsar nodes?

bgruening commented 2 years ago

Please always version your containers and pin them to a specific version, e.g. change `neoformit/alphafold:latest` to `neoformit/alphafold:0.1`.
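
A sketch of what that pinning can look like in practice (the `0.1` tag is the illustrative example from above); pinning to a digest is stricter still, since rebuilds of the same tag cannot then silently change what gets pulled:

```bash
# Pull the pinned tag and record its immutable digest for reference.
docker pull neoformit/alphafold:0.1
docker inspect --format '{{index .RepoDigests 0}}' neoformit/alphafold:0.1
```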

This seems to have broken our setup; your new container is producing some trouble. Do you still have an old version around?

martenson commented 2 years ago

@bgruening if there were no extra changes, could the old container maybe correspond to the older Dockerfile? (before https://github.com/usegalaxy-au/galaxy-local-tools/commit/78302ce1d79058f37b24c7b395de450f42631260)

bgruening commented 2 years ago

Ok, I might have fixed it. The problem was that the old repo, neoformit/alphafold-galaxy, completely disappeared. I hacked this into our tool for now.

@martenson can you maybe test? Thanks.

martenson commented 2 years ago

https://github.com/neoformit/alphafold-galaxy

bgruening commented 2 years ago

Sorry, I was referring to the Docker Hub repo.

martenson commented 2 years ago

Oh, you mean docker.io/neoformit/alphafold-galaxy -- why is it reaching there? Shouldn't it use https://hub.docker.com/r/neoformit/alphafold/ ?

bgruening commented 2 years ago

https://github.com/usegalaxy-au/galaxy-local-tools/commit/db594d89256db762341e3b81688418dfb7142891

martenson commented 2 years ago

Oh, so it is deeper than just using `latest`: the whole repo changed. Thanks for explaining.

bgruening commented 2 years ago

Seems so, yes.

neoformit commented 2 years ago

Sorry for the poor communication. I thought it better to make a new Docker Hub repo, as the new image is not Galaxy-specific. Since the old one was buggy I thought it best to blow it away, but that was obviously premature and inconsiderate! Again, sorry about that; sloppy work on my part.

The latest Toolshed version points to the neoformit/alphafold image and contains a new flag, --gpu_relax, that is required by the latest Alphafold version. Can you update to the latest Toolshed version?

Good point on the latest tag, I'll fix that tomorrow morning and push another update to the toolshed.

neoformit commented 2 years ago

> Probably :) I have not seen this in any Pulsar job. So maybe it's something specific to Docker/Pulsar? Have you tried submitting Alphafold to your other non-GPU Pulsar nodes?

That's not a bad idea, I'll create the issue and see if we can do some digging to add any info.

neoformit commented 2 years ago

https://github.com/galaxyproject/pulsar/issues/296

bgruening commented 2 years ago

Thanks!