@bgruening
I think you need to have the NVIDIA Container Toolkit installed on your host. https://github.com/usegalaxy-eu/pulsar-network-docs/blob/71ee9918e690185f9741da710e76f16cbef57f0f/source/topics/gpus.rst
To my understanding, no CUDA is then needed in the container. Is that correct @gmauro?
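For reference, a rough sketch of installing the NVIDIA Container Toolkit on a Debian/Ubuntu host, following NVIDIA's standard (older, apt-key based) instructions; the linked Pulsar docs remain the authoritative guide, and the host NVIDIA driver must already be installed:

```bash
# Sketch: install the NVIDIA Container Toolkit on a Debian/Ubuntu host.
# Assumes the NVIDIA driver is already present on the host.
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -sL https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -sL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker   # so Docker picks up the NVIDIA runtime
```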
@neoformit can you run `docker run --gpus all nvidia/cuda:10.1-base nvidia-smi` on your GPU node?
@gmauro Yep, the cloud GPU seems to be working fine. Looking back at the tool stderr, we also had the same issue on your EU node starting from January, so it seems to be an issue with the container. We have rebuilt the container and that now seems to be working on your EU node - will push updates shortly if it all checks out.
Please keep in mind that containers are also cached, so if you do not change the name or version we/you might use an older cached container.
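For anyone hitting this, a rough sketch of forcing a fresh container on the compute node (image name and tag as used later in this thread; the cache command assumes Singularity 3.x):

```bash
# Sketch: make sure a rebuilt image is actually used rather than a cached copy.
# Docker: pull the tag explicitly so the local image is refreshed.
docker pull neoformit/alphafold:latest
# Singularity: clear cached converted images so the next run re-pulls them.
singularity cache clean --force
```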
We have now resolved this issue locally using a new container build. I will push an update to the Toolshed next week with our recent revisions, and the updated docker image reference.
Running on the Azure GPU cloud took even more configuration and container versions... we couldn't get it to run under Galaxy's Singularity (though Singularity on the CLI works fine) and fell back on using a Docker runner, which seems to have fewer issues. We could possibly write some documentation on this if anyone is interested in cloud deployment.
Documentation is always most welcome, thank you @neoformit!
@neoformit do you have error logs? Have you compared the Singularity run command generated by Galaxy to your local run command that worked?
@bgruening there was nothing useful that we could see in the system logs. Alphafold's tool stderr was similar to what I originally posted - it works, but can't access the GPU driver for an "unknown" reason. We did try comparing the CLI and Galaxy run environments, and updated a few Singularity environment variables and Slurm options in the job conf, but nothing worked.
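As a concrete check when comparing the two environments, one could verify GPU visibility from Singularity the same way as from Docker (a sketch only; assumes Singularity with the `--nv` flag is available on the GPU node):

```bash
# Sketch: check that the host GPU is visible inside a Singularity container.
# --nv binds the host NVIDIA driver libraries into the container; if Galaxy's
# generated command line lacks it, the GPU will not be visible to the tool.
singularity exec --nv docker://nvidia/cuda:10.1-base nvidia-smi
```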
We have resolved this issue with a new Docker image with updated CUDA and JAX dependencies. I have just updated the Toolshed with a new Docker tag in the tool XML, so there should be no issues with cached containers. I also reinstated the "working dir hack", such that it will only copy the working dir if this bug has occurred. @bgruening - this is apparently not isolated to your GPU dev node, as we encountered the issue again while deploying to the Azure cloud. Perhaps it is part of our Pulsar deployment, but we haven't noticed an issue like this in our other tools and can't see any deployment config that might cause it.
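For completeness, a hedged sketch of confirming that the CUDA/JAX stack inside a rebuilt image can actually see the GPU (assumes the image has no restrictive entrypoint and that `python3` and `jax` are available inside it, which should be the case for an Alphafold image):

```bash
# Sketch: confirm JAX can reach the GPU inside the rebuilt image.
# Should list a GPU device; if only a CPU device is printed, JAX has
# fallen back to CPU and cannot reach the driver.
docker run --rm --gpus all neoformit/alphafold \
  python3 -c "import jax; print(jax.devices())"
```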
@neoformit can you please file a bug report about this working-dir hack? It's a bug and we should understand it and fix it.
Thanks for all your work.
I can do that, though I feel like this one could be a nightmare to recreate!
Probably :) I have not seen this in any Pulsar job. So maybe something specific to Docker/Pulsar? Have you tried submitting alphafold to your other non-GPU Pulsar nodes?
Please always version your containers and also pin your containers to a specific version, e.g. change `neoformit/alphafold:latest` to `neoformit/alphafold:0.1`.
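For illustration, a rough sketch of publishing a pinned tag alongside `latest` (the version number is just an example):

```bash
# Sketch: publish a pinned tag instead of relying only on :latest.
docker build -t neoformit/alphafold:0.1 .
docker push neoformit/alphafold:0.1
# Optionally keep :latest pointing at the same build:
docker tag neoformit/alphafold:0.1 neoformit/alphafold:latest
docker push neoformit/alphafold:latest
```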
This seems to have broken our setup. It looks like your new container is producing some trouble. You don't have an old version around?
@bgruening if there were no extra changes, the old container should correspond to the older Dockerfile (before https://github.com/usegalaxy-au/galaxy-local-tools/commit/78302ce1d79058f37b24c7b395de450f42631260)?
Ok, I might have fixed it. The problem was that the old repo, `neoformit/alphafold-galaxy`, completely disappeared.
I hacked this into our tool for now.
@martenson can you maybe test? Thanks.
Sorry, I was referring to the Docker Hub repo.
Oh, you mean `docker.io/neoformit/alphafold-galaxy` -- why is it pulling from there? It should use https://hub.docker.com/r/neoformit/alphafold/ ?
Oh, so it is deeper than just using `latest` - the whole repo changed. Thanks for explaining.
Seems so, yes.
Sorry for the poor communication; I thought it better to make a new Docker Hub repo as the new image is not Galaxy-specific. Since the old one was buggy I thought it best to blow it away, but that was obviously premature and inconsiderate! Again, sorry about that. Sloppy work on my part.
The latest Toolshed version points to the `neoformit/alphafold` image and contains a new flag, `--gpu_relax`, that is required in the latest Alphafold version. Can you update to the latest Toolshed version?
Good point on the `latest` tag, I'll fix that tomorrow morning and push another update to the Toolshed.
> Probably :) I have not seen this in any Pulsar job. So maybe something specific to Docker/Pulsar? Have you tried submitting alphafold to your other non-GPU Pulsar nodes?
That's not a bad idea, I'll create the issue and see if we can do some digging to add any info.
Thanks!
We are having some issues running Alphafold on our cloud GPU node. Alphafold runs, but in the logs we can see that the GPU is not being recognised, most likely due to incorrect or missing CUDA and/or jaxlib libraries.
We checked back through the logs on the development GPU machine (EU) and the same issue started occurring on that machine last month - we never realised, because the tool run completes on CPU alone:
We are currently working to resolve this issue.