avilella closed this issue 1 year ago
Hi @avilella
To troubleshoot, we might have to eliminate a few things from consideration.
1) Please check that the installed CUDA/cuDNN versions match the GPU.
2) Please reinstall the drivers.
3) Please check whether the GPUs are broken/faulty. There may be tools available for this, for example https://github.com/wilicc/gpu-burn (I have never used it, so I cannot comment on it; use it at your own risk).
4) Test the AF2 runs using small protein sequences and see whether they succeed (P1000s have 4 GB of memory, so we need to make sure that this is not an out-of-memory issue).
At the moment I can think of only these things. Please check and let me know and I will try to troubleshoot as much as I can.
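For checks (1) and (4), a small script along these lines might help record the driver version and per-GPU memory use while a job is running. This is only a sketch: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration, not part of this repo.

```python
import subprocess

# Fields requested from nvidia-smi; memory values come back in MiB.
QUERY_FIELDS = ["index", "name", "driver_version", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the requested fields
        row = dict(zip(QUERY_FIELDS, parts))
        for key in ("memory.used", "memory.total"):
            # Some setups report "[N/A]"; keep None rather than failing.
            row[key] = int(row[key]) if row[key].isdigit() else None
        gpus.append(row)
    return gpus


def query_gpus():
    """Run nvidia-smi once and return one dict per visible GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` every few seconds during a run and logging the result would show whether memory use gets close to the 4 GB limit just before a job gets stuck.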
Thanks, I'll check it out.
(4) we can discard, as memory usage never goes above 3 GB for the jobs I am submitting.
(3) I am intrigued by the possibility of faulty GPUs: I'll swap the 2 cards for 2 slightly different cards and run on the same Ubuntu 21.04 with the same drivers, and hopefully this will clarify whether (3) is the problem rather than (1) or (2).
Thanks for the detailed enumeration, I'll follow up with the results of the investigation in case it helps other people with the same problem.
(In reply to the comment above, posted by Sanjay Kumar Srikakulam on Fri, Oct 1, 2021 at 8:34 AM.)
This could be unrelated to this repo and instead be just some sort of driver issue, but I'll post the error just in case someone can help.
We've installed this repo on an Ubuntu 21.04 laptop with Thunderbolt and an eGPU with 2 NVIDIA Quadro P1000 cards.
We kick off two parallel jobs, one on node 0 and another on node 1, and they mostly go well, but after a few minutes or hours, one of the jobs sometimes gets stuck with the error below:
Any ideas welcome, thanks.
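For anyone trying to reproduce the two-job setup described above, one common way to pin each job to a single card is the standard `CUDA_VISIBLE_DEVICES` environment variable. The sketch below assumes that approach; it is not necessarily how the jobs in this issue were launched, and `gpu_env` and `launch_on_gpu` are hypothetical helper names.

```python
import os
import subprocess


def gpu_env(gpu_id, base_env=None):
    """Return a copy of the environment restricted to one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA only exposes the listed device(s) to the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def launch_on_gpu(cmd, gpu_id, log_path):
    """Start `cmd` (a list of arguments) on one GPU, logging to log_path.

    `cmd` is whatever command you normally use to start a prediction;
    it is deliberately left to the caller here.
    """
    with open(log_path, "w") as log:
        # The child inherits the log file descriptor, so closing it in
        # the parent after Popen returns is safe.
        return subprocess.Popen(
            cmd, env=gpu_env(gpu_id), stdout=log, stderr=subprocess.STDOUT
        )
```

Keeping a separate log file per GPU makes it easier to see which card a stuck job was running on, which helps with the faulty-hardware check in (3).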