google-deepmind / alphafold

Open source code for AlphaFold 2.
Apache License 2.0

Endless "[Perf]: MatMul reference implementation being executed generated" messages and dramatic inference slowdown upon updating to docker 27.3.1 #1021

Open lucajovine opened 2 months ago

lucajovine commented 2 months ago

Hello,

One of my systems, running Ubuntu 22.04, updated Docker to version 27.3.1 (build ce12230), after which running AlphaFold2 produces a never-ending series of messages like the following once it reaches the "Running predict with shape" stage:

I0923 14:01:56.716146 129352767932224 run_docker.py:258] 2024-09-23 12:01:56.715574: W external/xla/xla/service/cpu/onednn_matmul.cc:293] [Perf]: MatMul reference implementation being executed
I0923 14:01:56.741649 129352767932224 run_docker.py:258] 2024-09-23 12:01:56.741240: W external/xla/xla/service/cpu/onednn_matmul.cc:293] [Perf]: MatMul reference implementation being executed
(...)

This issue dramatically increases runtime. I was able to fix it by reverting Docker to an earlier version and preventing it from updating again:

> sudo apt-get install docker-ce=5:26.1.4-1~ubuntu.20.04~focal docker-ce-cli=5:26.1.4-1~ubuntu.20.04~focal containerd.io
> sudo apt-mark hold docker-ce docker-ce-cli
> docker --version
    Docker version 26.1.4, build 5650f9b

However, I thought I should nonetheless report it as other users may have the same issue, and you can most likely fix it in a straightforward way.
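For anyone applying this workaround, here is a minimal sketch for checking that the downgrade took effect. The function names `docker_version_from_string` and `is_reported_working` are hypothetical helpers, not part of Docker or AlphaFold, and the list of "working" versions contains only the two releases reported as working in this thread (26.1.4 and 24.0.0). Other versions may well be fine too.

```shell
# Hypothetical helpers for checking the Docker downgrade; not part of
# Docker or AlphaFold.

# Extract "26.1.4" from a line like "Docker version 26.1.4, build 5650f9b".
docker_version_from_string() {
    printf '%s\n' "$1" | sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p'
}

# Succeed only for versions that this thread reports as working.
# Assumption: this list is taken from the comments here, nothing more.
is_reported_working() {
    case "$1" in
        26.1.4|24.0.0) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a host with docker and apt installed:
#   v=$(docker_version_from_string "$(docker --version)")
#   is_reported_working "$v" && echo "Docker $v was reported as working"
#   apt-mark showhold    # should list docker-ce and docker-ce-cli
```

`apt-mark showhold` prints the packages currently on hold, so it is a quick way to confirm that the `apt-mark hold` step succeeded.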

Thanks,

Luca

DrRadan commented 1 month ago

Thanks @lucajovine for this solution. I suddenly started having the same issue, with sequences that previously ran without a problem, after I upgraded all outdated packages on my system at the beginning of this month. Downgrading Docker as you suggest fixed it. So, as a new server admin, I learned the importance of pinning (apt-mark hold)! Are there perhaps other AF dependencies that should be pinned?

BTW, a slight variation on your solution in case it is useful: I am also running Ubuntu 22.04, and AF runs successfully for me with this version of Docker: VERSION_STRING=5:24.0.0-1~ubuntu.22.04~jammy
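As a sketch of how that version string plugs into the commands from the original post, the hypothetical helper below (`print_pin_commands`, not an existing tool) only prints the install and hold commands for a given version string, so they can be reviewed before being run by hand. The package list mirrors the one used earlier in this thread.

```shell
# Hypothetical helper: print (do not run) the commands that would pin
# Docker to a given apt version string.
print_pin_commands() {
    version_string="$1"
    printf 'sudo apt-get install docker-ce=%s docker-ce-cli=%s containerd.io\n' \
        "$version_string" "$version_string"
    printf 'sudo apt-mark hold docker-ce docker-ce-cli\n'
}

# The version strings available on a given system can be listed with:
#   apt-cache madison docker-ce | awk '{ print $3 }'
print_pin_commands '5:24.0.0-1~ubuntu.22.04~jammy'
```

Printing rather than executing keeps the sketch safe to run anywhere. The version string must match the Ubuntu release of the machine, which is why listing the available strings first is worthwhile.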

lucajovine commented 1 month ago

Hi @DrRadan, glad this was useful! I had essentially frozen my conda environment for AF2, but because Docker is a system-level tool it got updated anyway, which caused the issue. If you are not already running AF2 in its own environment, that is what I would suggest doing to avoid similar issues.

hofmank0 commented 3 weeks ago

Are you sure that the slowdown was caused by the repeated error message? I had the same problem, but in our case the errors were caused by Docker not using the GPU, which obviously slows things down a lot. See my problem report here: https://github.com/google-deepmind/alphafold/issues/1035. However, the workaround with the older Docker version fixed things for me, too.
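One way to tell the two explanations apart is to look at where the warning comes from. The log lines in the original post point at `xla/service/cpu/onednn_matmul.cc`, which is in XLA's CPU code path, so a run that emits many of them is at least consistent with the computation happening on the CPU. The sketch below counts those warnings in a saved log. `count_matmul_warnings` is a hypothetical helper, and the log file is assumed to be the captured output of `run_docker.py`.

```shell
# Hypothetical helper: count the oneDNN reference-MatMul warnings in a
# saved log file (assumed to be captured run_docker.py output).
count_matmul_warnings() {
    # grep -c prints the count but exits non-zero when it is 0,
    # so "|| true" keeps the helper usable under "set -e".
    grep -c 'MatMul reference implementation being executed' "$1" || true
}

# Usage:
#   count_matmul_warnings alphafold_run.log
#
# To check separately whether containers can see the GPU at all
# (<cuda-image> is a placeholder for a CUDA base image of your choice):
#   docker run --rm --gpus all <cuda-image> nvidia-smi
```

A non-zero count alone does not prove the GPU is unused. Combined with a failing `nvidia-smi` check inside a container, though, it would support the GPU explanation over the logging-overhead one.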