soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

mmseqs local runs considerably slower than mmseqs colabfold api regardless of optimizations #658

Open xeniorn opened 1 year ago

xeniorn commented 1 year ago

I'm seeing some unexpected behavior when running mmseqs search and the other features used in ColabFold. Since the slowdown happens in the mmseqs step, I think this is the place to ask.

Unexpected observations:

  1. running mmseqs locally on our cluster is much slower than the responses we get from the colabfold server for equivalent inputs
  2. mmseqs search speed does not scale at all with added threads or added physical cores for small inputs (1-2 sequences) - is this expected? Will it only speed up with a larger number of input sequences?

Welcoming any input! Many thanks!

Expected Behavior

Runtime of mmseqs2 for a given input similar to response time from the colabfold server.

Current Behavior

Considerably slower. For a given input (~250 aa) the server returns a full result (paired + unpaired) in 20-30 s, which includes at least one search each against uniref30 and colabfold_envdb, plus a number of expandaln / align calls. The local version takes 60-90 seconds just for the initial mmseqs search against uniref30 when using vmtouch and db load mode 2, and ~300 seconds against colabfold_envdb. Db load mode 3 is somewhat slower; db load modes 0 and 1 are much slower, as expected. The index was created without splits, and the mmseqs search step is timed only after vmtouch and the other preparations are complete.

The runtime is independent of both thread count and the number of CPUs provided. Splitting the target db index 2-, 4-, 8-, or 16-fold and running in parallel with MPI does not speed it up either (it slightly slows down). Placing the entire run directory on a fast SSD or in RAM (/dev/shm) does not change the speed. This happens with the docker version (avx, sse2, and sse41 builds alike) and with the self-compiled version.

This also happens when a random aminoacid sequence of equivalent length is used as input, so I'm excluding caching on the colabfold api server side as a cause for it to be so fast.
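
For reference, a random query of the same length can be generated along these lines (a sketch; the file name is arbitrary):

# Sketch: generate a random ~250 aa query from the 20 standard amino acids
# to rule out server-side caching. The output file name is arbitrary.
awk 'BEGIN {
    srand();
    aa = "ACDEFGHIKLMNPQRSTVWY";
    printf(">random_250aa\n");
    for (i = 0; i < 250; i++) printf("%s", substr(aa, int(rand()*20) + 1, 1));
    printf("\n");
}' > random_query.fasta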

Steps to Reproduce (for bugs)

cat <<'EOF' >mmseq_in.fasta
>UbcH7_Hs_1
MAASRRLMKELEEIRKCGMKNFRNIQVDEANLLTWQGLIVPDNPPYDKGAFRIEINFPAEYPFKPPKITFKTKIYHPNIDEKGQVCLPVISAENWKPATKTDQVIQSLIALVNDPQPEHPLRADLAEEYSKDRKKFCKNAEEFTKKYGEKRPVD
>Ub
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
EOF
mkdir -p out/tmp
mmseqs createdb mmseq_in.fasta out/qdb
vmtouch -ldw /path/to/database/uniref30_2202_db*idx*
mmseqs search out/qdb /path/to/database/uniref30_2202_db out/res out/tmp --threads 1 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000
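
To time just the search step after vmtouch has finished, the command can be wrapped in GNU time, e.g. (assuming GNU time is installed as /usr/bin/time):

# Sketch: measure wall-clock time and page faults of the search step alone,
# assuming GNU time is available at /usr/bin/time
/usr/bin/time -v mmseqs search out/qdb /path/to/database/uniref30_2202_db out/res out/tmp \
  --threads 1 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000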

MMseqs Output (for bugs)

attached out_apptainer_run.log


Your Environment


mmseqs2 docker: 7e2840992948ee89dcc336522dc98a74fe0adf00

self-compiled: ed4c55fa21c402e0a9497e5db5cd4b687a75fa02 ed4c55fa21c402e0a9497e5db5cd4b687a75fa02-MPI

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                152
On-line CPU(s) list:   0-151
Thread(s) per core:    2
Core(s) per socket:    19
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
Stepping:              4
CPU MHz:               2000.000
BogoMIPS:              4000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-37
NUMA node1 CPU(s):     38-75
NUMA node2 CPU(s):     76-113
NUMA node3 CPU(s):     114-151
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology

milot-mirdita commented 1 year ago

This seems pretty puzzling; we are running on an older CPU model, and nothing else seems out of order.

MPI won't help and will only add additional latency. MMseqs2 can be used in target split mode without MPI, but we don't do this either.

Doing additional splits will also only add latency (1 split with enough RAM to fit the whole .idx file and some spare is ideal).

RAM disks should also not help, in the worst case they might even be harmful. The index is already fully memory mapped and resident in the OS page cache due to vmtouch.
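
For completeness, a rough sketch of the setup described above (a single-split precomputed index, pinned in the page cache); the database path is a placeholder, and I'm assuming createindex accepts the --split parameter here:

# Sketch: build a single-split index and lock it in the page cache (paths are placeholders)
mmseqs createindex /path/to/database/uniref30_2202_db tmp_index --split 1
free -g                                                # confirm enough free RAM for the whole .idx
vmtouch -ldw /path/to/database/uniref30_2202_db*idx*   # lock the index pages in memory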

Maybe the KVM virtualization has some kind of issue? We are running bare metal.

Running only the MMseqs2 prefilter with perf might help debugging the performance issue. I can write more detailed instructions later if needed.

milot-mirdita commented 1 year ago

I took a closer look at the CPU flags. I think your KVM is disallowing sse41 and avx2. At least sse41 is required for reasonable performance.
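
A quick way to check which SIMD flags the guest kernel actually reports is, for example:

# Sketch: count how many of the relevant SIMD flags /proc/cpuinfo reports per core
grep -o -w -E 'sse4_1|sse4_2|avx|avx2' /proc/cpuinfo | sort | uniq -c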

milot-mirdita commented 1 year ago

Okay maybe that's not the issue since cmake claims to have found the various SIMD instruction sets to work:

-- Performing Test HAVE_AVX2_EXTENSIONS
-- Performing Test HAVE_AVX2_EXTENSIONS - Success
-- Performing Test HAVE_AVX_EXTENSIONS
-- Performing Test HAVE_AVX_EXTENSIONS - Success
-- Performing Test HAVE_SSE4_2_EXTENSIONS
-- Performing Test HAVE_SSE4_2_EXTENSIONS - Success
-- Performing Test HAVE_SSE4_1_EXTENSIONS
-- Performing Test HAVE_SSE4_1_EXTENSIONS - Success
-- Performing Test HAVE_SSE3_EXTENSIONS
-- Performing Test HAVE_SSE3_EXTENSIONS - Success
-- Performing Test HAVE_SSE2_EXTENSIONS
-- Performing Test HAVE_SSE2_EXTENSIONS - Success
-- Performing Test HAVE_SSE_EXTENSIONS
-- Performing Test HAVE_SSE_EXTENSIONS - Success
xeniorn commented 1 year ago

Happy to run any additional tests you might have in mind or provide any further info.

"Running only the MMseqs2 prefilter with perf might help debugging the performance issue."

Happy to take your code if provided, but I assume just copying out the first prefilter command printed in the log would do? E.g.

prefilter out/qdb_multi /scratch-cbe/users/juraj.ahel/db_copy/uniref30_2202_db.idx out/tmp/13555276572390281633/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 2 --compressed 0 -v 3

If it were a lack of SIMD support, could I somehow see that at runtime, either with the existing build or with some flag that makes it crash in the absence of AVX2?

I just remembered that the compilation was done on the login node of the cluster, so possibly there is an issue with the compute nodes. I will try to rerun cmake there.
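
One rough heuristic (not an official check, just something I'd try) would be to count 256-bit ymm register uses in the binary's disassembly; a count of zero would suggest no AVX/AVX2 code paths were compiled in:

# Rough sketch: a non-zero count suggests the binary contains AVX/AVX2 instructions
objdump -d "$(command -v mmseqs)" | grep -c 'ymm'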

milot-mirdita commented 1 year ago

Exactly, run the prefilter command with:

perf record -g -F 250 -- mmseqs prefilter ....rest-of-arguments---

This will generate a perf.data file that contains the performance profiling information.

It can be viewed with

perf report
or 
perf annotate

You should compile with cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo so that mmseqs contains debug symbols. This is not 100% identical to the Release build, as RelWithDebInfo uses -O2 while Release uses -O3, but it shouldn't make a huge difference.
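
For reference, a typical out-of-source build with debug symbols might look roughly like this (standard CMake workflow; exact flags may need adjusting for your system):

# Sketch: out-of-source build with debug symbols, then profile the prefilter step
git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2 && mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=. ..
make -j 8 && make install
perf record -g -F 250 -- bin/mmseqs prefilter <arguments-from-the-log>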

xeniorn commented 1 year ago

perf is not available on the high-mem node on which I'm running the real workload, but it is, oddly, available on the login node. Would that output be helpful?

+   51.95%    32.34%  mmseqs_perf  libc-2.17.so         [.] __memcpy_ssse3_back
+   45.94%    29.54%  mmseqs_perf  mmseqs               [.] QueryMatcher::match
+   35.99%    35.99%  mmseqs_perf  [unknown]            [k] 0xffffffffa55bb7c0
     0.63%     0.58%  mmseqs_perf  mmseqs               [.] KmerGenerator::calculateArrayProduct
+    0.56%     0.56%  mmseqs_perf  mmseqs               [.] memcpy@plt
     0.22%     0.22%  mmseqs_perf  ld-2.17.so           [.] do_lookup_x
     0.20%     0.20%  mmseqs_perf  libc-2.17.so         [.] __memset_sse2
     0.20%     0.00%  mmseqs_perf  [unknown]            [.] 0x0000000000000030
     0.15%     0.00%  mmseqs_perf  [unknown]            [.] 0000000000000000
     0.14%     0.14%  mmseqs_perf  libc-2.17.so         [.] _IO_file_underflow@@GLIBC_2.2.5
     0.13%     0.13%  mmseqs_perf  ld-2.17.so           [.] _dl_lookup_symbol_x
     0.13%     0.00%  mmseqs_perf  ld-2.17.so           [.] _dl_sysdep_start
     0.13%     0.00%  mmseqs_perf  ld-2.17.so           [.] dl_main
     0.13%     0.00%  mmseqs_perf  ld-2.17.so           [.] _dl_relocate_object
     0.06%     0.06%  mmseqs_perf  [unknown]            [k] 0xffffffffa55c6280

perf.data.txt

I will see if I can get perf on the real node if needed.

xeniorn commented 1 year ago

That's with the RelWithDebInfo-compiled version, on the login node without vmtouch, where the workload took 600 s.

On the compute node with vmtouch it's 10 seconds; the same goes for the Release-compiled version, so the build type indeed does not seem to make a difference.

xeniorn commented 1 year ago

Could you please also let me know what the runtime would be for such inputs on your machine? Just to have a target number in mind.

xeniorn commented 1 year ago

One other thing; I'm not sure if it's helpful, but to me it seems somewhat indicative. For mmseqs search I'm getting a t = a*N + b kind of dependence, where t is the runtime and N the number of sequences in the input fasta file. The a factor actually seems quite fine; it's the b that makes it problematic. Otherwise it would be ~5 seconds to search uniref for one 1:1 complex (150+70 aa), and presumably ~15 seconds for envdb, assuming linear scaling. Together with the non-search steps, that would fit into the runtime we're getting from the colabfold server.

[image: plot of mmseqs search runtime vs. number of input sequences]
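
A loop of roughly this shape can reproduce such a measurement (a sketch; the q_N.fasta files with N sequences each are hypothetical inputs, and GNU time is assumed at /usr/bin/time):

# Sketch: wall-clock time of mmseqs search vs. number of query sequences
for n in 1 2 4 8 16 32 64; do
  rm -rf out_$n && mkdir -p out_$n/tmp
  mmseqs createdb q_${n}.fasta out_$n/qdb
  /usr/bin/time -f "N=${n} wall=%e s" \
    mmseqs search out_$n/qdb /path/to/database/uniref30_2202_db out_$n/res out_$n/tmp \
    --threads 1 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000
done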

If you know the environment the colabfold server is running on and the library versions it was compiled against (or any other machine on which the expected runtime is achieved), I can try to reproduce it under more similar conditions.

xeniorn commented 1 year ago

Ok, got the perf data from the actual compute node, see attached. perf.zip

[image: perf profile screenshot from the compute node]

xeniorn commented 1 year ago

Could this have anything to do with NUMA topology? Are your servers NUMA-less? Considering that all our high-mem nodes have multiple NUMA nodes, with at most ~370G per NUMA node, I'm thinking sub-optimal memory access might be the cause.

milot-mirdita commented 1 year ago

Sorry, I was traveling. I don't see anything out of the ordinary in your screenshot, and I still don't really have an idea what's going on. I repeated the UniRef search with your query on the MSA server:

        Command being timed: "mmseqs search mmseq_in_db uniref30_2202_db mmseq_in_res mmseq_in_tmp --threads 1 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000"
        User time (seconds): 13.10
        System time (seconds): 21.13
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:34.23
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 19188072
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 9091215
        Voluntary context switches: 57
        Involuntary context switches: 503
        Swaps: 0
        File system inputs: 0
        File system outputs: 8888
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

One thing we do differently, since we still have enough spare CPU capacity on the server, is to enable some limited multithreading and parallel processing.

We use --threads 2, which will hide quite a bit of overhead in your use-case:

        User time (seconds): 12.32
        System time (seconds): 20.86
        Percent of CPU this job got: 132%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.12

And we also start the env-db and template searches in parallel with the expand-align steps (the path.colabfold.parallelstages parameter in the API server). This hides quite a bit of latency. However, the initial search is still a lot faster for us despite worse CPU/RAM specs.
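
In a local pipeline the same effect can be approximated with plain shell job control, e.g. (a sketch; the envdb path is a placeholder):

# Sketch: start the envdb search in the background while the uniref30
# expandaln / align steps run in the foreground (paths are placeholders)
mmseqs search out/qdb /path/to/colabfold_envdb out/res_env out/tmp_env \
  --threads 2 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000 &
env_pid=$!
# ... run the uniref30 expandaln / align steps here ...
wait "$env_pid"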

Can you check whether vmtouch has actually put everything into RAM? htop should look something like this:

 root       20   0  605G  605G  605G S  0.0 15.0  6:22.98  /home/steineggerlab/miniforge3/bin/vmtouch -f -t -l -m 4000  colabfold_envdb_202108.idx pdb70_220313.idx uniref30_2...
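
Alternatively, vmtouch itself can report per-file residency, for example:

# Sketch: print the fraction of each index file that is resident in the page cache
vmtouch -v /path/to/database/uniref30_2202_db*idx*
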
milot-mirdita commented 1 year ago

I don't think it's a NUMA issue either. Our system should be affected worse by NUMA issues, since it's an 8-socket system.

xeniorn commented 1 year ago

I am not finding the numbers now, but I had tested the effect of multiprocessing, both by giving it between 1 and 64 CPUs (with a fixed 2 threads per CPU) and by giving it between 1 and 8 threads on a single CPU; see the sketch below. Neither made a notable difference with this two-sequence input. With somewhat larger batches (e.g. 60 sequences) I could see it speeding up, but with diminishing returns once Nproc > Nseq/2 or so.
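
The thread test was a loop of roughly this shape (a sketch; out/qdb and the db path are from the example above, GNU time assumed at /usr/bin/time):

# Sketch: wall-clock time of the same two-sequence search at different thread counts
for t in 1 2 4 8; do
  rm -rf out_t$t && mkdir -p out_t$t/tmp
  /usr/bin/time -f "threads=${t} wall=%e s" \
    mmseqs search out/qdb /path/to/database/uniref30_2202_db out_t$t/res out_t$t/tmp \
    --threads $t --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000
done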

We do intend to run the uniref and envdb searches in parallel, as you say, and also to reuse the uniref30 search (your app runs it once for the MSA/monomer step and then again for the pairing step, if I'm not mistaken).

Thank you for the info about the NUMA setup, I'll drop that line of exploration then.

For RAM - I had checked with vmtouch -v [file] and it claimed all pages were resident in RAM. I can't double-check with htop right now, as all the high-mem nodes are heavily used at the moment, but I can't imagine it would report something different. The timing with vmtouch is very consistent, and reading from disk is an order of magnitude slower for us.

xeniorn commented 1 year ago

htop reports that the whole uniref30 db .idx file is in memory:

CPU    PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  4  65761 xxx.xxx  20   0  156G  156G  156G S  0.0 10.8  1:46.40 /software/extra/em/vmtouch/1.3.1/bin/vmtouch -ldw (...)
xeniorn commented 1 year ago

@milot-mirdita are you by any chance aware of a specific user who is running a local mmseqs server with performance equivalent to yours? I think you could be on the right track with KVM, but I'd prefer to first check with others to see whether there is a correlation between KVM vs. bare metal and this performance wall.

milot-mirdita commented 1 year ago

Not personally. But we have used the server on two completely differently spec-ed servers and the performance is okay both times:

As you can see, the two systems are very different. However, we never tried to run on top of any virtualization; that seems to be the only really different variable. (The RAM on both systems is overkill, but these are the servers we have.)

Could you try to run a SSE41 build and not an AVX2 build? I wonder if the virtualization is intercepting some AVX2 instruction and making everything slow.

You can compile MMseqs2 with:

cmake -DHAVE_SSE4_1=1 .......
make ......
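
Spelled out, the SSE4.1-only build might look roughly like this (a sketch; exact options may vary):

# Sketch: force an SSE4.1 build instead of AVX2 (exact flags may vary)
mkdir -p build_sse41 && cd build_sse41
cmake -DHAVE_SSE4_1=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. ..
make -j 8 && make install
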
xeniorn commented 1 year ago

Thanks for the response!

I had already done this via docker, which compiles all three versions (AVX2, SSE42, SSE2) - AVX2 was definitely the fastest of the three, and all three behaved the same as the native build. I don't think I'd see a different result without a container. Do you know if running mmseqs with docker on your machine makes it drastically slower? Given that it's the "same speed" in my tests, it's hard to imagine that docker adds anything of relevance to the equation.

I fully agree with you that the hardware can't be the source of the discrepancy here.

crisdarbellay commented 6 months ago

Hey, I'm having a similar problem... Did you find a solution?

xeniorn commented 6 months ago

@crisdarbellay no, sorry. I don't have access to a "bare metal" server to test this with, but I suspect Milot's guess that it has to do with virtualization is correct. Are you running it with KVM or not?

crisdarbellay commented 6 months ago

Thank you for your response. I'll check this!