xeniorn opened this issue 1 year ago
This seems pretty puzzling; we are running on an older CPU model and nothing else seems out of order.
MPI won't help and will only add additional latency. MMseqs2 can be used in target split mode without MPI, but we don't do this either.
Doing additional splits will also only add latency (1 split with enough RAM to fit the whole .idx file and some spare is ideal).
RAM disks should also not help, in the worst case they might even be harmful. The index is already fully memory mapped and resident in the OS page cache due to vmtouch.
Maybe the KVM virtualization has some kind of issue? We are running bare metal.
Running only the MMseqs2 prefilter with perf might help debugging the performance issue. I can write more detailed instructions later if needed.
I took a closer look at the CPU flags. I think your KVM is disallowing sse41 and avx2. At least sse41 is required for reasonable performance.
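In case it helps, a quick way to check which SIMD flags the guest actually sees (a minimal sketch; either command should work on a typical CentOS 7 install):
# If sse4_1/avx2 are missing here, KVM is masking them even though the host CPU supports them.
grep -m1 -oE 'sse4_1|sse4_2|avx2?' /proc/cpuinfo | sort -u
# or equivalently:
lscpu | grep -i '^flags'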
Okay, maybe that's not the issue, since cmake reports that the various SIMD instruction sets work:
-- Performing Test HAVE_AVX2_EXTENSIONS
-- Performing Test HAVE_AVX2_EXTENSIONS - Success
-- Performing Test HAVE_AVX_EXTENSIONS
-- Performing Test HAVE_AVX_EXTENSIONS - Success
-- Performing Test HAVE_SSE4_2_EXTENSIONS
-- Performing Test HAVE_SSE4_2_EXTENSIONS - Success
-- Performing Test HAVE_SSE4_1_EXTENSIONS
-- Performing Test HAVE_SSE4_1_EXTENSIONS - Success
-- Performing Test HAVE_SSE3_EXTENSIONS
-- Performing Test HAVE_SSE3_EXTENSIONS - Success
-- Performing Test HAVE_SSE2_EXTENSIONS
-- Performing Test HAVE_SSE2_EXTENSIONS - Success
-- Performing Test HAVE_SSE_EXTENSIONS
-- Performing Test HAVE_SSE_EXTENSIONS - Success
Happy to run any additional tests you might have in mind or provide any further info.
"Running only the MMseqs2 prefilter with perf might help debugging the performance issue."
Happy to take your code if provided, but I assume just copying out the first prefilter command printed in the log would do? E.g.
prefilter out/qdb_multi /scratch-cbe/users/juraj.ahel/db_copy/uniref30_2202_db.idx out/tmp/13555276572390281633/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 8 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 2 --compressed 0 -v 3
If it were a lack of SIMD support, could I somehow see that at runtime, either with the existing build or via some flag that makes it crash in the absence of AVX2?
I just remembered that the compilation was done on the login node of the cluster; possibly there is an issue with the compute nodes. I will try to rerun cmake there.
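A sketch for ruling out a login-node vs. compute-node mismatch (srun and the file names are assumptions, not from this thread). As far as I know, an AVX2 build on a CPU without AVX2 would most likely crash or refuse to start rather than silently run slowly, so a flags mismatch alone probably wouldn't explain the slowdown:
lscpu | grep -i '^flags' | tr ' ' '\n' | sort > flags_login.txt          # on the login node
srun lscpu | grep -i '^flags' | tr ' ' '\n' | sort > flags_compute.txt   # via the scheduler, on a compute node
diff flags_login.txt flags_compute.txt                                   # look for sse4_1 / avx2 differences
# Rough heuristic: does the installed binary contain AVX2 instructions at all?
objdump -d "$(command -v mmseqs)" | grep -cE 'vpermd|vpbroadcast'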
Exactly, run the prefilter command with:
perf record -g -F 250 -- mmseqs prefilter ...rest-of-arguments...
This will generate a perf.data file that contains the performance profiling information. It can be viewed with perf report or perf annotate.
You should compile with cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo so that mmseqs contains debug symbols. This is not 100% the same, since RelWithDebInfo uses -O2 and Release uses -O3, but it shouldn't really make a huge difference.
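Putting the above in one place (the prefilter arguments are whatever your log printed, as in the command quoted earlier):
# Build with debug symbols, then profile only the prefilter stage:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. && make -j
perf record -g -F 250 -- mmseqs prefilter ...rest-of-arguments...
perf report     # hotspot overview, sorted by overhead
perf annotate   # per-instruction view of the hottest functions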
perf is not available on the high mem node on which I'm running the real workload, but is oddly available on the login node. Would that output be helpful?
+ 51.95% 32.34% mmseqs_perf libc-2.17.so [.] __memcpy_ssse3_back
+ 45.94% 29.54% mmseqs_perf mmseqs [.] QueryMatcher::match
+ 35.99% 35.99% mmseqs_perf [unknown] [k] 0xffffffffa55bb7c0
0.63% 0.58% mmseqs_perf mmseqs [.] KmerGenerator::calculateArrayProduct
+ 0.56% 0.56% mmseqs_perf mmseqs [.] memcpy@plt
0.22% 0.22% mmseqs_perf ld-2.17.so [.] do_lookup_x
0.20% 0.20% mmseqs_perf libc-2.17.so [.] __memset_sse2
0.20% 0.00% mmseqs_perf [unknown] [.] 0x0000000000000030
0.15% 0.00% mmseqs_perf [unknown] [.] 0000000000000000
0.14% 0.14% mmseqs_perf libc-2.17.so [.] _IO_file_underflow@@GLIBC_2.2.5
0.13% 0.13% mmseqs_perf ld-2.17.so [.] _dl_lookup_symbol_x
0.13% 0.00% mmseqs_perf ld-2.17.so [.] _dl_sysdep_start
0.13% 0.00% mmseqs_perf ld-2.17.so [.] dl_main
0.13% 0.00% mmseqs_perf ld-2.17.so [.] _dl_relocate_object
0.06% 0.06% mmseqs_perf [unknown] [k] 0xffffffffa55c6280
I will see if I can get perf on the real node if needed.
That's with the RelWithDebInfo-compiled version, on the login node without vmtouch, where the workload took 600 s.
On the compute node with vmtouch it's 10 seconds, and the same goes for the Release-compiled version, so the build type indeed does not seem to make a difference.
Could you please also let me know what the runtime would be for such inputs on your machine? Just to have a target number in mind.
One other thing; I'm not sure if it's helpful, but it seems somewhat indicative to me. For mmseqs search, I'm getting a t = aN + b kind of dependence, where t is the runtime and N the number of sequences provided in the FASTA file. The a factor actually seems quite fine; it's b that makes it problematic. Otherwise it would be ~5 seconds to search through uniref for one 1:1 complex (150 + 70 aa), and presumably ~15 seconds for envdb, assuming linear scaling. Together with the non-search steps, that would fit the runtime we're getting from the colabfold server.
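A crude way to put numbers on a and b is to time the same search for a few query-set sizes. A sketch (the qdb_N query databases are hypothetical, made with mmseqs createdb from FASTA subsets; the search options are copied from the commands in this thread):
# Time the same search for 1, 10 and 60 query sequences and read a and b off the trend:
for n in 1 10 60; do
  /usr/bin/time -f "N=$n: %e s wall" \
    mmseqs search qdb_$n uniref30_2202_db res_$n tmp_$n \
    --db-load-mode 2 -s 8 --max-seqs 10000 --threads 2
done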
If you know the environment the colabfold server is running on and the library versions it was compiled against (or any other machine on which the expected runtime is achieved), I can try to reproduce it under more similar conditions.
Ok, got the perf data from the actual compute node, see attached. perf.zip
Could this have anything to do with NUMA topology? Are your servers NUMA-less? Considering that all our high-mem nodes have multiple NUMA nodes, with at most ~370G per NUMA node, I'm thinking sub-optimal memory access might be the cause.
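To poke at the NUMA angle directly, numactl can pin both the CPUs and the memory allocations of a test run to a single node. A sketch (node 0 is arbitrary, qdb/res/tmp are placeholders, and pages already locked by vmtouch stay on whichever nodes they were faulted in):
numactl --hardware      # show the NUMA layout and free memory per node
# Pin CPU and memory of a test run to one NUMA node:
numactl --cpunodebind=0 --membind=0 \
  mmseqs search qdb uniref30_2202_db res tmp \
  --db-load-mode 2 -s 8 --max-seqs 10000 --threads 2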
Sorry, I was traveling. I don't see anything out of the ordinary in your screenshot. I still don't really have an idea what's going on. I repeated the UniRef search with your query on the MSA server:
Command being timed: "mmseqs search mmseq_in_db uniref30_2202_db mmseq_in_res mmseq_in_tmp --threads 1 --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000"
User time (seconds): 13.10
System time (seconds): 21.13
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:34.23
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 19188072
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 9091215
Voluntary context switches: 57
Involuntary context switches: 503
Swaps: 0
File system inputs: 0
File system outputs: 8888
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
One thing we do differently, since we still have enough spare CPU capacity on the server, is to enable some limited multithreading and parallel processing. We use --threads 2, which hides quite a bit of overhead in your use case:
User time (seconds): 12.32
System time (seconds): 20.86
Percent of CPU this job got: 132%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.12
And we also start the env-db and template searches in parallel to the expand-align steps (the path.colabfold.parallelstages parameter in the API server). This hides quite a bit of latency. However, the initial search is still a lot faster for us despite worse CPU/RAM specs.
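Not the server's actual implementation, just a shell-level sketch of running the env-db search concurrently (database names as used elsewhere in this thread; the commented line stands in for the existing uniref post-processing):
# Start the env-db search in the background...
mmseqs search qdb colabfold_envdb_202108_db res_env tmp_env \
  --db-load-mode 2 -s 8 --max-seqs 10000 --threads 2 &
env_pid=$!
# ...run the uniref30 expand/align steps in the foreground here...
wait "$env_pid"   # ...and only then combine/convert the results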
Can you check if vmtouch has actually put everything into RAM? htop should look something like this:
root 20 0 605G 605G 605G S 0.0 15.0 6:22.98 /home/steineggerlab/miniforge3/bin/vmtouch -f -t -l -m 4000 colabfold_envdb_202108.idx pdb70_220313.idx uniref30_2...
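To double-check from the vmtouch side as well, vmtouch -v reports how much of a file is currently resident; something like:
vmtouch -v uniref30_2202_db.idx   # expect "Resident Pages: ... 100%"
# The locked daemon above (-l) should keep it pinned there.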
I don't think it's a NUMA issue either. Our system should be affected worse by NUMA issues, since it's an 8-socket system.
I am not finding the numbers now, but I had tested the effect of multiprocessing, both by giving it between 1 and 64 CPUs (with a fixed 2 threads per CPU) and by giving it between 1 and 8 threads on a single CPU. Neither made a notable difference with this two-sequence input. With somewhat larger batches (e.g. 60) I could see it speeding up, but with diminishing returns once Nproc > Nseq/2 or so.
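A minimal version of that scan looks roughly like this (qdb_multi and the database as in the prefilter command quoted above):
# Same two-sequence input, varying only the thread count:
for t in 1 2 4 8; do
  /usr/bin/time -f "threads=$t: %e s wall" \
    mmseqs search qdb_multi uniref30_2202_db res_t$t tmp_t$t \
    --db-load-mode 2 -s 8 --max-seqs 10000 --threads $t
done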
We do intend to process the uniref and envdb searches in parallel, as you say, and also to reuse the uniref30 search (your app runs it once for the MSA/monomer step and then again for the pair step, if I'm not mistaken).
Thank you for the info about the NUMA setup, I'll drop that line of exploration then.
For RAM: I had checked with vmtouch -v [file] and it claimed that all pages were resident in RAM. I can't double-check with htop right now, as all the high-mem nodes are heavily used at the moment, but I can't imagine it would report something different. The timing with vmtouch is very consistent, and for us it's an order of magnitude slower when reading from disk.
htop reports that all of it (the uniref30 db .idx file) is in memory:
CPU PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
4 65761 xxx.xxx 20 0 156G 156G 156G S 0.0 10.8 1:46.40 /software/extra/em/vmtouch/1.3.1/bin/vmtouch -ldw (...)
@milot-mirdita, are you by any chance aware of a specific user who is running a local mmseqs server with performance equivalent to yours? I think you could be on the right track with the KVM, but I'd prefer to first check in with others to see whether there's a correlation between KVM vs. bare metal and this performance wall.
Not personally. But we have used the server on two completely differently specced servers, and the performance is okay both times:
As you can see, the two systems are very different. However, we never tried to run on top of any virtualization; that seems to be the only really different variable. (The RAM on both systems is overkill, but these are the servers we have.)
Could you try to run an SSE4.1 build instead of an AVX2 build? I wonder if the virtualization is intercepting some AVX2 instructions and making everything slow.
You can compile MMseqs2 with:
cmake -DHAVE_SSE4_1=1 .......
make ......
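For completeness, a full invocation combining the flags mentioned in this thread might look like this (out-of-source build directory assumed):
# SSE4.1-only build with debug symbols; no AVX2 instructions for the hypervisor to mishandle:
mkdir build-sse41 && cd build-sse41
cmake -DHAVE_SSE4_1=1 -DHAVE_MPI=0 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=. ..
make -j"$(nproc)" && make install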
Thanks for the response!
I had already done this via Docker, which compiles all 3 versions (AVX2, SSE42, SSE2): AVX2 was definitely the fastest of the three, and all three behaved the same as the native build. I don't think it's likely I'd see different results outside of a container. Do you know if running mmseqs with Docker on your machine makes it drastically slower? Given that it's the "same speed" in my tests, it's hard to imagine that Docker adds anything of relevance to the equation.
I fully agree with you that the hardware can't be the source of the discrepancy here.
Hey, I'm having a similar problem... Did you find a solution?
@crisdarbellay no, sorry. I don't have access to a "bare metal" server to test this with, but my guess is that Milot's guess that it has to do with virtualization is correct. Are you running it with KVM or not?
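A quick way to check; either command should tell you whether you're on bare metal or under a hypervisor:
systemd-detect-virt          # prints e.g. "kvm", or "none" on bare metal
lscpu | grep -i hypervisor   # e.g. "Hypervisor vendor: KVM", as in the lscpu output further down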
Thank you for your response. I'll check this!
Some issues / unexpected behavior when running mmseqs search and other features used in ColabFold. The low speed happens in the mmseqs step, though, so I think this is the place to ask.
Unexpected observations:
Welcoming any input! Many thanks!
Expected Behavior
Runtime of mmseqs2 for a given input similar to response time from the colabfold server.
Current Behavior
Considerably slower. For a given input (~250 aa) the server returns a full result in 20-30 s, paired + unpaired results (so, including at least one search each against uniref30 and colabfold_envdb, plus a number of expandaln/align calls). The local version takes 60-90 seconds to run just the initial mmseqs search against uniref30 when using vmtouch and db load mode 2, and ~300 seconds against colabfold_envdb. Db load mode 3 is somewhat slower; db load modes 0 and 1 are much slower, as expected. The index was created without splits, and the mmseqs search step is timed after vmtouch and other preparations are complete. The runtime is independent of both the thread count and the number of CPUs provided. Splitting the target db index 2-, 4-, 8- or 16-fold and running in parallel with MPI does not speed it up either (it slightly slows things down). Placing the entire run directory on a fast SSD or in RAM (/dev/shm) does not change the speed. This happens with the Docker version (AVX, SSE2 or SSE4.1 builds alike) and with the self-compiled version.
This also happens when a random amino acid sequence of equivalent length is used as input, so I'm excluding caching on the ColabFold API server side as the reason it is so fast.
Steps to Reproduce (for bugs)
MMseqs Output (for bugs)
attached out_apptainer_run.log
Context
Your Environment
mmseqs2 docker: 7e2840992948ee89dcc336522dc98a74fe0adf00
self-compiled: ed4c55fa21c402e0a9497e5db5cd4b687a75fa02 ed4c55fa21c402e0a9497e5db5cd4b687a75fa02-MPI
For self-compiled and Homebrew: Compiler and CMake versions used and their invocation:
cmake version 3.21.0
cmake -DHAVE_MPI=0 -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. .. && make && make install
cmake -DHAVE_MPI=1 -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. .. && make && make install
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): tried multiple servers, with available memory ranging from 100G to 2000G; support for AVX2/SSE is all there (see the attached cmake output: cmake_out.txt).
Operating system and version:
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Codename: Core
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 152
On-line CPU(s) list: 0-151
Thread(s) per core: 2
Core(s) per socket: 19
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
Stepping: 4
CPU MHz: 2000.000
BogoMIPS: 4000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-37
NUMA node1 CPU(s): 38-75
NUMA node2 CPU(s): 76-113
NUMA node3 CPU(s): 114-151
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology