karel-brinda / Phylign

Alignment against all pre-2019 bacteria on laptops within a few hours (former MOF-Search)
http://brinda.eu/mof

Performance logging of ARGannot_r3.nonamb and all_ebi_plasmids.fa against 661K HQ indexes #52

Closed — leoisl closed this issue 1 year ago

leoisl commented 2 years ago

First run of mof-search querying ARGannot_r3.nonamb against the 661K HQ indexes completed without any error! We used 1 thread for COBS and allowed at most 4 decompression jobs to run simultaneously. Performance is not what I expected: it took 9h47m and used 7 GB of RAM, but we will improve this very soon.

The main takeaway is that we managed to query the 661K HQ indexes with absolutely 0 errors, meaning https://github.com/karel-brinda/mof-search/pull/34 worked (at least on Linux).

karel-brinda commented 2 years ago

For the purpose of documentation, could you provide your hardware configuration and the configuration that was used with mof-search (# of COBS threads, etc.)? This might explain the discrepancy.

Also, do you know how much time was spent in each phase? (I usually run and measure make match and make map separately.)

For comparison, my previous values were the following:

1) matching using COBS

real    67m53.940s
user    116m44.494s
sys     49m49.453s

2) batch align with minimap & 8 cores

real    245m12.433s
user    1401m45.748s
sys     297m22.271s
leoisl commented 2 years ago

Yeah, I will do this; the main message was just that things are working with no issues. I just managed to make COBS work completely on bioconda on Linux and Mac, including tests. Until yesterday, all tests were failing on Mac; now everything is working. I just asked the bioconda team to review and merge this, so soon we will have COBS on bioconda, which will solve lots of issues.

Meanwhile I will work on:

karel-brinda commented 2 years ago

That's really cool that COBS already works. Once its packaging is finished, I'll resume testing the computations on my computers.

leoisl commented 2 years ago

I am also waiting for cobs v0.2.0 to be up on bioconda to resume testing.

leoisl commented 2 years ago

It seems it takes some 4 days for the package to be up, so I think you will only be able to test it after your vacation!

karel-brinda commented 2 years ago

Interestingly, for Linux it's already there: https://anaconda.org/bioconda/cobs/files

karel-brinda commented 2 years ago

And for osx too: https://anaconda.org/bioconda/cobs/0.2.0/download/osx-64/cobs-0.2.0-ha276ee6_0.tar.bz2

leoisl commented 2 years ago

Hey, sorry I am not home. Can you see the new version with conda search -c bioconda cobs?


karel-brinda commented 2 years ago

No, it's not yet uploaded to Anaconda (https://conda.anaconda.org/bioconda/osx-64/) and I'm not sure how frequently this is done.

leoisl commented 2 years ago

Using the code in PR https://github.com/karel-brinda/mof-search/pull/80 and these relevant configs:

max_decomp_jobs: 4
cobs_thr: 2
minimap_thr: 2

I got these benchmark values:

  1. match (this is decompress+COBS match)

    Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_decomp_jobs=4 max_download_jobs=8 --config benchmark=True -- match
    real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs
    21142.03 1540.98 5918.97 35% 7411168 458151880 2067547352
  2. map

    Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_decomp_jobs=4 max_download_jobs=8 --config benchmark=True -- map
    real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs
    16905.88 6919.36 68620.11 446% 208296 62321536 5568368

~5.9h to match and ~4.7h to map. The important thing to realise here is that, in both benchmarks, we could have at most 3 processes running (my machine has 8 cores, 1 is allocated to Snakemake if I am not wrong, leaving 7 for the jobs, and each job here takes 2 threads). With 3 simultaneous processes using at most 2 threads each, our percent_CPU is capped at 600%. For the map part, this looks good: we got 446% CPU, which is somewhat close to this maximum. For the match part, though, this is abysmal: we got just 35% CPU, which is ~17 times less parallelism than the theoretical best. Note also the discrepancy between the real and sys/user times, the huge number of FS outputs for match (2B for match vs 5M for map, 400 times more), and the FS inputs (458M for match vs 62M for map, 7 times more). All of this points to the match processes spending a lot of their time blocked waiting on I/O.
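The 35% vs. 446% utilization figures can be reproduced from the real/sys/user columns of the benchmark rows above; a quick sketch, using GNU time's convention of truncating the percentage:

```python
# percent_CPU as GNU time reports it: (user + sys) / real * 100, truncated.
def percent_cpu(real_s, sys_s, user_s):
    return int((user_s + sys_s) / real_s * 100)

match_pct = percent_cpu(21142.03, 1540.98, 5918.97)
map_pct = percent_cpu(16905.88, 6919.36, 68620.11)
print(match_pct, map_pct)  # 35 446, matching the benchmark rows

# With 3 concurrent jobs of 2 threads each the cap is 600%;
# the match phase sits roughly 17x below it.
print(round(600 / match_pct))  # 17
```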

There are some takeaways from this experiment:

  1. The match part is I/O bound, while the map part seems to be CPU bound. Splitting the benchmark into these two parts makes the match part slow, I think. We can probably get much more speed by running the whole pipeline, mixing minimap2 jobs with COBS jobs, so that we don't overload I/O;

  2. This is the COBS match output sorted by date:

    -r--r--r-- 1 leandro leandro  55K Jul 22 21:02 dustbin__10____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  55K Jul 22 21:02 dustbin__15____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  38K Jul 22 21:02 dustbin__21____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  47K Jul 22 21:02 dustbin__08____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  44K Jul 22 21:09 dustbin__12____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  38K Jul 22 21:09 dustbin__04____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  43K Jul 22 21:09 dustbin__06____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  44K Jul 22 21:09 dustbin__11____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  55K Jul 22 21:16 dustbin__09____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  42K Jul 22 21:16 dustbin__16____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  47K Jul 22 21:16 dustbin__05____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  43K Jul 22 21:16 dustbin__02____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  44K Jul 22 21:23 dustbin__13____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  61K Jul 22 21:23 dustbin__18____ARGannot_r3.nonamb.xz
    -r--r--r-- 1 leandro leandro  40K Jul 22 21:23 dustbin__03____ARGannot_r3.nonamb.xz

    For some reason, the pipeline starts by decompressing all dustbins first, which are the largest xz files we have. I think it would have been better if the order were random, so we would have only one or two dustbins at a time, but IDK how Snakemake chooses jobs and I've noticed that starting with the dustbins seems to be systematic.

I will first try option 1 to see how it goes.
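If the dustbin-first ordering is indeed systematic, Snakemake's per-rule `priority` directive is one knob for influencing job order. A hypothetical Snakefile fragment to illustrate the idea (the rule name, wildcard, and paths are made up, not taken from the actual mof-search Snakefile):

```python
# Hypothetical Snakefile fragment: Snakemake schedules runnable jobs with
# higher `priority` first (default 0), so a dedicated low-priority rule
# for the large dustbin batches would stop them from all starting first.
rule decompress_dustbin:
    input:
        "asms/dustbin__{n}.tar.xz"
    output:
        temp("asms/dustbin__{n}.tar")
    priority: -10          # scheduled after default-priority batches
    shell:
        "xz --decompress --keep --stdout {input} > {output}"
```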

karel-brinda commented 2 years ago

Here are my results. Given the character of the data, I made the filtering part less strict.

Dataset

ARGannot_r3.fa

Config

thr: all
batches: "batches_full.txt"
max_decomp_jobs: 4
max_download_jobs: 20
cobs_thr: 1
cobs_kmer_thres: 0.70
minimap_preset: "sr"
minimap_thr: 1
minimap_extra_params: "--eqx"
keep_cobs_ind: False
download_dir: "."
benchmark: False

Matching

Note that several (~7) processes stalled and it's not

$ cat match.txt
Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_decomp_jobs=4 max_download_jobs=20 --config benchmark=True -- match
real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs
4007.72 4082.04 7280.55 283% 2501552 0 25
leoisl commented 2 years ago

Option 1 from https://github.com/karel-brinda/mof-search/issues/52#issuecomment-1193107770 is not viable (see https://github.com/karel-brinda/mof-search/issues/89). Trying option 2.

leoisl commented 2 years ago

Best performance so far for ARGannot on my machine

Code

Code used was the one in PR https://github.com/karel-brinda/mof-search/pull/95

Dataset

ARGannot_r3.fa

Config

thr: all
batches: "batches_full.txt"
max_download_jobs: 8
cobs_thr: 1
cobs_kmer_thres: 0.33
minimap_preset: "sr"
minimap_thr: 1
minimap_extra_params: "--eqx"
max_decomp_MB: 2000
keep_cobs_ind: False
download_dir: "."
benchmark: False

Match

Total time: 2.75h

Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_decomp_MB=2000 max_download_jobs=8 --config benchmark=True -- match
real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs
9922.50 1174.58 6463.51 76% 7045748 458257680 2067547528

Map

Total time: 3.35h

(base) leandro@leandro-XPS-13-9310-2-in-1:~/git/mof-search$ cat logs/benchmarks/map.txt 
Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_decomp_MB=2000 max_download_jobs=8 --config benchmark=True -- map
real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs
12046.00 8646.27 82699.43 758% 208156 63913744 5574808

Total

6.1h
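The totals follow directly from the real(s) columns of the two benchmark rows; a quick arithmetic check:

```python
# Wall-clock hours from the real(s) columns of the match and map benchmarks.
match_h = 9922.50 / 3600   # ~2.76 h
map_h = 12046.00 / 3600    # ~3.35 h
print(round(match_h + map_h, 1))  # 6.1
```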

Improvements to be made

At this point, all I can think of is to use a Linux machine without disk encryption

karel-brinda commented 2 years ago

This is a great improvement, congrats!

At this point, all I can think of is to use a Linux machine without disk encryption

Yes, COBS ideally requires an SSD disk w/o encryption.

Btw., for the resistance genes, we can use cobs_kmer_thres: 0.7, which should be substantially faster.
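The effect of cobs_kmer_thres can be sketched with some back-of-the-envelope arithmetic. Assuming COBS reports a document when at least that fraction of the query's k-mers is present, and assuming k = 31 (the real index parameters may differ), the cutoff for a hypothetical 1 kb resistance gene would be:

```python
import math

# Minimum number of matching k-mers for a query of length query_len at
# threshold thres, under the assumption that the threshold is a fraction
# of the query's k-mers.
def min_kmer_hits(query_len, thres, k=31):
    n_kmers = query_len - k + 1
    return math.ceil(thres * n_kmers)

print(min_kmer_hits(1000, 0.70))  # 679 of 970 k-mers must match
print(min_kmer_hits(1000, 0.33))  # 321 of 970
```

A higher threshold prunes far more candidate genomes before alignment, which is why it should be substantially faster.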

karel-brinda commented 2 years ago

(This is a copy of https://github.com/karel-brinda/mof-search/issues/83#issuecomment-1199985952)

Version: * 7574ca6 - (HEAD) Dynamically reducing the number of data we write to the pipe (2 days ago) <Leandro Ishi>

Query: ARGannot_r3.fa

Finished successfully, no stalled processes now.

Configuration:

thr: all
batches: "batches_full.txt"
max_download_jobs: 8
cobs_thr: 1
cobs_kmer_thres: 0.70
minimap_preset: "sr"
minimap_thr: 1
minimap_extra_params: "--eqx"
max_decomp_MB: 2000
keep_cobs_ind: False
download_dir: "."
benchmark: False

Benchmarks:

==> map.txt <==
Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_decomp_MB=2000 max_download_jobs=8 --config benchmark=True -- map
real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs
19857.93 21573.95 88627.35 554% 374220 0 0

==> match.txt <==
Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_decomp_MB=2000 max_download_jobs=8 --config benchmark=True -- match
real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs
6460.46 3640.45 6579.11 158% 2517436 0 38
leoisl commented 2 years ago

I got the machine detailed in https://github.com/karel-brinda/mof-search/issues/96#issuecomment-1196459483, but I need to configure it. I will be able to do it this Tuesday, and then I can restart the benchmarking.

karel-brinda commented 2 years ago

Cool! How did it go?

leoisl commented 2 years ago

Couldn't run it before the holidays; running now on ARGannot.

leoisl commented 2 years ago

Resurrecting this issue. Benchmarking on recent code:

Code

Current master (https://github.com/karel-brinda/mof-search/commit/8567c46d3ef8be07253b7fc20f8360b61feace5d)

Dataset

all_ebi_plasmids.fa

Config

batches: "data/batches_full.txt"
cobs_kmer_thres: 0.4
nb_best_hits: 100
minimap_preset: "asm20"
minimap_extra_params: "--eqx"
threads: all
max_ram_gb: 12
max_download_threads: 8
cobs_threads: 3
load_complete: True
max_io_heavy_threads: 6
minimap_threads: 1
prefer_pipe: True
keep_cobs_indexes: False
download_dir: "."

Final stats

queries 2826
cumul_length_bps        223662732
matched_queries 442
aligned_queries 441
aligned_segments        375970
distinct_genome_query_pairs     26955
target_genomes  3748
target_batches  25
nonalignments   8

Match

Total time: 7.25h

# Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=6 max_ram_mb=12288 -- match
real(s) sys(s)  user(s) percent_CPU     max_RAM(kb)     FS_inputs       FS_outputs      elapsed_time_alt(s)
26091.42        1795.23 124095.18       482%    10699284        273284592       2068807800      26091.445251

Map

Total time: 1.58h

# Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=6 max_ram_mb=12288 -- map
real(s) sys(s)  user(s) percent_CPU     max_RAM(kb)     FS_inputs       FS_outputs      elapsed_time_alt(s)
5704.15 2276.19 26218.29        499%    878216  59028240        6016088 5704.174802

Total

8.83h

leoisl commented 2 years ago

I think the previous run went fine. When I count the SAM flags of a very old run on the EBI plasmids, I get these SAM flag counts:

xzcat output/all_ebi_plasmids.sam_summary.xz | grep -v "^==" | awk '{print $2}' | sort | uniq -c
    304 
  97788 0
  96106 16
1381910 2048
1408813 2064
    492 4

I think the sum of flags 0 and 16 (~200k) is on the order of aligned_segments 375970. But we also had different parameters: here I am using the asm20 preset for minimap2, whereas before I was using sr. Anyway, would you know if these final stats look ok @karel-brinda? Or did we maybe introduce a bug?
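The flag values above can be decoded with the standard SAM FLAG bits (0x4 unmapped, 0x10 reverse strand, 0x800 supplementary); a small sketch over the counts copied from the output:

```python
# Decode the SAM FLAG values from the uniq -c output above.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SUPPLEMENTARY = 0x800

def classify(flag):
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    kind = "supplementary" if flag & FLAG_SUPPLEMENTARY else "primary"
    strand = "reverse" if flag & FLAG_REVERSE else "forward"
    return f"{kind}/{strand}"

counts = {0: 97788, 16: 96106, 2048: 1381910, 2064: 1408813, 4: 492}
for flag, n in sorted(counts.items()):
    print(f"{flag:>5} {classify(flag):<25} {n}")

# Primary alignments (flags 0 and 16):
print(counts[0] + counts[16])  # 193894
```

So the ~200k figure is the count of primary alignments; the supplementary records (2048/2064) dominate the file.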

karel-brinda commented 2 years ago

If this is the plasmid dataset, there must be something wrong with the results.

I'm getting 1,871 aligned queries and 8,980,408 (genome, query) pairs.

The total real time on my rather fast computer is 26.9 hours.

Update: The difference is I take 1000 best hits for alignment.

leoisl commented 2 years ago

Update: The difference is I take 1000 best hits for alignment.

Do you have stats for taking 100 best hits?