sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0

benchmark `pairwise` --> `cluster` #247

Open bluegenes opened 7 months ago

bluegenes commented 7 months ago

review comment: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/234#issuecomment-1966691225

could you add some minimal benchmarks (time/memory) for a standard-ish comparison, e.g. gtdb-reps, so that users know what to expect from both pairwise and cluster for a real-ish analysis? ISTR it's pretty fast against gtdb-reps.

If benchmark is slow, consider parallelizing reading. It was originally done in #234 but removed for simplicity.

pairwise files can be millions of lines long. Would it be faster to read them in parallel, store the results in an edges vector, and then add nodes/edges sequentially? Note that we would probably need to either (1) store all edges, including those that do not pass the threshold, or (2) after building the graph from edges, add any nodes from names_to_node that are not already in the graph, to preserve singletons.
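
A minimal Python sketch of option 2, for illustration only: collect every name plus the edges that pass the threshold, build clusters sequentially, and keep names without edges as singletons. It assumes clusters are connected components and that the pairwise CSV has `query_name` and `match_name` columns; both are assumptions here, and the parallel-read step is left out. The function names are hypothetical and this is not the plugin's implementation.

```python
import csv
import io


def read_edges(handle, similarity_column="average_containment_ani", threshold=0.95):
    """Return (names, edges): every name seen, plus the pairs passing threshold.

    Column names `query_name` / `match_name` are assumed, not confirmed.
    """
    names, edges = set(), []
    for row in csv.DictReader(handle):
        query, match = row["query_name"], row["match_name"]
        names.update((query, match))  # register nodes even if the edge is dropped
        if float(row[similarity_column]) >= threshold:
            edges.append((query, match))
    return names, edges


def cluster(names, edges):
    """Connected components via union-find; names without edges stay singletons."""
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:  # edges are added sequentially, after reading
        parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    # Largest clusters first, then alphabetical, for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


demo = io.StringIO(
    "query_name,match_name,average_containment_ani\n"
    "A,B,0.97\n"
    "B,C,0.96\n"
    "C,D,0.80\n"
)
names, edges = read_edges(demo)
clusters = cluster(names, edges)
print(clusters)  # [['A', 'B', 'C'], ['D']]
```

In this sketch a name can only become a singleton if it appears in at least one row of the input, which is the reason for registering nodes from below-threshold rows.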

bluegenes commented 7 months ago

🚀 5 seconds on gtdb-rs214-reps with average_containment_ani default threshold (0.95)

I used 16 threads but %CPU was 123% (which makes sense, since cluster is not actually parallelized)

generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv'
                       cluster counts in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv'
        Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv --similarity-column average_containment_ani --cluster-sizes gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv"
        User time (seconds): 4.03
        System time (seconds): 2.07
        Percent of CPU this job got: 123%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.95
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 109292
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 36690
        Voluntary context switches: 3028
        Involuntary context switches: 424
        Swaps: 0
        File system inputs: 0
        File system outputs: 3264
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
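
For reference, the %CPU figure is (user + system) / elapsed. A quick check with the numbers from the log above (the helper name is just for illustration):

```python
def percent_cpu(user_s, system_s, elapsed_s):
    """CPU utilization as reported by GNU time: (user + system) / wall clock."""
    return 100.0 * (user_s + system_s) / elapsed_s


# Values copied from the cluster run above: 4.03 s user, 2.07 s system, 4.95 s wall.
cluster_cpu = percent_cpu(4.03, 2.07, 4.95)
print(f"{cluster_cpu:.0f}%")  # 123%
```

A value just above 100% means roughly one busy core, regardless of the 16 threads requested.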

doesn't change much with a lowered threshold (`--threshold 0.8`)

generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv'
                       cluster counts in 'None'
        Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv --similarity-column average_containment_ani --threshold 0.8"
        User time (seconds): 3.76
        System time (seconds): 1.87
        Percent of CPU this job got: 125%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 126164
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 41457
        Voluntary context switches: 3204
        Involuntary context switches: 716
        Swaps: 0
        File system inputs: 0
        File system outputs: 1968
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
bluegenes commented 7 months ago

Running pairwise to build the cluster input file took much longer, of course: about 2 to 2.5 hours for gtdb-rs214-reps using 16 threads.

No ANI, no write-all:

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip -o gtdb-rs214-reps.k31.pairwise.csv"
        User time (seconds): 143454.56
        System time (seconds): 136.08
        Percent of CPU this job got: 1562%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:33:08
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4573808
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 17555
        Minor (reclaiming a frame) page faults: 79509579
        Voluntary context switches: 1134599
        Involuntary context switches: 1262188
        Swaps: 0
        File system inputs: 4486144
        File system outputs: 412944
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ANI, no write-all

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --ani -o gtdb-rs214-reps.k31.pairwise-ani.csv"
        User time (seconds): 143272.02
        System time (seconds): 80.51
        Percent of CPU this job got: 1562%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4573456
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 32007457
        Voluntary context switches: 1181205
        Involuntary context switches: 1298635
        Swaps: 0
        File system inputs: 0
        File system outputs: 528008
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ANI + write-all:

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
        User time (seconds): 107618.34
        System time (seconds): 245.74
        Percent of CPU this job got: 1551%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:55:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4575736
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 89
        Minor (reclaiming a frame) page faults: 63129113
        Voluntary context switches: 1118873
        Involuntary context switches: 1699826
        Swaps: 0
        File system inputs: 13792
        File system outputs: 547384
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
ctb commented 6 months ago

it seems a little weird that ANI + write-all took half an hour less wall time, right? But that could just be fluctuations on the computer running things.
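
To put numbers on the gap, a small sketch that converts the elapsed strings from the three 16-thread logs above into seconds. The helper name is hypothetical; the wall-clock and user times are copied from the logs. Note that user CPU time is also lower for the ANI + write-all run, not only wall time.

```python
def elapsed_seconds(text):
    """Convert GNU time's 'h:mm:ss' or 'm:ss' elapsed string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# (wall clock, user seconds) copied from the three 16-thread pairwise runs.
runs = {
    "no ANI": ("2:33:08", 143454.56),
    "ANI": ("2:32:51", 143272.02),
    "ANI + write-all": ("1:55:51", 107618.34),
}

for label, (wall, user) in runs.items():
    print(f"{label}: {elapsed_seconds(wall) / 60:.1f} min wall, {user / 3600:.1f} h user CPU")

# Wall-clock difference between the two ANI runs, in minutes.
gap_min = (elapsed_seconds("2:32:51") - elapsed_seconds("1:55:51")) / 60
print(f"gap: {gap_min:.0f} min")  # gap: 37 min
```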

bluegenes commented 6 months ago

benchmarking pairwise using GTDB-rs214 reps on 64 threads for comparison with multisearch (#89)

85205 x 85205 pairwise comparisons (3.6 billion non-self, non-redundant comparisons) in 44m with 64 threads (and 4.56 GB RAM).
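
The comparison count is n(n-1)/2 for n = 85205, which matches the "Processed 3629903410 comparisons" line in the log. A quick check, with the elapsed time taken from the log below:

```python
n = 85205

# Non-self, non-redundant pairs: each unordered pair is compared once.
pairs = n * (n - 1) // 2
print(pairs)  # 3629903410

# Wall clock for the 64-thread run was 44:20.68.
elapsed_s = 44 * 60 + 20.68
rate = pairs / elapsed_s
print(f"{rate / 1e6:.2f} million comparisons per second")  # 1.36 million comparisons per second
```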

DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
        Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
        User time (seconds): 149275.64
        System time (seconds): 54.49
        Percent of CPU this job got: 5612%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 44:20.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4566188
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 18246
        Minor (reclaiming a frame) page faults: 6700145
        Voluntary context switches: 1193450
        Involuntary context switches: 1579877
        Swaps: 0
        File system inputs: 4610752
        File system outputs: 547336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0