Open bluegenes opened 7 months ago
🚀 cluster takes about 5 seconds on gtdb-rs214-reps with average_containment_ani and the default threshold (0.95).
I used 16 threads, but %CPU was 123% (which makes sense, since cluster is not actually parallelized).
generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv'
cluster counts in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv'
Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters.csv --similarity-column average_containment_ani --cluster-sizes gtdb-rs214-reps.k31.pairwise-ani-all.clusters.sizes.csv"
User time (seconds): 4.03
System time (seconds): 2.07
Percent of CPU this job got: 123%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.95
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 109292
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 36690
Voluntary context switches: 3028
Involuntary context switches: 424
Swaps: 0
File system inputs: 0
File system outputs: 3264
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
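As a sanity check on the 123% figure, the "Percent of CPU" line can be reproduced from the user, system, and elapsed times in the log above. This is a minimal sketch; the `cpu_percent` helper is hypothetical, and truncating rather than rounding is an assumption based on the numbers reported in this thread.

```python
def cpu_percent(user_s, system_s, elapsed_s):
    """CPU seconds consumed per wall-clock second, as a whole percentage.

    Truncation (not rounding) is an assumption that matches the logs here.
    """
    return int(100.0 * (user_s + system_s) / elapsed_s)

# Figures copied from the default-threshold cluster run above.
print(cpu_percent(4.03, 2.07, 4.95))  # 123
```

A value only slightly above 100% means the run spent most of its time on a single core, which fits cluster not being parallelized.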
Runtime doesn't change much with a lowered threshold (0.8):
generating clusters for comparisons in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv' using 16 threads
...clustering is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv'
cluster counts in 'None'
Command being timed: "sourmash scripts cluster gtdb-rs214-reps.k31.pairwise-ani-all.csv -o gtdb-rs214-reps.k31.pairwise-ani-all.clusters0.8.csv --similarity-column average_containment_ani --threshold 0.8"
User time (seconds): 3.76
System time (seconds): 1.87
Percent of CPU this job got: 125%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 126164
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 41457
Voluntary context switches: 3204
Involuntary context switches: 716
Swaps: 0
File system inputs: 0
File system outputs: 1968
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
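To illustrate what the threshold does, here is a rough sketch of threshold-based clustering over a pairwise CSV. It assumes clusters are connected components of a graph whose edges are pairs at or above the threshold; the thread does not confirm that this is how cluster works. The `cluster_pairs` function, the `query_name` and `match_name` column names, the `>=` comparison, and the toy data are all assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

def cluster_pairs(rows, similarity_column="average_containment_ani", threshold=0.95):
    """Group names into connected components using a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row in rows:
        # Register both names so that singletons also appear in the output.
        root_a, root_b = find(row["query_name"]), find(row["match_name"])
        if float(row[similarity_column]) >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for name in list(parent):
        clusters[find(name)].add(name)
    # Largest clusters first, then alphabetical, for stable output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))

# Toy input (made up, not GTDB data).
toy_csv = """query_name,match_name,average_containment_ani
A,B,0.97
B,C,0.96
C,D,0.85
D,E,0.99
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
print(cluster_pairs(rows, threshold=0.95))  # [['A', 'B', 'C'], ['D', 'E']]
print(cluster_pairs(rows, threshold=0.8))   # [['A', 'B', 'C', 'D', 'E']]
```

Under this assumption, lowering the threshold only adds edges, so the work per input row stays the same, which would be consistent with the nearly identical runtimes above.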
Running pairwise to build the cluster input file took much longer, of course: ~2 hours for gtdb-rs214-reps using 16 threads (1:56 to 2:33 wall time across the three runs below).
No ANI, no write-all:
DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise.csv'
Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip -o gtdb-rs214-reps.k31.pairwise.csv"
User time (seconds): 143454.56
System time (seconds): 136.08
Percent of CPU this job got: 1562%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:33:08
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4573808
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 17555
Minor (reclaiming a frame) page faults: 79509579
Voluntary context switches: 1134599
Involuntary context switches: 1262188
Swaps: 0
File system inputs: 4486144
File system outputs: 412944
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
ANI, no write-all:
DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani.csv'
Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --ani -o gtdb-rs214-reps.k31.pairwise-ani.csv"
User time (seconds): 143272.02
System time (seconds): 80.51
Percent of CPU this job got: 1562%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:51
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4573456
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 32007457
Voluntary context switches: 1181205
Involuntary context switches: 1298635
Swaps: 0
File system inputs: 0
File system outputs: 528008
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
ANI + write-all:
DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
User time (seconds): 107618.34
System time (seconds): 245.74
Percent of CPU this job got: 1551%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:55:51
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4575736
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 89
Minor (reclaiming a frame) page faults: 63129113
Voluntary context switches: 1118873
Involuntary context switches: 1699826
Swaps: 0
File system inputs: 13792
File system outputs: 547384
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
It seems a little weird that ANI + write-all took about half an hour less wall time (1:55:51 vs. 2:32:51), right? User time was lower too (107618 s vs. 143272 s). But that could just be fluctuations on the computer running things.
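To put numbers on the gap, the elapsed strings from the three 16-thread runs can be converted to seconds and compared. This is only a sketch: the `elapsed_to_seconds` helper is hypothetical, and the values are copied from the logs above.

```python
def elapsed_to_seconds(text):
    """Convert an 'h:mm:ss' or 'm:ss(.ff)' elapsed string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# (elapsed wall clock, user time in seconds) copied from the logs above.
runs = {
    "no ANI, no write-all": ("2:33:08", 143454.56),
    "ANI, no write-all": ("2:32:51", 143272.02),
    "ANI + write-all": ("1:55:51", 107618.34),
}

baseline = elapsed_to_seconds(runs["ANI, no write-all"][0])
for label, (wall, user_s) in runs.items():
    wall_s = elapsed_to_seconds(wall)
    delta_min = (wall_s - baseline) / 60
    print(f"{label}: wall {wall_s:.0f} s ({delta_min:+.1f} min vs ANI only), user {user_s:.0f} s")
```

The ANI + write-all run comes out 37 minutes shorter than the ANI-only run. A repeat of each configuration would help separate machine noise from a real difference.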
Benchmarking pairwise using GTDB-rs214 reps on 64 threads, for comparison with multisearch (#89):
85205 x 85205 pairwise comparisons (3.6 billion non-self, non-redundant comparisons) in 44m with 64 threads (and 4.56 GB RAM).
DONE. Processed 3629903410 comparisons
...pairwise is done! results in 'gtdb-rs214-reps.k31.pairwise-ani-all.csv'
Command being timed: "sourmash scripts pairwise gtdb-rs214-reps.k31.zip --write-all --ani -o gtdb-rs214-reps.k31.pairwise-ani-all.csv"
User time (seconds): 149275.64
System time (seconds): 54.49
Percent of CPU this job got: 5612%
Elapsed (wall clock) time (h:mm:ss or m:ss): 44:20.68
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4566188
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 18246
Minor (reclaiming a frame) page faults: 6700145
Voluntary context switches: 1193450
Involuntary context switches: 1579877
Swaps: 0
File system inputs: 4610752
File system outputs: 547336
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
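The comparison count in the log is what n(n-1)/2 gives for 85205 sketches, i.e. every unordered pair with self-comparisons excluded. The sketch below checks that, and also derives throughput and the 16 to 64 thread speedup from the elapsed times reported above (variable names are illustrative).

```python
n_sketches = 85205

# Non-self, non-redundant pairs: n choose 2.
comparisons = n_sketches * (n_sketches - 1) // 2
print(comparisons)  # 3629903410

# Elapsed times for the same --write-all --ani command, from the logs above.
wall_64_threads_s = 44 * 60 + 20.68          # 44:20.68
wall_16_threads_s = 1 * 3600 + 55 * 60 + 51  # 1:55:51

print(f"{comparisons / wall_64_threads_s:,.0f} comparisons per second on 64 threads")
print(f"{wall_16_threads_s / wall_64_threads_s:.2f}x faster with 4x the threads")
```

A speedup well below 4x would be consistent with part of the run, such as reading the input, not scaling with the thread count.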
Review comment: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/234#issuecomment-1966691225
If the benchmark is slow, consider parallelizing reading. This was originally done in #234 but removed for simplicity.