sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0

`manysketch` benchmarking #122

Open bluegenes opened 1 year ago

bluegenes commented 1 year ago

Using some metagenomes (mgx) that Colton was using for sketchall testing in https://github.com/sourmash-bio/sourmash/issues/2748:

data from /home/baumlerc/download-seq/download-seq/fastq/

total 23G
4.0G SRR12480103.fastq
1.4G SRR13122219.fastq
6.3G SRR8849208.fastq
5.7G SRR8849216.fastq
5.5G SRR8849289.fastq

ran in /home/ntpierce/2023-bench-manysketch:

/usr/bin/time -v sourmash scripts manysketch mgx.fromfile5.csv \
                                      -p dna,k=21,k=31,k=51,scaled=1000,abund \
                                      -c 6 -o mgx.fromfile5.zip

output:

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> pyo3_branchwater 0.8.0; cite Irber et al., doi: 10.1101/2022.11.02.514947

params: ['dna,k=21,k=31,k=51,scaled=1000,abund']
sketching all files in 'mgx.fromfile.csv' using 6 threads
Loaded 5 rows in total (5 genome and 0 protein files)
Starting 1th fasta file (20% of total)
Starting 2th fasta file (40% of total)
Starting 3th fasta file (60% of total)
Starting 4th fasta file (80% of total)
Writing manifest
DONE. Processed 5 fasta files
...manysketch is done! results in 'mgx.fromfile5.zip'

/usr/bin/time -v results: about 41.5 min

Command being timed: "sourmash scripts manysketch mgx.fromfile5.csv -p dna,k=21,k=31,k=51,scaled=1000,abund -c 6 -o mgx.fromfile5.zip"
User time (seconds): 4728.30
System time (seconds): 103.45
Percent of CPU this job got: 194%
Elapsed (wall clock) time (h:mm:ss or m:ss): 41:28.56
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 87214080
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 115
Minor (reclaiming a frame) page faults: 60219987
Voluntary context switches: 73402
Involuntary context switches: 288202
Swaps: 0
File system inputs: 77270248
File system outputs: 380376
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
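
The headline numbers can be rechecked from the raw fields in this report. Below is a minimal Python sketch that uses only values copied from the `/usr/bin/time -v` output above. The helper names `parse_elapsed` and `cpu_percent` are hypothetical, and the memory figure treats 10^6 of the reported kbytes as 1 GB, which is how the thread rounds it.

```python
def parse_elapsed(text):
    """Convert an 'h:mm:ss' or 'm:ss' elapsed string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_percent(user_s, system_s, elapsed_s):
    """CPU utilization as (user + system) time divided by wall-clock time."""
    return 100.0 * (user_s + system_s) / elapsed_s


# Values copied from the report above.
elapsed = parse_elapsed("41:28.56")
print(f"elapsed: {elapsed / 60:.1f} min")
print(f"CPU: {cpu_percent(4728.30, 103.45, elapsed):.0f}%")
print(f"peak RSS: {87214080 / 1e6:.1f} GB")
```

With `-c 6`, full use of all six threads would approach 600%, so the reported 194% is well below that ceiling.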

sourmash sig summarize:

** loading from 'mgx.fromfile5.zip'
path filetype: ZipFileLinearIndex
location: /home/ntpierce/2023-bench-manysketch/mgx.fromfile5.zip
is database? yes
has manifest? yes
num signatures: 15
** examining manifest...
total hashes: 25830111
summary of sketches:
   5 sketches with DNA, k=31, scaled=1000, abund      8337030 total hashes
   5 sketches with DNA, k=51, scaled=1000, abund      10733459 total hashes
   5 sketches with DNA, k=21, scaled=1000, abund      6759622 total hashes

ctb commented 1 year ago

87 GB seems like an awful lot, though! Are the FASTQ files being read into memory completely or something?

bluegenes commented 1 year ago

> 87 GB seems like an awful lot, though! Are the FASTQ files being read into memory completely or something?

...yep. I knew there was something I forgot to fix in that PR. Fix in #123
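
To illustrate the difference being discussed, here is a minimal Python sketch contrasting whole-file loading with record-by-record streaming. It is not the plugin's code (the actual change is in #123). `read_all` and `stream_records` are hypothetical names, and the parser assumes simple four-line FASTQ records.

```python
import io


def read_all(handle):
    """Load every record at once: memory use grows with file size."""
    lines = handle.read().splitlines()
    return [(lines[i][1:], lines[i + 1]) for i in range(0, len(lines) - 3, 4)]


def stream_records(handle):
    """Yield one (name, sequence) pair at a time: only one record is held."""
    while True:
        header = handle.readline()
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header.rstrip("\n")[1:], sequence


# Tiny in-memory stand-in for a FASTQ file.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n"
for name, seq in stream_records(io.StringIO(fastq)):
    print(name, len(seq))
```

Both functions return the same records. The streaming version never holds more than one record, which is the property that keeps peak memory independent of input size.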

bluegenes commented 1 year ago

Using #123: 15 min, 1.4 GB

...manysketch is done! results in 'mgx.fromfile5.zip'
        Command being timed: "sourmash scripts manysketch mgx.fromfile5.csv -p dna,k=21,k=31,k=51,scaled=1000,abund -c 6 -o mgx.fromfile5.zip"
        User time (seconds): 3168.33
        System time (seconds): 63.27
        Percent of CPU this job got: 364%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 14:46.04
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1373784
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 15
        Minor (reclaiming a frame) page faults: 3307897
        Voluntary context switches: 4935
        Involuntary context switches: 48727
        Swaps: 0
        File system inputs: 0
        File system outputs: 380344
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
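
The two runs can be compared directly from the elapsed time and maximum resident set size in the two `/usr/bin/time -v` reports in this thread. A small Python sketch follows; the dictionary keys are chosen here for illustration only.

```python
# Elapsed time and peak RSS copied from the two reports in this thread.
before = {"elapsed_s": 41 * 60 + 28.56, "max_rss_kb": 87214080}
after = {"elapsed_s": 14 * 60 + 46.04, "max_rss_kb": 1373784}

speedup = before["elapsed_s"] / after["elapsed_s"]
memory_ratio = before["max_rss_kb"] / after["max_rss_kb"]

print(f"wall-clock speedup: {speedup:.1f}x")
print(f"peak memory reduction: {memory_ratio:.0f}x")
```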

sourmash sig summarize mgx.fromfile5.zip

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'mgx.fromfile5.zip'
path filetype: ZipFileLinearIndex
location: /home/ntpierce/2023-bench-manysketch/mgx.fromfile5.zip
is database? yes
has manifest? yes
num signatures: 15
** examining manifest...
total hashes: 25830111
summary of sketches:
   5 sketches with DNA, k=31, scaled=1000, abund      8337030 total hashes
   5 sketches with DNA, k=51, scaled=1000, abund      10733459 total hashes
   5 sketches with DNA, k=21, scaled=1000, abund      6759622 total hashes

ctb commented 1 year ago

see benchmarks for all of GTDB rs217 here: https://github.com/sourmash-bio/pyo3_branchwater/pull/96#issuecomment-1709190601

tl;dr 40 minutes, 64 threads, 2.7 GB of RAM.