sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Slow run time of `sourmash sig subtract` when re-implemented in python API #2248

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

I recently wrote a snakefile to subtract one signature from another. I noticed that my implementation of sourmash sig subtract, which I wrote using the python API but basically copied from the sourmash source, was super slow. Signature loading was slow, but subtracting was the slowest step (discovered with liberal print statements). I then switched to running sourmash sig subtract via the CLI, but wrapped in shell(), so still executed within python. That was slow in the same ways as above. Lastly, I ran it directly on the command line, and then it was fast again!

I looked at memory while these were running, and it consistently stayed around 3 GB (on a 64 GB machine). I also tried giving each rule 64 GB of memory in case snakemake was silently limiting resources, but there was no change in perceived runtime.

I realize that the two approaches that use just python are written so that ALL signatures in the metadata file would be run before the rule finished, while the one that uses the CLI operates on one sample (two sigs) at a time. However, I was monitoring this while running on just the one sample provided here, and the first two implementations below were WAY slower than just using the CLI. (This isn't really important for the reprex, as I only included one sample in the metadata table.)

I have a working solution for what I want to do, but I wanted to record what I felt like was weird behavior that I couldn't figure out an explanation for.
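For reference, the core of what sig subtract does here can be modeled with plain Python sets and dicts (no sourmash required; the hash values below are made up): drop abundances, subtract the mgx hashes, then restore the mtx abundances on whatever survives.

```python
def subtract_with_abund(mtx: dict, mgx: dict) -> dict:
    """Model of subtract + abundance restore on hash->abundance dicts."""
    remaining = set(mtx) - set(mgx)        # flatten both sides and subtract
    return {h: mtx[h] for h in remaining}  # re-establish mtx abundances

# Hypothetical hashes standing in for the real sketches.
mtx = {101: 4, 202: 1, 303: 9}
mgx = {202: 7, 404: 2}
print(subtract_with_abund(mtx, mgx))  # {101: 4, 303: 9} (key order may vary)
```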

time sourmash sig subtract -k 31 -o C_2_1.sig outputs/sourmash_sketch/SRR6032600.sig  outputs/sourmash_sketch/SRR6032602.sig

== This is sourmash version 4.4.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Cannot use subtract on signatures with abundance tracking, sorry!

real    0m14.464s
user    0m13.376s
sys 0m2.808s

I'm including a snakefile here that reproduces the behavior I observed. The metadata file (attached) only runs the code for one sample (two SRA run accessions), so it will hopefully be ~quick. You'll need sourmash, pandas, sra-tools, and snakemake in your environment for the snakefile to run.

metadata.txt
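A plausible layout for metadata.txt, inferred from the columns the snakefile reads (sample_name, mtx_run_accession, mgx_run_accession) and the accessions and sample name used elsewhere in this issue; the actual attached file may differ:

```text
sample_name	mtx_run_accession	mgx_run_accession
C_2_1	SRR6032600	SRR6032602
```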

Conda environment file:

channels:
   - conda-forge
   - bioconda
   - defaults
dependencies:
   - sourmash-minimal=4.4.3
   - sra-tools=2.11.0
   - snakemake-minimal=7.12.1
   - pandas=1.4.3
Snakefile:

import pandas as pd
import sourmash
import sourmash.sourmash_args  # FileOutput lives in this submodule
import os

# create wildcard variables for workflow
metadata = pd.read_csv("metadata.txt", sep = "\t")            # read in metadata as a pandas dataframe
MTX = metadata['mtx_run_accession'].unique().tolist()         # make run accession in mtx col into list
MGX = metadata['mgx_run_accession'].unique().tolist()         # make run accession in mgx col into list
RUN_ACCESSIONS = MGX + MTX                                    # combine mtx and mgx into one list with all run accessions
SAMPLES = metadata['sample_name'].unique().tolist()           # make a list of sample names
KSIZES = [31]                                                 # create a list of k-mer sizes for the workflow

MTX_MINUS_MGX = [x + '-minus-' + y for x, y in zip(MTX, MGX)]
print(MTX_MINUS_MGX)

rule all:
    input:
        expand("outputs/sourmash_sketch_subtract/{sample}_k{ksize}.sig", sample = SAMPLES, ksize = KSIZES),
        expand("outputs/sourmash_sketch_subtract/{mtx_minus_mgx}-k{ksize}.sig", mtx_minus_mgx = MTX_MINUS_MGX, ksize = KSIZES)

rule sourmash_sketch:
    output: "outputs/sourmash_sketch/{run_accession}.sig"
    shell:'''
    fastq-dump --disable-multithreading --fasta 0 --skip-technical --readids --read-filter pass --dumpbase --split-spot --clip -Z {wildcards.run_accession} |
        sourmash sketch dna -p k=21,k=31,k=51,scaled=200,abund --name {wildcards.run_accession} -o {output} -
    '''

# This is super slow. It basically re-implements sourmash sig subtract, but removes some loops I didn't need
rule sig_subtract_in_python_api:
    input:
        metadata = 'metadata.txt',
        sigs = expand('outputs/sourmash_sketch/{run_accession}.sig', run_accession = RUN_ACCESSIONS)
    output:
        sigs = expand("outputs/sourmash_sketch_subtract/{sample}_k{{ksize}}.sig", sample = SAMPLES)
    benchmark: "benchmark_api_full.txt"
    run:
        ksize = wildcards.ksize
        # read in metadata dataframe to derive sample pairs
        metadata2 = pd.read_csv("metadata.txt", sep = "\t") # read in metadata as a pandas dataframe
        metadata2 = metadata2.reset_index()  # make sure indexes pair with number of rows
        for index, row in metadata2.iterrows():
            # grab the run accessions for a given paired metagenome and metatranscriptome
            mtx_run_accession = row['mtx_run_accession']
            mgx_run_accession = row['mgx_run_accession']
            # using the run accession, create paths to the signatures
            mtx_sigfile = os.path.join('outputs/sourmash_sketch', mtx_run_accession + '.sig')
            mgx_sigfile = os.path.join('outputs/sourmash_sketch', mgx_run_accession + '.sig')

            # read in mtx signature and grab both the minhash object and the hashes in that object
            mtx_sigobj = sourmash.load_one_signature(mtx_sigfile, ksize=ksize, select_moltype='DNA')
            mtx_mh = mtx_sigobj.minhash

            # turn the mtx hashes into a set
            mtx_subtract_mins = set(mtx_mh.hashes)

            # read in mgx signature
            mgx_sigobj = sourmash.load_one_signature(mgx_sigfile, ksize=ksize, select_moltype='DNA')

            # do the subtraction
            mtx_subtract_mins -= set(mgx_sigobj.minhash.hashes)

            # make a new signature that only contains the mtx hashes that were not in the mgx
            mtx_subtract_mh = mtx_sigobj.minhash.copy_and_clear().flatten()
            mtx_subtract_mh.add_many(mtx_subtract_mins)

            # re-establish hash abundances from mtx sig
            # re-read in mtx sig just in case abundances were removed in place
            abund_sig = sourmash.load_one_signature(mtx_sigfile, ksize=ksize, select_moltype='DNA')
            mtx_subtract_mh = mtx_subtract_mh.inflate(abund_sig.minhash)

            # create a new sig object that can be written to file
            mtx_subtract_sigobj = sourmash.SourmashSignature(mtx_subtract_mh)

            # create a name that reflects the signature contents
            name = row['sample_name']
            mtx_subtract_sigobj.name = name

            # create output sig file name
            sig_filename = os.path.join('outputs/sourmash_sketch_subtract', name + "_k" + ksize + ".sig")
            # write sig to file
            with sourmash.sourmash_args.FileOutput(sig_filename, 'wt') as fp:
                sourmash.save_signatures([mtx_subtract_sigobj], fp=fp)

# This is also slow -- commented out because it produces the same output as the above rule
#rule sig_subtract_in_python_by_shell:
#    input:
#    input:
#        metadata = 'metadata.txt',
#        sigs = expand('outputs/sourmash_sketch/{run_accession}.sig', run_accession = RUN_ACCESSIONS)
#    output:
#        sigs = expand("outputs/sourmash_sketch_subtract/{sample}_k{{ksize}}.sig", sample = SAMPLES)
#    benchmark: "benchmark_api_shell.txt"
#    run:
#        ksize = wildcards.ksize
#        # read in metadata dataframe to derive sample pairs
#        metadata2 = pd.read_csv("metadata.txt", sep = "\t") # read in metadata as a pandas dataframe
#        metadata2 = metadata2.reset_index()  # make sure indexes pair with number of rows
#        for index, row in metadata2.iterrows():
#            # grab the run accessions for a given paired metagenome and metatranscriptome
#            mtx_run_accession = row['mtx_run_accession']
#            mgx_run_accession = row['mgx_run_accession']
#            # using the run accession, create paths to the signatures
#            mtx_sigfile = os.path.join('outputs/sourmash_sketch', mtx_run_accession + '.sig')
#            mgx_sigfile = os.path.join('outputs/sourmash_sketch', mgx_run_accession + '.sig')
#            # create a name that reflects the signature contents
#            name = row['sample_name']
#            out_filename = os.path.join('outputs/sourmash_sketch_subtract/' + name + "_k" + ksize + ".sig")
#            print("subtracting!")
#            shell("sourmash sig subtract -k {ksize} -o {out_filename} -A {mtx_sigfile} {mtx_sigfile} {mgx_sigfile}")

# This is not slow
rule sig_subtract_by_cli:
    input:
        mtx_sig = 'outputs/sourmash_sketch/{mtx_run_accession}.sig',
        mgx_sig = 'outputs/sourmash_sketch/{mgx_run_accession}.sig',
    output: "outputs/sourmash_sketch_subtract/{mtx_run_accession}-minus-{mgx_run_accession}-k{ksize}.sig"
    benchmark: "benchmark_cli_full.txt"
    shell:'''
    sourmash sig subtract -k {wildcards.ksize} -o {output} -A {input.mtx_sig} {input.mtx_sig} {input.mgx_sig}
    '''
ctb commented 2 years ago

Put files here, https://github.com/ctb/2022-sourmash-slow-sig-subtract, and fixed them a bit so they'd run 😜.

ctb commented 2 years ago

I can't replicate with downsampled sigs (to scaled=1000, k=31 only) - I see

% /usr/bin/time snakemake -j 1 outputs/sourmash_sketch_subtract/C_2_1_k31.sig
...
133.06user 4.60system 2:24.23elapsed 95%CPU

for the Python API version, and

% /usr/bin/time snakemake -j 1 outputs/sourmash_sketch_subtract/SRR6032600-minus-SRR6032602-k31.sig
...
131.46user 5.15system 2:26.10elapsed 93%CPU

so, basically identical.

I will note that in the Python version, you don't need to reload mtx_sigobj as abund_sig. This was true even before we implemented FrozenMinHash in https://github.com/sourmash-bio/sourmash/pull/1508 and FrozenSourmashSignature in https://github.com/sourmash-bio/sourmash/pull/1610, but now any attempt to in-place modify a signature or minhash loaded from a file will barf on you! So this is a potential speed up.
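To see why the re-load is unnecessary, here is a plain-dict model of the inflate step (hypothetical hash values, not the sourmash API itself): restoring abundances only reads from the abundance source, so the minhash you already loaded still has its abundances intact.

```python
# 'mtx_abund' stands in for mtx_sigobj.minhash (hash -> abundance);
# 'mgx' for the metagenome hashes. Values are made up for illustration.
mtx_abund = {11: 5, 22: 3, 33: 7}
mgx = {22: 2, 44: 1}

remaining = set(mtx_abund) - set(mgx)            # flatten + subtract
result = {h: mtx_abund[h] for h in remaining}    # "inflate": reads, never writes

assert mtx_abund == {11: 5, 22: 3, 33: 7}        # original abundances untouched
print(sorted(result.items()))                    # [(11, 5), (33, 7)]
```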

I will now re-run with the scaled=200 sketches and see what happens.

ctb commented 2 years ago

tl;dr: no substantial difference in time or maxresident, but a massive difference in inputs (!?), which is file system inputs. WTH?

From man time on farm: "Number of file system inputs by the process."

and a lot more major page faults (436 vs 4).

huh.

python

% /usr/bin/time snakemake -j 1 outputs/sourmash_sketch_subtract/C_2_1_k31.sig
...
5973.95user 50.38system 1:40:46elapsed 99%CPU (0avgtext+0avgdata 10888868maxresident)k
8277424inputs+235944outputs (436major+23010636minor)pagefaults 0swaps

shell

% /usr/bin/time snakemake -j 1 outputs/sourmash_sketch_subtract/SRR6032600-minus-SRR6032602-k31.sig
...
5871.86user 49.32system 1:38:48elapsed 99%CPU (0avgtext+0avgdata 10843796maxresident)k
1280inputs+235936outputs (4major+23577833minor)pagefaults 0swaps
ctb commented 2 years ago

I can replicate the input-count difference with a much smaller query, which makes me suspect that time is only counting inputs in the primary process, i.e. snakemake: when the Python code runs within snakemake we see its input count, but not in the shell situation.

I'll think about it some more. But for now I cannot reproduce the performance delta itself.