sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

workflow to convert (many) .sig/.sig.gz files to .sig.zip files + mf.csv files #3349

Open ctb opened 1 month ago

ctb commented 1 month ago

on farm at /home/ctbrown/scratch/2022-branchwater-benchmarking/wort-list-d.d/Snakefile -

# convert a bunch of .sig files into .sig.zip files and also produce .mf.csv files.

import os

FILELIST = '../data/wort-list-d.txt'

# one signature path per line; blank lines are skipped
with open(FILELIST) as fp:
    siglist = [ x.strip() for x in fp if x.strip() ]
print(f"loaded '{len(siglist)}' files")

#print('selecting 10...')
#siglist = siglist[:10]

ACCS = [ os.path.basename(x).split('.')[0] for x in siglist ]

rule all:
    input:
        expand('{acc}.sig.zip', acc=ACCS),
        expand('{acc}.mf.csv', acc=ACCS)

rule make_sig_zip:
    output: "{acc}.sig.zip"
    shell: """
        sourmash sig cat /group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs/{wildcards.acc}.sig -o {output}
    """

rule make_mf_csv:
    input: "{acc}.sig.zip",
    output: "{acc}.mf.csv",
    shell: """
        sourmash sig collect {input} -o {output} -F csv --abspath
    """
ctb commented 1 month ago

to produce a single standalone manifest from the per-file mf.csv files (and probably also directly from the zip files), run

sourmash sig collect -F csv *.mf.csv -o combined.mf.csv --abspath
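
A quick sanity check on combined.mf.csv is to count its rows per k-mer size and molecule type. The sketch below uses only the Python standard library and assumes the usual sourmash CSV manifest layout: a leading comment line starting with '#', followed by a header row that includes ksize and moltype columns. summarize_manifest is a hypothetical helper name.

```python
import csv
from collections import Counter


def summarize_manifest(lines):
    """Count manifest rows per (ksize, moltype).

    `lines` is any iterable of text lines, e.g. an open file.
    Assumes '#'-prefixed lines are comments and that the header row
    has 'ksize' and 'moltype' columns (assumed manifest layout).
    """
    reader = csv.DictReader(line for line in lines if not line.startswith('#'))
    counts = Counter()
    for row in reader:
        counts[(row['ksize'], row['moltype'])] += 1
    return counts
```

For example, `with open('combined.mf.csv', newline='') as fp: print(summarize_manifest(fp))` prints one count per (ksize, moltype) pair, which should add up to the number of sketches across all input files.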