sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

gnu parallel commands for sourmash operations #2761

Open mr-eyes opened 1 year ago

mr-eyes commented 1 year ago

Rename every sourmash signature in a directory to its file name (minus the .sig extension), running the renames in parallel.

ls *.sig | parallel -j 64 'basename={}; basename_without_sig=${basename%.sig}; sourmash sig rename {} "$basename_without_sig" -o {}'
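
The same rename can probably be written with GNU parallel's built-in replacement strings instead of shell variables. The sketch below is untested against real signatures: `sig_name` and `rename_all` are hypothetical helpers (not sourmash commands), and the default of 4 jobs is an arbitrary assumption. `{/.}` is parallel's replacement string for the input with its directory and last extension removed, which is what `sig_name` reproduces in plain shell.

```shell
# sig_name is a hypothetical helper that mirrors GNU parallel's {/.}
# replacement string: it drops the directory and the .sig suffix.
sig_name() {
  basename "$1" .sig
}

# rename_all is a hypothetical wrapper (not a sourmash command): it renames
# every .sig file in a directory to its file name, a few jobs at a time.
# Usage: rename_all DIR [JOBS]   (JOBS defaults to 4, an arbitrary choice)
rename_all() {
  parallel -j "${2:-4}" 'sourmash sig rename {} {/.} -o {}' ::: "$1"/*.sig
}

# Check what name a given file would receive.
sig_name sigs/sample1.sig   # prints: sample1
```

Adding `--dry-run` to the `parallel` call prints the generated commands without running them, which is a cheap way to check the names before overwriting any signatures.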
mr-eyes commented 1 year ago

Parallel filtering of sourmash sigs by abundance

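FROM_DIR=sigs
export TO_DIR=sigs_abund2   # exported so the shells started by parallel can see it
mkdir -p "${TO_DIR}"        # the output directory must exist before filtering
ls ${FROM_DIR}/*.sig | parallel -j 16 'sig={}; newsig=$(basename "$sig" .sig); sourmash sig filter -k 51 --min-abundance 2 "$sig" -o "${TO_DIR}/${newsig}.sig"'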
mr-eyes commented 1 year ago

Parallel downsampling and abundance filtering of sourmash signatures (piping one sourmash command into the next)

ls sigs/*.sig | parallel -j 100 'sig={}; newsig=$(basename "$sig" .sig); sourmash signature downsample -q "$sig" --scaled 100000 -k 51 -o - | sourmash signature filter --min-abundance 2 - -o "${newsig}.sig"'
ctb commented 1 year ago

Awesome, thanks! But... why such high -j values?? Surely with I/O they merely lead to more thrashing?

mr-eyes commented 1 year ago

> Awesome, thanks! But... why such high -j values?? Surely with I/O they merely lead to more thrashing?

I am just showing examples of how to run these commands. That said, when I once tried a high number of cores (128), it worked very well (for super small sigs).
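
For anyone unsure which `-j` value to start from, one conservative option is to cap the job count at the number of CPU cores and then time a few values on real data. The sketch below assumes a POSIX shell with `getconf` available; `cap_jobs` is a hypothetical helper, not part of sourmash or GNU parallel. For very small, I/O-light signatures a higher value may still be faster, as reported above, so treat the cap as a starting point rather than a rule.

```shell
# cap_jobs is a hypothetical helper, not part of sourmash or GNU parallel.
# It prints the requested job count, capped at the number of online CPU cores.
cap_jobs() (
  requested=$1
  # Ask the system for the online core count; fall back to 1 on failure.
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cores=1
  # Also fall back to 1 if the answer is empty or not a plain number.
  case $cores in
    ''|*[!0-9]*) cores=1 ;;
  esac
  if [ "$requested" -gt "$cores" ]; then
    echo "$cores"
  else
    echo "$requested"
  fi
)

# Example (commented out because it needs parallel, sourmash and real data):
# ls *.sig | parallel -j "$(cap_jobs 64)" 'sourmash sig rename {} {.} -o {}'
```

Wrapping a run in `time` for a handful of `-j` values on a small sample of signatures should show fairly quickly whether extra jobs help or only add I/O contention.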