sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0

cannot pass directory to multisearch or pairwise #533

Open peterjc opened 1 day ago

peterjc commented 1 day ago

Based on https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#how-do-i-store-and-search-collections-of-signatures, I expected to be able to pass a folder name to mean all the signatures in it.

However, multisearch appears to get stuck at 99% CPU:

❯ sourmash scripts multisearch -o /dev/stdout --ani signatures/ signatures | wc -l

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: None / moltype: DNA / threshold: 0.01
searching all sketches in 'signatures/' against 'signatures' using 8 threads
estimate ani? True / estimate probability of overlap? False
Reading query(s) from: 'signatures/'
       0
zsh: killed     sourmash scripts multisearch -o /dev/stdout --ani signatures/ signatures |
zsh: done       wc -l

Workaround: combine the signatures first:

❯ sourmash sig cat -o /tmp/pool.sig signatures/*.sig && sourmash scripts multisearch -o /dev/stdout --ani /tmp/pool.sig /tmp/pool.sig | wc -l

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

loaded 3 signatures total, from 3 files
loaded 3 signatures total.
output 3 signatures

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: None / moltype: DNA / threshold: 0.01
searching all sketches in '/tmp/pool.sig' against '/tmp/pool.sig' using 8 threads
estimate ani? True / estimate probability of overlap? False
Reading query(s) from: '/tmp/pool.sig'
Loaded 3 query signature(s)
Setting scaled=300 based on max scaled in query collection
Reading search(s) from: '/tmp/pool.sig'
Loaded 3 search signature(s)
DONE. Processed 9 comparisons
...multisearch is done! results in '/dev/stdout'
      10

It may be that the directory approach is not supported, but if so the command should abort with an error rather than hang.
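
Until the plugin rejects directories itself, a user-side guard can give the fail-fast behaviour asked for here. This is only a minimal sketch: `require_file_inputs` is a hypothetical helper, not part of sourmash or the plugin, and it assumes a plain signature file or a manifest is an acceptable input.

```shell
# Hypothetical helper (not part of sourmash or the plugin): refuse directory
# inputs up front so the command fails fast instead of hanging.
require_file_inputs() {
    for path in "$@"; do
        if [ -d "$path" ]; then
            echo "error: '$path' is a directory; pass a signature file or manifest instead" >&2
            return 1
        fi
    done
    return 0
}

# Example: only run multisearch when neither input is a directory.
# require_file_inputs signatures/ signatures/ && \
#     sourmash scripts multisearch -o results.csv --ani signatures/ signatures/
```

With `signatures/` as an argument the helper prints the error and returns non-zero, so the `&&` chain never starts the search.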

Tested on Intel macOS, installed via conda, using Python 3.12.

peterjc commented 1 day ago

Same with pairwise:

❯ sourmash scripts pairwise -o /dev/stdout --ani signatures/ | wc -l

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: None / moltype: DNA / threshold: 0.01
pairwise-comparing all sketches in 'signatures/' using 8 threads
Reading analysis(s) from: 'signatures/'
       0
zsh: terminated  sourmash scripts pairwise -o /dev/stdout --ani signatures/ |
zsh: done        wc -l

❯ sourmash sig cat -o /tmp/pool.sig signatures/*.sig && sourmash scripts pairwise -o /dev/stdout --ani /tmp/pool.sig | wc -l

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

loaded 3 signatures total, from 3 files
loaded 3 signatures total.
output 3 signatures

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: None / moltype: DNA / threshold: 0.01
pairwise-comparing all sketches in '/tmp/pool.sig' using 8 threads
Reading analysis(s) from: '/tmp/pool.sig'
Loaded 3 analysis signature(s)
Setting scaled=300 based on max scaled in collection
DONE. Processed 3 comparisons
...pairwise is done! results in '/dev/stdout'
       4

peterjc commented 23 hours ago

Directories are not documented as a supported input, so presumably this is not expected to work:

https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/main/doc/README.md#input-file-formats

ctb commented 23 hours ago

Yes, we should exit appropriately :)

Try using:

sourmash sig collect <dir> -F csv -o mf.csv

Then pass mf.csv as the input path. You might want to use --abspath.

ctb commented 22 hours ago

Err, you might want to use --abspath when running sig collect.
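
Putting the two comments above together, the manifest workaround might look like the sketch below. The function name `search_dir_via_manifest` and its three arguments are hypothetical. The sourmash commands and flags are the ones quoted in this thread, with `--abspath` given to `sig collect`.

```shell
# Sketch of the suggested workaround: build a CSV manifest for a directory of
# signatures, then pass that manifest as both query and search input.
# search_dir_via_manifest is a hypothetical helper name, not a sourmash command.
search_dir_via_manifest() {
    sig_dir=$1      # directory holding the signature files
    manifest=$2     # where to write the manifest, e.g. mf.csv
    results=$3      # where multisearch should write its CSV output

    # --abspath is passed to sig collect, as suggested in the thread.
    sourmash sig collect "$sig_dir" -F csv -o "$manifest" --abspath || return 1

    sourmash scripts multisearch -o "$results" --ani "$manifest" "$manifest"
}

# Example call:
# search_dir_via_manifest signatures/ mf.csv results.csv
```

For pairwise, the same manifest would be passed once, in place of /tmp/pool.sig in the earlier transcript.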