sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0

`index` doesn't work with a text file list of manifests #347

Open olgabot opened 3 months ago

olgabot commented 3 months ago

Hello, hope you are well!

I am very excited to try out the low-memory and fast searches created by RocksDB :) (Also, I will definitely be making use of pairwise!)

On my way there, I encountered some unexpected behavior. I had an enormous sequence file (UniRef50, 65M protein sequences) and cut it up into chunks of 100k sequences so that I could run `sourmash scripts manysketch -p protein,scaled=1,k=10,abund` without running out of resources.
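For anyone reproducing the chunking step, here is a minimal sketch of one way to split a FASTA file into chunks of a fixed number of sequences. It is not necessarily how the chunks above were made: the function name `split_fasta`, its arguments, and the `PREFIX_NNNNN.fa` output naming are all hypothetical, and it assumes plain (uncompressed) FASTA where every record starts with a `>` line.

```shell
# Hypothetical helper: split_fasta INPUT RECORDS_PER_CHUNK PREFIX
# Writes PREFIX_00000.fa, PREFIX_00001.fa, ... into the current directory.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new output file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%05d.fa", prefix, count / n)
            }
            count++
        }
        # Skip any lines that appear before the first header.
        out != "" { print > out }
    ' "$1"
}
```

A call such as `split_fasta uniref50.fasta 100000 chunk` (file name assumed) would then produce one file per 100k sequences, closing each file before opening the next so the number of open file handles stays at one.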

Then, I wanted to index these many files before searching them, but sourmash scripts index didn't work on a list of manifest files.

Here's a minimal reproduction, using the data in src/python/tests/test-data:

# Make input csv files
# (printf is used because echo only expands '\n' in some shells, e.g. zsh but not bash)
printf 'name,genome_filename,protein_filename\nshort,short.fa,\n' > short.csv
printf 'name,genome_filename,protein_filename\nshort,short2.fa,\n' > short2.csv
printf 'name,genome_filename,protein_filename\nshort,short3.fa,\n' > short3.csv

# Make sketches
sourmash scripts manysketch short.csv -o short.fa.zip -p dna,k=31,scaled=1 
sourmash scripts manysketch short2.csv -o short2.fa.zip -p dna,k=31,scaled=1
sourmash scripts manysketch short3.csv -o short3.fa.zip -p dna,k=31,scaled=1

# Make list of sketches (but they're actually manifests?)
for ZIP in short*.zip; do echo "$ZIP"; done > short_siglist.txt
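With hundreds of chunks, writing the input CSVs by hand gets tedious. The following is a minimal sketch of generating one, using the same `name,genome_filename,protein_filename` header as the reproduction above. The function name `write_manysketch_csv` is hypothetical, the sketch name is simply taken from the file's basename, and it fills the `genome_filename` column only (a protein input would go in the third column instead). It assumes file names contain no commas.

```shell
# Hypothetical helper: write_manysketch_csv OUTPUT.csv FASTA [FASTA ...]
write_manysketch_csv() {
    out="$1"
    shift
    # Header row, as used in the reproduction above.
    printf 'name,genome_filename,protein_filename\n' > "$out"
    for fasta in "$@"; do
        # Use the file name without directory or final extension as the name.
        name=$(basename "$fasta")
        name=${name%.*}
        printf '%s,%s,\n' "$name" "$fasta" >> "$out"
    done
}
```

For example, `write_manysketch_csv short2.csv short2.fa` writes a two-line CSV whose data row is `short2,short2.fa,`.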

Then, `sourmash scripts index` fails:

$ sourmash scripts index --ksize 31 --scaled 1 -o short_index.rocksdb short_siglist.txt   

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1 / moltype: DNA 
indexing all sketches in 'short_siglist.txt'
Loading siglist
Reading signature(s) from: 'short_siglist.txt'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short2.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short3.fa.zip'
No valid signatures found in signature pathlist 'short_siglist.txt'
WARNING: 3 signature paths failed to load. See error messages above.
Error: Signatures failed to load. Exiting.
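The message "expected value at line 1 column 1" looks like a JSON parse error, which would be consistent with each path in the list being read as a JSON signature file rather than as a zip archive (this is an inference from the log, not something confirmed in the code). A minimal sketch for checking what kind of file each entry in a path list actually is, based only on its first two bytes, could look like this. The function name `classify_paths` is hypothetical.

```shell
# Hypothetical helper: classify_paths PATHLIST.txt
# Prints "path<TAB>kind" for each listed path, where kind is
# zip, gzip, other (e.g. plain JSON text), or missing.
classify_paths() {
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue
        if [ ! -f "$path" ]; then
            kind="missing"
        else
            # Read the first two bytes as lowercase hex, e.g. "504b" for "PK".
            magic=$(head -c 2 "$path" | od -An -tx1 | tr -d ' \n')
            case "$magic" in
                504b) kind="zip" ;;
                1f8b) kind="gzip" ;;
                *)    kind="other" ;;
            esac
        fi
        printf '%s\t%s\n' "$path" "$kind"
    done < "$1"
}
```

Running `classify_paths short_siglist.txt` on the list above should report every entry as `zip`, which makes it easy to spot lists that mix zip archives with individual signature files.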

I'm realizing now that the short*.fa.zip files are manifests and not sigs, but I was confused that `sourmash scripts index` couldn't work with them, because all the parameters matched in the output of `sourmash sig describe`:

$ sourmash sig describe short.fa.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /Users/olgabot/code/sourmash_plugin_branchwater/src/python/tests/test-data/short.fa.zip
signature: short
source file: short.fa
md5: 9191284a3a23a913d8d410f3d53ce8f0
k=31 molecule=DNA num=0 scaled=1 seed=42 track_abundance=0
size: 970
sum hashes: 970
signature license: CC0

loaded 1 signatures total, from 1 files

The workaround is to use `sourmash sig cat` to combine the signatures into one file, but I was hoping not to do this until index creation since the input files are so big.

sourmash sig cat short*.zip -o combined_short.zip 
sourmash scripts index combined_short.zip --ksize 31 --scaled 1 -o short_index.rocksdb 

Let me know if I'm not thinking about this problem correctly and there's a better way to do it.

Hope this was informative! Thank you!

ctb commented 3 months ago

you are exactly right... they are not yet supported but rather desperately needed (see https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/266 and https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/235).

there are a few issues that are likely to take priority over upgrading this behavior - in particular, https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/322 and https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/331 are top of my mind right now - but your use case is really important functionality that we hope to implement soon.

ctb commented 3 months ago

(and yes, I think the documentation is also broken around this behavior. To quote Napoleon, “You can ask me for anything you like, except time” 😭 )

ctb commented 3 months ago

#364 "fixes" the documentation by commenting out the manifest CSV recommendations until we can support them.