sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

enable protein sketches for pangenome hash correlations #3201

Open AnneliektH opened 3 weeks ago

AnneliektH commented 3 weeks ago

calc-hash-presence.py from 2024-pangenome-hash-corr does not yet allow me to compare protein sketches instead of nucleotide sketches.

I'd like to compare protein pangenomes instead of nucleotide ones.

Trying in /group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/pangenomics/test_virpan

python ../../2024-pangenome-hash-corr/calc-hash-presence.py \
cluster1371.ranktable.csv \
cluster1371.zip \
-o cluster1371.dump \
--scaled=1

loaded 204371 hashvals... downsampling soon.
found 0 metagenomes
Traceback (most recent call last):
  File "/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/pangenomics/test_virpan/sigs/../../2024-pangenome-hash-corr/calc-hash-presence.py", line 80, in <module>
    sys.exit(main())
  File "/group/ctbrowngrp2/scratch/annie/2023-swine-sra/sourmash/pangenomics/test_virpan/sigs/../../2024-pangenome-hash-corr/calc-hash-presence.py", line 37, in main
    query_minhash = next(iter(idx.signatures())).minhash.copy_and_clear()
StopIteration

ctb commented 2 weeks ago

just added it now - https://github.com/ctb/2024-pangenome-hash-corr/commit/7440f8ffa4288119750b3d32f8ae458cd2ad8a74

you'll need to update your sourmash_plugin_pangenomics first, either to the latest released version or the latest dev version.

ctb commented 2 weeks ago

(let me know if it breaks ;)

ctb commented 2 weeks ago

hmm, some more things need to be fixed before it will work. sorry!

ctb commented 2 weeks ago

ok, fixed in 2024-pangenome-hash-corr (https://github.com/ctb/2024-pangenome-hash-corr/commit/f2433f24828d59f1c308c3ecd711b897cf5756fd) and in v0.2.2 of sourmash_plugin_pangenomics, I think.

you may need to use --protein --no-dna with the pangenome plugin scripts. You'll also need to specify the k-mer size and scaled value far too many times. Sorry, work in progress ;).
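
A note on the traceback in the original report: the StopIteration is raised by calling next() on an iterator that yielded nothing, so idx.signatures() was empty at that point. The "found 0 metagenomes" line suggests that no sketch matched the requested molecule type, k-mer size, and scaled value, though that is an inference from the log. A minimal sketch of a guard in plain Python follows; `first_or_none` and `pick_query` are hypothetical helper names for illustration, not part of sourmash or of calc-hash-presence.py.

```python
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def first_or_none(items: Iterable[T]) -> Optional[T]:
    """Return the first element of `items`, or None if it is empty."""
    # next() with a default never raises StopIteration.
    return next(iter(items), None)


def pick_query(signatures: Iterable[T], description: str) -> T:
    """Return the first signature, or raise a readable error if none matched.

    `description` is free text naming what was selected for
    (hypothetical parameter, used only in the error message).
    """
    first = first_or_none(signatures)
    if first is None:
        raise ValueError(
            f"no sketches matched {description}; "
            "check the molecule type, k-mer size, and scaled value"
        )
    return first
```

In the script, something like `pick_query(idx.signatures(), "protein sketches")` would then fail with a message that points at the selection parameters instead of a bare StopIteration. This is an illustration only, not what the linked commits do.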