sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
475 stars 80 forks source link

Extracting genus-specific kmers from LCA database #1784

Open awezmm opened 2 years ago

awezmm commented 2 years ago

Hi,

I've built a custom LCA database (using signature and taxonomy information) and now I have a lca.json file. I would like to extract "genus-specific" kmers (the discriminative kmers for each genus). Is there any example of how I could go about this? I would like to output the actual kmers (not hashes).

Thanks so much!

ctb commented 2 years ago

Hi @awezmm take a look at the attached python script.

I ran it like so:

./find-discrim-kmers.py podar-ref.lca.json.gz Archaea -o xyz.sig --save-names xyz

and then did:

sourmash sig kmers --signatures xyz.sig --sequence podar-ref/3.fa --save-kmers out.csv

to get the k-mers behind the hashes. This step was ...remarkably slow, which is something we will need to work on.

I should note that this doesn't give you all the lineage-discriminatory k-mers, either; it only gives you the k-mers matching to hashes that are discriminatory. A general solution to pick out all the k-mers that are specific to a set of lineages is something that's out of scope for sourmash, which largely deals with downsampled collections of hashes. Doing this for all k-mers is something that kProcessor should be able to do if you're interested - cc @mr-eyes. But you'd have to do your own taxonomic filtering.

find-discrim-kmers.py.txt