Open awezmm opened 2 years ago
Hi @awezmm take a look at the attached python script.
I ran it like so:
./find-discrim-kmers.py podar-ref.lca.json.gz Archaea -o xyz.sig --save-names xyz
and then did:
sourmash sig kmers --signatures xyz.sig --sequence podar-ref/3.fa --save-kmers out.csv
to get the k-mers behind the hashes. This step was ...remarkably slow, which is something we will need to work on.
I should note that this doesn't give you all the lineage-discriminatory k-mers, either; it only gives you the k-mers matching to hashes that are discriminatory. A general solution to pick out all the k-mers that are specific to a set of lineages is something that's out of scope for sourmash, which largely deals with downsampled collections of hashes. Doing this for all k-mers is something that kProcessor should be able to do if you're interested - cc @mr-eyes. But you'd have to do your own taxonomic filtering.
Hi,
I've built a custom LCA database (using signature and taxonomy information) and now I have a lca.json file. I would like to extract "genus-specific" kmers (the discriminative kmers for each genus). Is there any example of how I could go about this? I would like to output the actual kmers (not hashes).
Thanks so much!