Closed anmwinter closed 2 years ago
Hi ara,
yes, it's easy to ask for the union or intersection of hashes for any cluster. i've done this a fair bit. At what level are you working - command line or python?
thx, --titus
Titus,
I installed sourmash through pip. I am currently running it by the command line through a jupyter notebook. Is getting the union easier by running it through python?
Thanks! ara
Right, it's not a built-in feature at the command line interface, but it's relatively easy to do via Python.
Can you provide me with an example of the sort of workflow you want to use?
e.g.
The workflow you describe is pretty much what I am looking for:
- calculate signatures for a bunch of sequences
- cluster signatures at some threshold
- retrieve all signatures that cluster with a specific query signature
- build union or intersection of signatures within a cluster
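The last two steps of that workflow can be sketched in plain Python. This is only an illustration under some assumptions: each signature is treated as a set of integer hashes, the hash values are made up, the function names are hypothetical (not part of sourmash), and "clusters with the query" is simplified to "has Jaccard similarity at or above a threshold with the query".

```python
def jaccard(a, b):
    """Jaccard similarity of two hash sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_with(query, signatures, threshold):
    """Names of all signatures whose similarity to the query meets the threshold."""
    query_hashes = signatures[query]
    return sorted(
        name
        for name, hashes in signatures.items()
        if jaccard(query_hashes, hashes) >= threshold
    )


def union_and_intersection(names, signatures):
    """Union and intersection of the hash sets for the named signatures."""
    sets = [signatures[name] for name in names]
    return set.union(*sets), set.intersection(*sets)


# Toy data: each signature is a set of integer hashes (made-up values).
signatures = {
    "bat1": {1, 2, 3, 4, 5},
    "bat2": {2, 3, 4, 5, 6},
    "bat3": {10, 11, 12, 13, 14},
}

members = cluster_with("bat1", signatures, threshold=0.5)
union, shared = union_and_intersection(members, signatures)
```

Here `members` holds the signatures grouped with the query, `union` holds every hash seen in that group, and `shared` holds only the hashes present in all of its members.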
Longer term I'd like to be able to see if there is a signature that occurs across all samples. I am trying to sort out the species signatures and any geographic signatures. Currently our metagenomes are clustering by bat species with some exceptions.
Does sourmash use the same procedure that Mash uses to find similar hashes? And if so is that part coded in python?
One thing I wanted to try to code for was a table of "fuzzy" hashes that occur in each sample.

        fuzzyhash1  fuzzyhash2  fuzzyhash3
bat1    4           1           0
bat2    8           0           0
bat3    3           2           3
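A table like this could be built along the following lines. This is a rough sketch, not a sourmash feature: it assumes each "fuzzy hash" is some pre-defined group of hashes (how the groups are chosen is left open), that each sample is a set of integer hashes, and that a cell counts how many of the sample's hashes fall in the group. All names and values are hypothetical.

```python
def count_table(samples, groups):
    """For each sample, count how many of its hashes fall in each hash group."""
    return {
        sample: {
            group: len(hashes & members) for group, members in groups.items()
        }
        for sample, hashes in samples.items()
    }


def format_table(table, group_names):
    """Render the nested dict as tab-separated text, one row per sample."""
    lines = ["\t".join(["sample"] + list(group_names))]
    for sample, counts in table.items():
        row = [sample] + [str(counts[group]) for group in group_names]
        lines.append("\t".join(row))
    return "\n".join(lines)


# Made-up groups of hashes and made-up samples.
groups = {
    "fuzzyhash1": {1, 2, 3, 4},
    "fuzzyhash2": {10, 11},
    "fuzzyhash3": {20, 21, 22},
}
samples = {
    "bat1": {1, 2, 10},
    "bat2": {1, 2, 3, 4},
    "bat3": {3, 11, 20, 21},
}

table = count_table(samples, groups)
print(format_table(table, list(groups)))
```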
Are signatures and hashes the same thing?
On Fri, Jul 29, 2016 at 09:55:22AM -0700, Ara Winter wrote:
Are signatures and hashes the same thing?
Here's how I'm using the terms:
Hash: individual k-mer
Signature: collection of hashes
On Fri, Jul 29, 2016 at 09:30:15AM -0700, Ara Winter wrote:
The workflow you describe is pretty much what I am looking for:
- calculate signatures for a bunch of sequences
- cluster signatures at some threshold
- retrieve all signatures that cluster with a specific query signature
- build union or intersection of signatures within a cluster
ok! I'm not sure if I'll get to it this week but please do bump this issue in a week or so.
Longer term I'd like to be able to see if there is a signature that occurs across all samples. I am trying to sort out the species signatures and any geographic signatures. Currently our metagenomes are clustering by bat species with some exceptions.
ok - I can give you reasons why it might not work, but it's worth a try!
Does sourmash use the same procedure that Mash uses to find similar hashes? And if so is that part coded in python?
Yes (it's mash compatible) and no (not coded in python). It used to be and I could put together a Python description of the algorithm if you like.
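For readers who want the gist, here is a minimal Python sketch of the bottom-sketch MinHash idea as described in the Mash paper: hash every k-mer, keep the `num` smallest hashes, and estimate Jaccard similarity from the smallest hashes of the merged sketches. It is not sourmash's code. In particular, Mash and sourmash use MurmurHash3, while this sketch substitutes the first 8 bytes of an MD5 digest so it runs with the standard library only, so its hash values will not match theirs. The function names are illustrative.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def hash_kmer(kmer):
    """Stand-in hash (MD5 prefix); Mash and sourmash use MurmurHash3 instead."""
    digest = hashlib.md5(canonical(kmer).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(sequence, ksize, num):
    """Bottom sketch: the `num` smallest distinct k-mer hashes, sorted."""
    hashes = {
        hash_kmer(sequence[i:i + ksize])
        for i in range(len(sequence) - ksize + 1)
    }
    return sorted(hashes)[:num]


def estimate_jaccard(sketch_a, sketch_b, num):
    """Estimate Jaccard from the `num` smallest hashes of the merged sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:num]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


seq = "ACGTTGCATGCCGATAGCTAGGCTA"
sig = sketch(seq, ksize=5, num=10)
print(estimate_jaccard(sig, sig, num=10))
```

Because k-mers are made canonical before hashing, a sequence and its reverse complement produce the same sketch.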
One thing I wanted to try to code for was a table of "fuzzy" hashes that occur in each sample.

        fuzzyhash1  fuzzyhash2  fuzzyhash3
bat1    4           1           0
bat2    8           0           0
bat3    3           2           3
Would the fuzzyhash1 / fuzzyhash2 lists of hashes come from some sort of clustering or grouping of hashes in the signatures?
ok - I can give you reasons why it might not work, but it's worth a try!
Oh, I'd like to hear why this might not work. I've read through the Mash paper and I am still trying to wrangle with the concepts in there.
Would the fuzzyhash1 / fuzzyhash2 lists of hashes come from some sort of clustering or grouping of hashes in the signatures?
Yes, I was imagining a clustering plus picking a representative hash (similar to 16S OTU clustering).
I am in my second week of my post-doc and I have some time to develop/use new tools. Using signatures is at the top of my list since I stumbled across sourmash. I have a few other questions that I will start another thread for.
On Mon, Aug 01, 2016 at 07:30:31AM -0700, Ara Winter wrote:
ok - I can give you reasons why it might not work, but it's worth a try!
Oh, I'd like to hear why this might not work. I've read through the Mash paper and I am still trying to wrangle with the concepts in there.
Basically, the hashes in the signature give you an extraordinarily sensitive ability to detect similar species, but this falls off quickly as species diverge. The MetaPalette paper (http://msystems.asm.org/content/1/3/e00020-16) gives some good input here with respect to k-mer sizes and species/strain divergence.
So I'd worry about moderately distant genomes being completely disjoint in signature space.
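A back-of-the-envelope calculation shows why. This sketch assumes a deliberately simple model that is not taken from the thread: every base differs independently with probability `divergence`, both genomes have the same number of k-mers, and chance matches are ignored. Under that model a k-mer survives unchanged with probability (1 - divergence)^k, and if a fraction p of k-mers is shared the Jaccard similarity is p / (2 - p). The k-mer size of 31 is just an illustrative choice.

```python
def kmer_survival(divergence, ksize):
    """Probability that a k-mer contains no differing base."""
    return (1.0 - divergence) ** ksize


def expected_jaccard(divergence, ksize):
    """Jaccard similarity if a fraction p of equally many k-mers is shared."""
    p = kmer_survival(divergence, ksize)
    return p / (2.0 - p)


# Illustrative k-mer size and divergence levels (assumptions, not from the thread).
for divergence in (0.01, 0.05, 0.10, 0.20):
    print(
        f"divergence={divergence:.2f}  "
        f"shared k-mers={kmer_survival(divergence, 31):.4f}  "
        f"Jaccard={expected_jaccard(divergence, 31):.4f}"
    )
```

With k = 31, roughly one k-mer in a thousand survives at 20% divergence, so a signature holding a few hundred hashes will usually share nothing at all.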
Would the fuzzyhash1 / fuzzyhash2 lists of hashes come from some sort of clustering or grouping of hashes in the signatures?
Yes, I was imagining a clustering plus picking a representative hash (similar to 16S OTU clustering).
You'd probably want to work with as many hashes as possible, for sensitivity reasons.
I am in my second week of my post-doc and I have some time to develop/use new tools. Using signatures is at the top of my list since I stumbled across sourmash. I have a few other questions that I will start another thread for.
ok! Note that the YAML signature files are easy to parse in many languages, and the overall idea is surprisingly trivial, so you could easily develop your own code to work with the output of sourmash - I'd go with what you're comfortable with rather than relying too heavily on this code :)
Thanks @ctb ! I will read through the MetaPalette paper later today.
I just wrote a little python script to parse the YAML signature files so I could start hacking away.
So I'd worry about moderately distant genomes being completely disjoint in signature space.
So if you have a decently diverse metagenome, this same issue would crop up? Does increasing the number of hashes help with this?
very cool! if you want to share at some point it could be useful to others (or you can tell me what I can provide through this project's docs to help people like you in the future!)
Gladly! Right now it's just parsing one file. I need to fix it so it loops through all the .sig files. I am not the best at using github. So what is a good way to share the notebook with you through github?
Thanks again.
Diverse metagenome: yes, same issue.
Increasing number of hashes: no, probably not. Haven't found anything that does work (and I don't think I will).
Morning @ctb, I thought I would give the union of hashes a little bump here.
What are the commands for running sourmash through Python? I saw a few .py files in the repo.
thanks! ara
Documented all of this over in the API docs a while back, closing!
https://sourmash.readthedocs.io/en/latest/api-example.html#set-operations-on-hashes
Hello,
Hopefully I am using the right words here. Is there a way to get at which minhashes are driving each split in the cluster (or the larger groups)? I went through the Mash website and I think I understand how each hash is being created.
Ideally I'd like to grab a set of hashes that drives the split in the cluster and then run those against the RefSeq.msh to see what they are.
thanks, ara
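One simple way to approximate "the hashes that drive a split" is to take hashes found in every member of one group and in no member of the other. The sketch below assumes each signature is available as a set of integer hashes; the function name and the values are hypothetical, and this is only one possible definition of "driving".

```python
def distinguishing_hashes(group_a, group_b):
    """Hashes present in every member of group_a and absent from all of group_b."""
    core_a = set.intersection(*group_a)
    any_b = set.union(*group_b)
    return core_a - any_b


# Made-up hash sets for two groups on either side of a split.
group_a = [{1, 2, 3, 7}, {1, 2, 4, 7}]
group_b = [{2, 5, 6}, {2, 6, 8}]

only_a = distinguishing_hashes(group_a, group_b)
only_b = distinguishing_hashes(group_b, group_a)
```

A stricter or looser rule (for example, present in most members rather than all) may suit noisy metagenome data better. The resulting hashes could then be compared against the hashes of reference signatures to see which references contain them.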