Open jessicalumian opened 2 years ago
That code looks good to me!
Note that the output is a text file with just the relevant hashes in it (and not their counts); if you need a different format, e.g. something suitable for use with sourmash sig inflate
, lmk.
Ok cool! Because this is dealing with hashes, does that mean that k-mer abundance information is retained? Or would that require sourmash sig inflate
?
I'm not totally clear on how inflate
works. Is this correct:
Each sample has its own signature file. sample_A.sig has a hash is specific to the abundance of a k-mer. If sample_B.sig has that same hash, then sourmash sig inflate
would represent that hash in the signature of sample B?
Let me see if this response helps - ask away if not :)
when signatures are calculated with sketch dna -p abund
the signature stores both a hash (representing a k-mer) and an abundance (representing the multiplicity of that k-mer). Some operations can be done only on signatures with abundances, others can be done only on signatures without abundances, and some operations can be done on both.
sourmash sig flatten
removes abundances.
sourmash sig inflate
transfers abundances from one signature onto the hashes in another, where those hashes are in common; hashes that are not present in the intersection of the abund signature and the flat signature, they are ignored.
An exercise you might try - sourmash sig flatten a.sig -o b.sig
then sourmash sig inflate a.sig b.sig -o c.sig
should produce a c.sig
file that is equivalent to a.sig
. Although... it's actually kind of hard to track and diagnose abundances and so on. Let us know what would help you debug things, as you proceed!
a.sig
and c.sig
are identical -I ran sourmash search a.sig b.sig
and sourmash search a.sig c.sig
. The a-vs-b failed because b didn't have abundance, the a-vs-c suceeded with a 100% similarity.
sourmash sig describe
will tell you if a signature has abundances.
it's not hard to change the code in @taylorreiter Snakefile to output an abundance sig. I'll do that when I get a chance!
Check this 'un out -
(some things to be done to improve, but it's a start!)
That explanation helps a lot!! I'm trying to picture a use case for inflate
and flatten
. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)
Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?
That explanation helps a lot!! I'm trying to picture a use case for
inflate
andflatten
. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)
flatten
is important for situations where you need to or want to get rid of abundances - there are various other sourmash commands that require flat signatures. Like, you have to run gather
against genomes, which have to be flat. This in turn is so that we can have error messages that say "what you're doing doesn't make sense/we don't have any way to do that" as a way to guide the user that Here Be Dragons.
inflate
has come up mostly in situations like the one you're facing here -
Basically there are weird things power users sometimes want to do and we're trying to enable them in a sensible way where common things are actually available via the sourmash command line, and thus power users only need to do very oddball things via scripting.
tl;dr no one clear use case, but many fuzzy ones, so we supported it :)
Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?
yep! it's not always clear it matters but if it starts as a separate script then might as well keep it that way!
Note to self: sourmash sig export
and import
per https://github.com/sourmash-bio/sourmash/issues/1098 would help demonstrate/check/evaluate abundance stuff in signatures.
(Updated the git repo with comments and README.)
Thinking of integrating this into sourmash sig
as either commonhash
or occurrences
- any thoughts?
I love the idea of integrating this into sourmash sig
! I saw commonhash
and liked it -- i think as long as it's documented then occurrences
is also fine!
@jessicalumian -- I used the code you linked above to create a signature that had all of the hashes that I was interested in keeping. Then, I took each of my metagenome signatures and used sourmash sig intersect
to only keep the hashes of interest. There are definitely a bunch of different ways to achieve the same thing! Just wanted to add a note about the approach i took.
I ran the code from Titus and I have 1 giant signature of everything I want to keep from all my samples (woo!). Because I'm never out of questions:
If I use sourmash sig intersect
I see that abundances will be lost. But if I use --abundance-from
, the abundances will be transferred from my 1 giant signature to the individual signatures that are made. Is this correct? (Or, I could use sourmash sig inflate
to transfer abundances.)
Does it make sense to retain abundances for samples that have unique k-mers removed? It would be nice to see if microbe X is highly abundant in one sample and low in another. It seems like the removal of unique k-mers wouldn't affect this comparison because abundance was calculated from the original metagenome, right?
so I was worried this would happen 😆
you should actually have one gigantic file containing many signatures! Try running sourmash sig summarize <filename>
.
You can use sourmash sig split
to split those out into individual sig files, or output them directly that way using a directory output, e.g. -o foo/
. See output formats for more info!
ooooh ok I might actually have 1 file with multiple signatures! This is the output of sourmash sig summarize
** loading from 'kmers_21_in_multiple_samples_with_abundance.sig'
path filetype: MultiIndex
location: kmers_21_in_multiple_samples_with_abundance.sig
is database? no
has manifest? yes
num signatures: 20
** examining manifest...
total hashes: 1243224
summary of sketches:
20 sketches with DNA, k=21, scaled=2000, abund 1243224 total hashes
This would indicate that I actually have 1 file containing my 20 signatures (I had 20 input files in this case), right? After running sourmash sig split
I now have 20 signature files, which are the signatures from my initial samples with unique k-mers removed.
The names have a hash and some stuff (like the example here), but I can rename them with sourmash sig rename
.
Sourmash: There's a command for that. :muscle: :sparkles:
(this is so awesome)
For clarity, this workflow is:
sourmash sketch dna -p abund *.fq.gz
followed by ./filter-min-samples.py *.sig -o filtered.zip
or ./filter-min-samples.py *.sig -o ./output/
(from this repo). This will output a collection of signatures of only commonly shared hashes across the intersect of all individual signatures included while retaining the abundances stored in each individual signature.
Correct?
yep! then sourmash sig split
to split into separate signature files, then renaming sig files if desired. so:
sourmash sketch dna -p abund *.fq.gz
filter-min-samples.py
to create frakensignature filesourmash sig split {input} -gz -o outdir
(I haven't figured out the smart way to do this in a snakefile yet because the output names are hashes and other info (example), do for now I defined my output as the directory that will be created(very evocative names...)
you can actually do a lot of stuff with the frakensignature file, which is pretty snakemake friendly. Question is what you want to actually do :).
note that the sketches should still be named by whatever their sample name is; see sourmash sig describe <frakensignature.zip>
. Then sourmash sig cat --include <name> -o <name>.sig.gz
will split out the names etc.
(Basically, I try to not using named .sig files if I can avoid it, because the .zip is so much more convenient. But it might require retooling some of the specific workflow steps. not sure.)
Wild idea. Thinking about Branchwater in relation to this technique. We have attempted to extract metagenome sequences from the SRA with random forest classified signatures. That was not as informative as we had hoped with that particular search.
Could we instead identify a set of metagenomes of a particular phenotype (e.g. IBD) and isolate the shared common kmers of that phenotype? Subtracting a control phenotype common kmer set from the case phenotype signature either before or after creating the case common kmer sig.
Edit: Or, in addition to subtracting control phenotype common hash sets, we could subtract unique cases from each ohter as well. Edit2: I suppose you could call this identifying the core genomic components of phenotypic metagenome data
So you're suggesting:
4a. branchwater search using the core genomic components of each common hash set and compare the results
(I'm tempted to suggest we move this to a new issue, since it's diverging pretty far from the issue title - but we can do that later I guess ;)
I would be very interested also/separately in calculating Shannon entropy / information with respect to metadata. I have code 🤷
Trying this out with the CLI plugin infrastructure https://github.com/sourmash-bio/sourmash/pull/2438 - see PR https://github.com/ctb/2022-sourmash-filter-min-samples/pull/1.
Kinda neat - when all the machinery works, you get the ability to run:
% sourmash scripts commonhash ../sourmash/podar-ref/*.sig -o xxx.sig
The code has now been moved from https://github.com/ctb/2022-sourmash-filter-min-samples to https://github.com/ctb/sourmash_plugin_commonhash.
Leaving this issue open because it has a lot of good discussion that we should put in advanced documentation or something.
I think this fits here. https://github.com/mr-eyes/sourmash_plugin_extract_errors
Hello! I am working on replicating some (amazing) work from Taylor's IBD project.
I have signatures from
sourmash sketch dna
of metagenome samples. I would like to remove k-mers that are only present in 1 sample with the hope this will reduce noise in comparison between samples.It looks like Taylor does this here.
Would it be recommended to phagocytize this code, or write something else? Thank you!