Removing unique k-mers between samples

jessicalumian commented 2 years ago

Hello! I am working on replicating some (amazing) work from Taylor's IBD project.

I have signatures from sourmash sketch dna of metagenome samples. I would like to remove k-mers that are only present in 1 sample with the hope this will reduce noise in comparison between samples.

It looks like Taylor does this here.

Would it be recommended to phagocytize this code, or write something else? Thank you!

ctb commented 2 years ago

That code looks good to me!

Note that the output is a text file with just the relevant hashes in it (and not their counts); if you need a different format, e.g. something suitable for use with sourmash sig inflate, lmk.

jessicalumian commented 2 years ago

Ok cool! Because this is dealing with hashes, does that mean that k-mer abundance information is retained? Or would that require sourmash sig inflate?

I'm not totally clear on how inflate works. Is this correct:

Each sample has its own signature file. sample_A.sig has a hash is specific to the abundance of a k-mer. If sample_B.sig has that same hash, then sourmash sig inflate would represent that hash in the signature of sample B?

ctb commented 2 years ago

Let me see if this response helps - ask away if not :)

when signatures are calculated with sketch dna -p abund the signature stores both a hash (representing a k-mer) and an abundance (representing the multiplicity of that k-mer). Some operations can be done only on signatures with abundances, others can be done only on signatures without abundances, and some operations can be done on both.

sourmash sig flatten removes abundances.

sourmash sig inflate transfers abundances from one signature onto the hashes in another, where those hashes are in common; hashes that are not present in the intersection of the abund signature and the flat signature, they are ignored.

An exercise you might try - sourmash sig flatten a.sig -o b.sig then sourmash sig inflate a.sig b.sig -o c.sig should produce a c.sig file that is equivalent to a.sig. Although... it's actually kind of hard to track and diagnose abundances and so on. Let us know what would help you debug things, as you proceed!

what I did to verify that `a.sig` and `c.sig` are identical -

I ran sourmash search a.sig b.sig and sourmash search a.sig c.sig. The a-vs-b failed because b didn't have abundance, the a-vs-c suceeded with a 100% similarity.

other tricks

sourmash sig describe will tell you if a signature has abundances.

ctb commented 2 years ago

it's not hard to change the code in @taylorreiter Snakefile to output an abundance sig. I'll do that when I get a chance!

ctb commented 2 years ago

Check this 'un out -

https://github.com/ctb/2022-sourmash-filter-min-samples

ctb commented 2 years ago

(some things to be done to improve, but it's a start!)

jessicalumian commented 2 years ago

That explanation helps a lot!! I'm trying to picture a use case for inflate and flatten. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)

Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?

ctb commented 2 years ago

That explanation helps a lot!! I'm trying to picture a use case for inflate and flatten. The goal to be able to add or take away abundance information to allow you to do sourmash operations that are dependent on sigs being inflated or flat? Is there every a use case where you would inflate the abundance of one sample's signature onto a separate sample's flat signature? (I can't think of a reason for this, and this was my mental block before your explanation.)

flatten is important for situations where you need to or want to get rid of abundances - there are various other sourmash commands that require flat signatures. Like, you have to run gather against genomes, which have to be flat. This in turn is so that we can have error messages that say "what you're doing doesn't make sense/we don't have any way to do that" as a way to guide the user that Here Be Dragons.

inflate has come up mostly in situations like the one you're facing here -

I select hashes based on some weird criterion like "all hashes that appear in more than one file", in which it doesn't make sense to keep abundances.
so I calculate that, and then intersect all my individual signatures with those hashes.
but now I want the abundances back for the hashes that I kept!! please help!

Basically there are weird things power users sometimes want to do and we're trying to enable them in a sensible way where common things are actually available via the sourmash command line, and thus power users only need to do very oddball things via scripting.

tl;dr no one clear use case, but many fuzzy ones, so we supported it :)

Thank you for the python code!!! sneaking in a snakemake question: the best practice would be to have this script in the directory with my snakefile and call it in a rule as opposed to putting this code directly in the snakefile, right?

yep! it's not always clear it matters but if it starts as a separate script then might as well keep it that way!

ctb commented 2 years ago

Note to self: sourmash sig export and import per https://github.com/sourmash-bio/sourmash/issues/1098 would help demonstrate/check/evaluate abundance stuff in signatures.

ctb commented 2 years ago

(Updated the git repo with comments and README.)

ctb commented 2 years ago

Thinking of integrating this into sourmash sig as either commonhash or occurrences - any thoughts?

taylorreiter commented 2 years ago

I love the idea of integrating this into sourmash sig! I saw commonhash and liked it -- i think as long as it's documented then occurrences is also fine!

@jessicalumian -- I used the code you linked above to create a signature that had all of the hashes that I was interested in keeping. Then, I took each of my metagenome signatures and used sourmash sig intersect to only keep the hashes of interest. There are definitely a bunch of different ways to achieve the same thing! Just wanted to add a note about the approach i took.

jessicalumian commented 2 years ago

I ran the code from Titus and I have 1 giant signature of everything I want to keep from all my samples (woo!). Because I'm never out of questions:

If I use sourmash sig intersect I see that abundances will be lost. But if I use --abundance-from, the abundances will be transferred from my 1 giant signature to the individual signatures that are made. Is this correct? (Or, I could use sourmash sig inflate to transfer abundances.)

Does it make sense to retain abundances for samples that have unique k-mers removed? It would be nice to see if microbe X is highly abundant in one sample and low in another. It seems like the removal of unique k-mers wouldn't affect this comparison because abundance was calculated from the original metagenome, right?

ctb commented 2 years ago

so I was worried this would happen 😆

you should actually have one gigantic file containing many signatures! Try running sourmash sig summarize <filename>.

You can use sourmash sig split to split those out into individual sig files, or output them directly that way using a directory output, e.g. -o foo/. See output formats for more info!

jessicalumian commented 2 years ago

ooooh ok I might actually have 1 file with multiple signatures! This is the output of sourmash sig summarize

** loading from 'kmers_21_in_multiple_samples_with_abundance.sig'
path filetype: MultiIndex
location: kmers_21_in_multiple_samples_with_abundance.sig
is database? no
has manifest? yes
num signatures: 20
** examining manifest...
total hashes: 1243224
summary of sketches:
   20 sketches with DNA, k=21, scaled=2000, abund     1243224 total hashes

This would indicate that I actually have 1 file containing my 20 signatures (I had 20 input files in this case), right? After running sourmash sig split I now have 20 signature files, which are the signatures from my initial samples with unique k-mers removed.

The names have a hash and some stuff (like the example here), but I can rename them with sourmash sig rename.

Sourmash: There's a command for that. :muscle: :sparkles:

taylorreiter commented 2 years ago

(this is so awesome)

ccbaumler commented 2 years ago

For clarity, this workflow is:

sourmash sketch dna -p abund *.fq.gz followed by ./filter-min-samples.py *.sig -o filtered.zip or ./filter-min-samples.py *.sig -o ./output/ (from this repo). This will output a collection of signatures of only commonly shared hashes across the intersect of all individual signatures included while retaining the abundances stored in each individual signature.

Correct?

jessicalumian commented 2 years ago

yep! then sourmash sig split to split into separate signature files, then renaming sig files if desired. so:

create signatures from metagenomes with abundance tracking sourmash sketch dna -p abund *.fq.gz
remove k-mers only present in 1 sample with filter-min-samples.py to create frakensignature file
split up frakensignature file into individual sig files with sourmash sig split {input} -gz -o outdir (I haven't figured out the smart way to do this in a snakefile yet because the output names are hashes and other info (example), do for now I defined my output as the directory that will be created

ctb commented 2 years ago

(very evocative names...)

you can actually do a lot of stuff with the frakensignature file, which is pretty snakemake friendly. Question is what you want to actually do :).

note that the sketches should still be named by whatever their sample name is; see sourmash sig describe <frakensignature.zip>. Then sourmash sig cat --include <name> -o <name>.sig.gz will split out the names etc.

(Basically, I try to not using named .sig files if I can avoid it, because the .zip is so much more convenient. But it might require retooling some of the specific workflow steps. not sure.)

ccbaumler commented 1 year ago

Wild idea. Thinking about Branchwater in relation to this technique. We have attempted to extract metagenome sequences from the SRA with random forest classified signatures. That was not as informative as we had hoped with that particular search.

Could we instead identify a set of metagenomes of a particular phenotype (e.g. IBD) and isolate the shared common kmers of that phenotype? Subtracting a control phenotype common kmer set from the case phenotype signature either before or after creating the case common kmer sig.

Edit: Or, in addition to subtracting control phenotype common hash sets, we could subtract unique cases from each ohter as well. Edit2: I suppose you could call this identifying the core genomic components of phenotypic metagenome data

taylorreiter commented 1 year ago

So you're suggesting:

Find IBD metagenomes
Find other gut metagenomes from non-ibd data sets
from the IBD metagenomes, subtract hashes that are in signatures of the other gut metagenomes
do stuff with the leftovers

ccbaumler commented 1 year ago

4a. branchwater search using the core genomic components of each common hash set and compare the results

ctb commented 1 year ago

(I'm tempted to suggest we move this to a new issue, since it's diverging pretty far from the issue title - but we can do that later I guess ;)

I would be very interested also/separately in calculating Shannon entropy / information with respect to metadata. I have code 🤷

ctb commented 1 year ago

Trying this out with the CLI plugin infrastructure https://github.com/sourmash-bio/sourmash/pull/2438 - see PR https://github.com/ctb/2022-sourmash-filter-min-samples/pull/1.

Kinda neat - when all the machinery works, you get the ability to run:

% sourmash scripts commonhash ../sourmash/podar-ref/*.sig  -o xxx.sig

ctb commented 1 year ago

The code has now been moved from https://github.com/ctb/2022-sourmash-filter-min-samples to https://github.com/ctb/sourmash_plugin_commonhash.

Leaving this issue open because it has a lot of good discussion that we should put in advanced documentation or something.

mr-eyes commented 10 months ago

I think this fits here. https://github.com/mr-eyes/sourmash_plugin_extract_errors

sourmash-bio / sourmash