sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Higher-order "compare" method with zeta diversity metric #1189

Closed: nmb85 closed this issue 4 years ago

nmb85 commented 4 years ago

Hi, I am interested in using sourmash to compare entire metagenomes and hierarchically cluster them based on diversity metrics. There is new work with the "zeta diversity" metric suggesting that changing the "zeta order" affects how we understand the drivers of microbial biogeography and the turnover of rare versus abundant species. The familiar diversity metrics fit this framework: alpha diversity has a zeta order of 1, because it is the number of species in a single sample; beta diversity has a zeta order of 2, because it is the number of species shared (or not shared) by two samples; and metrics with higher zeta orders follow the same pattern for species shared by three samples, four samples, and so on. There is an illustration of this in Figure 5 of this paper, which applies the zeta diversity metric to soil microbiomes (both 16S surveys and shotgun metagenomes). Is it possible to calculate diversity metrics with zeta orders higher than 2 using sourmash signatures?
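
For concreteness, here is a rough plain-Python sketch (nothing sourmash-specific, and the sample sets are made up) of what a zeta order of k means: the mean number of species shared by every combination of k samples.

import itertools
from statistics import mean

# toy samples: each maps a sample name to its set of observed species (or hashes)
samples = {
    "s1": {"a", "b", "c"},
    "s2": {"b", "c", "d"},
    "s3": {"c", "d", "e"},
}

def zeta(samples, order):
    """Mean number of species shared by every combination of `order` samples."""
    return mean(
        len(set.intersection(*(samples[s] for s in combo)))
        for combo in itertools.combinations(samples, order)
    )

print(zeta(samples, 1))  # 3.0: mean alpha diversity
print(zeta(samples, 2))  # (2 + 1 + 2) / 3: mean number of species shared by pairs
print(zeta(samples, 3))  # 1: only "c" is shared by all three samples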

luizirber commented 4 years ago

Did a quick skim of the paper, and I think this is possible with the info we already output from sourmash gather (if you do the classification first and use the matches to calculate diversity).
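
A rough sketch of that route, assuming each sample has already been classified with sourmash gather and that each gather CSV's "name" column lists the matched references (the paths are placeholders); the resulting per-sample sets can then go through the same intersection logic as any other zeta calculation:

import csv

# placeholder paths: one gather CSV per metagenome sample
gather_csvs = {"sampleA": "sampleA_gather.csv", "sampleB": "sampleB_gather.csv"}

# one set of matched reference names per sample
matches = {}
for sample, path in gather_csvs.items():
    with open(path, newline="") as fh:
        matches[sample] = {row["name"] for row in csv.DictReader(fh)}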

A possible extension would be to do the zeta diversity on the hashes, which would require a bit more code (but is doable with the LCA index, which already has the mapping from hash to which signatures contain the hash). I'm not sure how robust the results would be... but worth trying it out =]

nmb85 commented 4 years ago

Hi, @luizirber, thanks for getting back to me. I am most interested in doing the calculation on the hashes because I do not want database limitations to skew results. I was able to figure out how to pull the hashes out of the signatures via the Python API and then do simple set operations on them this way:

# load modules for working with signatures, building set combinations, and set operations
import glob, sourmash, itertools, time

# get signature paths
mgsig_files = glob.glob(paths_to_sigs)

# load signatures into list
mgsigs = []
for mgsig_file in mgsig_files:
    mgsigs.append(sourmash.load_one_signature(mgsig_file, ksize = 31))

# get the hash lists from the signatures and store in a dictionary
mghashsets = {}
for mgsig in mgsigs:
    mghashset = mgsig.minhash.get_hashes()
    sample = mgsig.name().split("/")[1] # this line is specific to my input file paths and just gets the metagenome sample name to use as dictionary keys
    mghashsets[sample] = set(mghashset)

# get the different combinations of hash sets
intersect_size = 3
combos = list(itertools.combinations(mghashsets,intersect_size))

# calculate intersections between sets of hashes, this takes longer with more sets in each intersection (intersect_size), 
# and becomes difficult for a single node at an intersect_size of 6-8 or so for 40-50 metagenome signatures with < 10k hashes per signature

start = time.time()
intersects = {}

for combo in combos:
    hashsets = [ mghashsets[sample] for sample in combo ]
    intersects[combo] = set.intersection(*hashsets)
end = time.time()
print(end-start)

From these intersections of multiple hash sets, I can calculate the statistics I need for the zeta diversity decay curve. I can also determine via sourmash gather which species are shared by the signature hash sets in the intersection. Is this sound? And can you suggest a more efficient way to calculate these intersections (in Python)? Something I might be able to try myself, so as not to burden you and your team?

Mind you, this approach of measuring shared species among metagenomes at different levels of intersection is analogous to the pangenome paradigm of measuring shared genes among genomes at different levels of intersection.

nmb85 commented 4 years ago

Never mind, @luizirber. It is much more efficient to just create a table of the metagenome signature names, the hashes, and even the hash abundances, like this:

# load modules for working with signatures, building set combinations, and set operations
import glob, sourmash, pandas as pd

# get signature paths
mgsig_files = glob.glob(paths_to_sigs)

# load signatures into list
mgsigs = []
for mgsig_file in mgsig_files:
    mgsigs.append(sourmash.load_one_signature(mgsig_file, ksize = 31))

# extract the metagenome signature names, hashes, and counts into nested dictionary
sample_dict = {}
for mgsig in mgsigs:
    mins = mgsig.minhash.get_mins(with_abundance=True)
    sample = mgsig.name().split("/")[1]
    sample_dict[sample] = mins

# convert the nested dictionary into a pandas dataframe
sample_ids = []
min_list = []

for sample_id, mins in sample_dict.items():
    sample_ids.append(sample_id)
    min_list.append(pd.DataFrame.from_dict(mins, orient="index",columns=["abundance"]))

df = pd.concat(min_list, keys=sample_ids)

# reformat the pandas dataframe a bit and then save to disk as csv file to work with later/import into R
df.reset_index(inplace=True)
df = df.rename(columns={'level_0': 'experiment', 'level_1': 'hash'})
df = df.drop(columns="index", errors="ignore") # no-op if there is no leftover "index" column
df.to_csv(path_to_csv_file, sep = "\t", index_label = False)

Then I can calculate all the statistics I want from the table/matrix really quickly. Using itertools to create all combinations of signatures with shared hashes is a waste of time. Am I reinventing the wheel; does sourmash already have a method to save the signature names, hashes, and hash counts for a set of signatures to a tabular file?
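
For example, the zeta decay curve falls straight out of the long table, because the mean number of hashes shared by k samples only depends on how many samples contain each hash: by linearity it equals the sum over hashes of C(n_h, k) / C(N, k), where n_h is the number of samples containing that hash and N is the total number of samples. A sketch, assuming the df built above and Python 3.8+ for math.comb:

from math import comb

n_samples = df["experiment"].nunique()

# in how many samples does each hash occur?
occurrence = df.groupby("hash")["experiment"].nunique()

# zeta_k for every order, without enumerating sample combinations
zeta = {
    k: sum(comb(n, k) for n in occurrence) / comb(n_samples, k)
    for k in range(1, n_samples + 1)
}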

ctb commented 4 years ago

congratulations, you have leveled up to sourmash Power User level 3.

:) :)

thank you for the code!

luizirber commented 4 years ago

> Then I can calculate all the statistics I want from the table/matrix really quickly. Using itertools to create all combinations of signatures with shared hashes is a waste of time. Am I reinventing the wheel; does sourmash already have a method to save the signature names, hashes, and hash counts for a set of signatures to a tabular file?

We didn't have a method before, but now we do =] This is beautiful!

taylorreiter commented 4 years ago

I've been converting signatures to CSVs and then merging the CSVs into a single table, backfilling with zeros when the hash is not observed in a signature. Very time-consuming... this is much better!!!

nmb85 commented 4 years ago

@ctb, you're welcome for the code. I'm just happy to give back in a small capacity to a community that's helping me :smile:

@luizirber, ah, good!

@taylorreiter oh, cool, nice to know someone else is traveling the same path; I have a спутник (a satellite, yes, but also means a fellow traveler), haha! Hey, do you know how to efficiently convert a pandas dataframe from long form like above to wide form (a matrix with hashes along one axis, sample ids along another axis, and abundances filling the matrix)? I usually just jump straight into R with tabular data, so I'm not savvy with pandas dataframes.

taylorreiter commented 4 years ago

@nmb85 I ended up with this implementation:

rule make_hash_abund_table_wide:
    input: "outputs/hash_tables/normalized_abund_hashes_long.csv"
    output: "outputs/hash_tables/normalized_abund_hashes_wide.feather"
    run:
        import pandas as pd
        import feather

        ibd = pd.read_csv(str(input), dtype = {"minhash" : "int64", "abund" : "float64", "sample" : "object"})
        ibd_wide=ibd.pivot(index='sample', columns='minhash', values='abund')
        ibd_wide = ibd_wide.fillna(0)
        ibd_wide['sample'] = ibd_wide.index
        ibd_wide = ibd_wide.reset_index(drop=True)
        ibd_wide.columns = ibd_wide.columns.astype(str)
        ibd_wide.to_feather(str(output))

but this operation gave me a non-trivial number of problems. With larger matrices, this code will fail with "Unstacked DataFrame is too big, causing int32 overflow".

@luizirber and I came up with a similar implementation that uses dask:

import sys

import dask.dataframe as dd
import feather

ibd = dd.read_csv(sys.argv[1], dtype = {"minhash" : "category", "abund" : "float64", "sample" : "str"})
ibd["minhash"] = ibd["minhash"].cat.as_known()
ibd_wide=ibd.pivot_table(index='sample', columns="minhash", values='abund')
ibd_wide = ibd_wide.fillna(0)

ibd_wide.columns = list(ibd_wide.columns)
ibd_wide = ibd_wide.reset_index()
ibd_wide.columns = ibd_wide.columns.astype(str)
ibd_wide.compute().to_feather(sys.argv[2])

However, this approach can also be problematic, so parquet could be a useful alternative to feather.
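
For reference, a minimal sketch of the parquet route, assuming the same ibd_wide dask dataframe as above and pyarrow installed; dask writes parquet as a directory of part files and runs the computation itself, so the explicit compute() is not needed:

# same script as above, but write parquet instead of feather
ibd_wide.to_parquet(sys.argv[2], engine="pyarrow", write_index=False)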

Some links:

nmb85 commented 4 years ago

Thank you, @taylorreiter! This is so useful and saves me lots of time! It looks like I'll pick up a few more Python modules over the next week!

ctb commented 4 years ago

hi folks, do y'all think the code above would fit in well with an upgraded sourmash sig export command? See also #1098

ctb commented 3 years ago

revisiting this - right now I'm hesitant to add pandas as an installation requirement, so am going to pass on this for 4.0. maybe for 5.0 tho!