sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
478 stars 79 forks source link

How to plot/compute/store SBTs + metadata? #939

Open olgabot opened 4 years ago

olgabot commented 4 years ago

Hello, Hope you're doing well! I'm curious what your thoughts are for dealing with interpreting and visualizing sourmash signatures given some metadata.

My workflow is:

  1. compute sourmash signatures on a bunch (~1e4 - 1e6) of single cells with a lot of different ksizes, molecules, log2sketchsizes (same molecule, ksize, sketch size aka your "scaled" = "sketch id") using https://github.com/nf-core/kmermaid/
  2. Load signatures from the same "sketch id", do an all-by-all comparison to get a similarity matrix
  3. Input similarity matrix into plotting (e.g. UMAP) algorithms, coloring by metadata e.g. image
  4. Input similarity matrix into clustering (e.g. leiden graph-based clustering) algorithms to identify "blobs"
  5. Do "differential hash expression" -- hashes shared by clusters but not by random samples
  6. Un-hash the hashes to get the protein sequence, then use diamond blastp on Refseq to predict the protein using https://github.com/czbiohub/nf-predictorthologs/pull/25

The middle steps of 2-5 are quite manual and painful... What is the current best way to do this? Would there be room for a data structure/model that deals with SBTs + metadata and their maniuplations?

One potential model is the AnnData + Scanpy model, where AnnData is the underlying storage data structure, whereas Scanpy does all the fun compute, clustering, etc.

Eventually I'd like sourmash SBTs to be a first-class citizen in cellxgene but .. that's a long way off. For now, one can compute multiple umaps and input those as several coordinates.

Thanks! Olga

ctb commented 4 years ago

Well, my meta process has been as follows:

This is kind of how the whole LCA module evolved (blog posts + github repos -> sourmash lca!) and is how the sourmash cluster/cocluster stuff is evolving too.

This works pretty well for incubating code that for API changes or extensions to sourmash, while keeping the code labile by avoiding sourmash UX/testing/etc. limitations. It also avoids the temptation to top-down design big features and instead focus on demonstrated utility. And, last but not least, it bypasses all the annoyance of writing lots of tests for code that gets repeatedly refactored on short time scales.

Your differential hash expression does sound like something that we are encountering in other projects, BTW. I'd be interested in hearing more!

olgabot commented 4 years ago

Okay yeah that makes sense.. it's definitely a work in progress and I'm definitely guilty of over-engineering up front when things could just use a simple solution. Will keep you posted on scripts/notebooks/repos/blog posts

olgabot commented 4 years ago

Here's a start for differential hash expression: https://github.com/czbiohub/nf-predictorthologs/pull/26

ctb commented 4 years ago

hmm @taylorreiter that looks maybe interesting for IBD or other metadata correlation stuff, similar to what you're doing with RF.

taylorreiter commented 4 years ago

I would agree that dealing with hashes in quite manual and painful, but using workflows has helped (https://github.com/dib-lab/2020-ibd/blob/master/Snakefile#L176).

I think I take a similar approach as you have with differential expression. I converted by signatures into csvs (https://github.com/dib-lab/2020-ibd/blob/master/scripts/sig_to_csv.py), and then combined these csvs into a single dataframe that record abundance information where samples are columns and hashes are rows (https://github.com/dib-lab/2020-ibd/blob/master/Snakefile#L266). This in itself was sort of painful, as the dataframe can get really big and annoying to handle. However once I have that dataframe, I also have a csv with metadata and I use these two dataframes to do random forests/differential expression etc. between groups (ex https://github.com/dib-lab/2020-ibd/blob/master/scripts/vita_rf.R). This is all written in R and coordinated with Snakemake; so far I haven't written it to be terribly extensible to other problems, but I'm thinking of heading that direction.

I'm excited to look more closely at your differential expression stuff!

luizirber commented 4 years ago

One potential model is the AnnData + Scanpy model, where AnnData is the underlying storage data structure, whereas Scanpy does all the fun compute, clustering, etc.

That is sort of the idea with the Storage class in sourmash, and now that zipped SBTs are a thing there are also discussions to add taxonomy to the zip file, other extra information is game too.

I'm not a big fan of HDF5 because it is not easy to put it in the browser, or as easy to work in the CLI as a zip file. At one point I looked into N5 + Zarr, but it is not that well supported yet.