Open olgabot opened 4 years ago
Well, my meta process has been as follows:
This is kind of how the whole LCA module evolved (blog posts + github repos -> sourmash lca!) and is how the sourmash cluster/cocluster stuff is evolving too.
This works pretty well for incubating code that for API changes or extensions to sourmash, while keeping the code labile by avoiding sourmash UX/testing/etc. limitations. It also avoids the temptation to top-down design big features and instead focus on demonstrated utility. And, last but not least, it bypasses all the annoyance of writing lots of tests for code that gets repeatedly refactored on short time scales.
Your differential hash expression does sound like something that we are encountering in other projects, BTW. I'd be interested in hearing more!
Okay yeah that makes sense.. it's definitely a work in progress and I'm definitely guilty of over-engineering up front when things could just use a simple solution. Will keep you posted on scripts/notebooks/repos/blog posts
Here's a start for differential hash expression: https://github.com/czbiohub/nf-predictorthologs/pull/26
hmm @taylorreiter that looks maybe interesting for IBD or other metadata correlation stuff, similar to what you're doing with RF.
I would agree that dealing with hashes in quite manual and painful, but using workflows has helped (https://github.com/dib-lab/2020-ibd/blob/master/Snakefile#L176).
I think I take a similar approach as you have with differential expression. I converted by signatures into csvs (https://github.com/dib-lab/2020-ibd/blob/master/scripts/sig_to_csv.py), and then combined these csvs into a single dataframe that record abundance information where samples are columns and hashes are rows (https://github.com/dib-lab/2020-ibd/blob/master/Snakefile#L266). This in itself was sort of painful, as the dataframe can get really big and annoying to handle. However once I have that dataframe, I also have a csv with metadata and I use these two dataframes to do random forests/differential expression etc. between groups (ex https://github.com/dib-lab/2020-ibd/blob/master/scripts/vita_rf.R). This is all written in R and coordinated with Snakemake; so far I haven't written it to be terribly extensible to other problems, but I'm thinking of heading that direction.
I'm excited to look more closely at your differential expression stuff!
One potential model is the AnnData + Scanpy model, where AnnData is the underlying storage data structure, whereas Scanpy does all the fun compute, clustering, etc.
That is sort of the idea with the Storage
class in sourmash, and now that zipped SBTs are a thing there are also discussions to add taxonomy to the zip file, other extra information is game too.
I'm not a big fan of HDF5 because it is not easy to put it in the browser, or as easy to work in the CLI as a zip file. At one point I looked into N5 + Zarr, but it is not that well supported yet.
Hello, Hope you're doing well! I'm curious what your thoughts are for dealing with interpreting and visualizing sourmash signatures given some metadata.
My workflow is:
diamond blastp
on Refseq to predict the protein using https://github.com/czbiohub/nf-predictorthologs/pull/25The middle steps of 2-5 are quite manual and painful... What is the current best way to do this? Would there be room for a data structure/model that deals with SBTs + metadata and their maniuplations?
One potential model is the AnnData + Scanpy model, where AnnData is the underlying storage data structure, whereas Scanpy does all the fun compute, clustering, etc.
Eventually I'd like sourmash SBTs to be a first-class citizen in cellxgene but .. that's a long way off. For now, one can compute multiple umaps and input those as several coordinates.
Thanks! Olga