Open jolespin opened 2 years ago
What you're trying to store is a tree where the nodes are observations in your anndata, right?
Could you represent it as a directed graph in adjacency format in obsp?
Yea, you totally can but it would defeat the purpose of having the convenience of everything in one object that's usable. Your anndata program would be EXTREMELY useful for the microbiome/metagenomics community. I created a Dataset object in my soothsayer package that is similar but I'm just going to trash it and use your anndata object b/c it's much more streamlined. Anndata could be akin to phyloseq objects in R or similar to qiime2 artifact objects if it can be flexible with holding phylogenetic trees. Lots of overlap between the needs of scRNA-seq and metagenomics.
So, I would be interested in having storage for tree-like objects inside of anndata. This has also come up for genomics.
I'm not too familiar with metagenomic packages, but had been thinking of an analogy to TreeSummarizedExperiment from bioconductor.
> Yea, you totally can but it would defeat the purpose of having the convenience of everything in one object that's usable.
We have to be pretty selective here about the types we support. One of the main goals of this project is maintaining interoperable data formats. We need to figure out how to encode each new type in a way other languages would be able to read and interpret it. In practice, this means supporting more basic, common types – which complicated ones are often composed of. We'd also end up needing to support a huge number of methods, which isn't sustainable.
That said, we are looking to expose some internal methods for letting people encode custom types, though the cost here is whether other people would be able to read your data later.
One important question here, would you want to keep your tree "aligned" with the anndata object? E.g. if it could be stored in obsp would you want that?
> We have to be pretty selective here about the types we support. One of the main goals of this project is maintaining interoperable data formats. We need to figure out how to encode each new type in a way other languages would be able to read and interpret it. In practice, this means supporting more basic, common types – which complicated ones are often composed of. We'd also end up needing to support a huge number of methods, which isn't sustainable.
I agree, one reason why I gravitated towards anndata was how few dependencies it has and the generalizability. What if the object is just stored as a newick string? This will be tricky because if you're working with ETE3 or Skbio trees it will have to automatically detect and store as newick during saving. This wouldn't be too crazy to implement but it would need to know that the objects are trees. I could imagine separate slots called obst and vart that hold phylogenetic trees?
During file writing, it could do something like the pseudo-code below:
# Add variable level tree
adata.vart = tree_for_variables  # or add them like layers: adata.vart["some_tree_name"] = tree_for_variables

# Writing adata with tree
# http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#writing-newick-trees
# http://scikit-bio.org/docs/0.5.6/generated/skbio.tree.TreeNode.write.html#skbio.tree.TreeNode.write
if adata.vart is not None:
    adata.vart = adata.vart.write()
store h5ad

# Reading adata with tree
if skbio or ete3 installed:
    convert newick tree to one of them  # or not do this at all
else:
    just leave it as a newick string
Very rough thoughts above but just trying to brainstorm on how this would work with limited dependencies.
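To make the newick idea a bit more concrete without pulling in any tree library, here is a minimal sketch of serializing a rooted tree to a Newick string in plain Python. The `to_newick` helper and the `{parent: [children]}` input format are hypothetical, made up for this illustration; they are not part of anndata, ETE3, or Skbio, and branch lengths and label quoting are left out.

```python
# Hypothetical helper for illustration only; not an anndata, ete3, or skbio API.
def to_newick(children: dict, root: str) -> str:
    """Serialize a rooted tree given as {parent: [child, ...]} to a Newick string."""

    def render(node: str) -> str:
        kids = children.get(node, [])
        if not kids:
            return node  # a leaf is written as just its label
        # an internal node is written as "(child1,child2,...)label"
        return "(" + ",".join(render(kid) for kid in kids) + ")" + node

    return render(root) + ";"


# Toy tree: root has children A and inner; inner has children B and C.
children = {"root": ["A", "inner"], "inner": ["B", "C"]}
newick = to_newick(children, "root")
print(newick)
```

The result is an ordinary `str`, so nothing about it depends on which tree library (if any) is installed when the file is read back.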
> One important question here, would you want to keep your tree "aligned" with the anndata object? E.g. if it could be stored in obsp would you want that?
When you say aligned, do you mean coupled with multiple sequence alignment fasta? I feel like this could easily be stored in the .var object as a dataframe. I keep my fasta files loaded as pandas Series which can't be written as h5ad in the current implementation so to get around it I just added them either to .var or as a single column dataframe in .varm["sequences"].
Maybe I don't entirely understand the varp and obsp objects. These are square sparse arrays that are symmetric and the labels correspond to obs_names and var_names, respectively. Is that correct?
What tool do you use for your trees in Python? ETE3 was what I started using first but I've thought about switching to Skbio as my main tree source.
> What if the object is just stored as a newick string?
I'm not sure using a newick string would have advantages over a sparse adjacency matrix. Here's how you could go from an adjacency matrix to an ete3 tree:
import ete3
import numpy as np
from scipy import sparse

def tree_from_adjacency(adj: sparse.spmatrix, node_names: np.ndarray = None):
    # COO format exposes every edge as a (row, col, data) triple,
    # read here as (parent, child, distance)
    coo = adj.tocoo()
    if node_names is not None:
        parent_nodes = node_names[coo.row]
        child_nodes = node_names[coo.col]
    else:
        parent_nodes, child_nodes = coo.row, coo.col
    return ete3.Tree.from_parent_child_table(list(zip(parent_nodes, child_nodes, coo.data)))
Here you don't need to decode anything. In addition, many graph libraries will directly take this matrix, and will be using it as their internal representation anyways.
I think it would be nice to be able to tag these matrices with expected properties. E.g. knowing that this is a tree, a DAG, or undirected.
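As a rough illustration of what checking such a tag could look like, here is a sketch that tests whether an adjacency matrix describes a single rooted tree. The `is_rooted_tree` name is hypothetical (not an anndata or scipy function), and the sketch assumes rows are parents, columns are children, and every edge is stored with a nonzero weight.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components


def is_rooted_tree(adj) -> bool:
    """Return True if the entries adj[parent, child] form a single rooted tree.

    Assumption: every edge is a stored, nonzero entry (e.g. a positive branch length).
    """
    coo = sparse.coo_matrix(adj)
    n = coo.shape[0]
    if coo.shape[0] != coo.shape[1] or n == 0:
        return False
    # Number of incoming edges (i.e. parents) for each node.
    in_degree = np.bincount(coo.col, minlength=n)
    # Ignore edge direction when checking that everything hangs together.
    n_components, _ = connected_components(coo.tocsr(), directed=True, connection="weak")
    # A rooted tree has n - 1 edges, at most one parent per node, and one component.
    return bool(coo.nnz == n - 1 and in_degree.max() <= 1 and n_components == 1)


# Toy example: node 0 is the root with children 1 and 2; node 2 has child 3.
rows = np.array([0, 0, 2])
cols = np.array([1, 2, 3])
lengths = np.array([0.5, 1.0, 0.25])
tree_adj = sparse.coo_matrix((lengths, (rows, cols)), shape=(4, 4))

# An extra edge 3 -> 1 gives node 1 two parents: still a DAG, no longer a tree.
dag_adj = sparse.coo_matrix(
    (np.append(lengths, 1.0), (np.append(rows, 3), np.append(cols, 1))), shape=(4, 4)
)

tree_ok = is_rooted_tree(tree_adj)
dag_ok = is_rooted_tree(dag_adj)
print(tree_ok, dag_ok)
```

One caveat with this encoding: a branch of length zero would be indistinguishable from a missing edge unless it is stored explicitly, so this check is only a starting point.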
> When you say aligned, do you mean coupled with multiple sequence alignment fasta?
I mean, what are the nodes in your tree? Would there be a one-to-one relationship with these nodes and either the observations or variables of your AnnData?
It sounds like each node is a sequence, and each sequence is also a variable.
> Maybe I don't entirely understand the varp and obsp objects. These are square sparse arrays that are symmetric and the labels correspond to obs_names and var_names, respectively. Is that correct?
They are square matrices where the labels correspond to obs_names and var_names respectively. They don't have to be sparse, and they don't have to be symmetric. They typically are sparse adjacency matrices representing graphs.
> What tool do you use for your trees in Python? ETE3 was what I started using first but I've thought about switching to Skbio as my main tree source.
I haven't really worked much with genetic trees. When using a library to deal with trees I most frequently use networkx or igraph. I have been wanting to use python-graphblas.
> I'm not sure using a newick string would have advantages over a sparse adjacency matrix. Here's how you could go from an adjacency matrix to an ete3 tree:
Thank you, this will come in handy.
> I mean, what are the nodes in your tree? Would there be a one-to-one relationship with these nodes and either the observations or variables of your AnnData?
Yes, typically either a 1-to-1 or a superset (includes all variables and more). The former would make more sense in the context of your package tho.
> They are square matrices where the labels correspond to obs_names and var_names respectively. They don't have to be sparse, and they don't have to be symmetric. They typically are sparse adjacency matrices representing graphs.
Awesome, I'll be able to use this quite a bit then. Sometimes I have similarity matrices where the diagonal isn't 1.0 or 0.0 and not symmetric.
I found a workaround. It works when I pickle and unpickle.
import pickle

# Round-trip the AnnData object through pickle
with open("./dataset.anndata.pkl", "wb") as f:
    pickle.dump(adata, f)
with open("./dataset.anndata.pkl", "rb") as f:
    adata2 = pickle.load(f)
AnnData object with n_obs × n_vars = 107 × 155
    obs: 'SampleID', 'Sex', 'Age[Months]', 'Age[Days]', 'Visit[Day]', 'Date_Collected', 'SubjectID', 'Weight', 'Height', 'Nutritional_Status', 'WHZ', 'SubjectID(Alias)', 'SampleID(Alias)'
    var: 'Taxonomy', 'Confidence', 'Life', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species', 'sequences'
    uns: 'phylogeny'
How can I store phylogenetic trees in my anndata file object?
Here's the code:
Here's the error: