scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
575 stars 152 forks source link

Support for trees (e.g., ete3 trees) #832

Open jolespin opened 2 years ago

jolespin commented 2 years ago

How can I store phylogenetic trees in my anndata file object?

Here's the code:

import gzip, requests
import anndata as an
import ete3
from Bio.SeqIO.FastaIO import SimpleFastaParser

def read_url(url, params=None, **kwargs):
    r = requests.get(url, params=params, **kwargs)
    return r.content

# Load data
X = pd.read_csv("https://github.com/jolespin/walkthroughs/raw/main/data/microbiome__acute-malnutrition/X.tsv.gz", sep="\t", index_col=0)
df_meta_samples = pd.read_csv("https://github.com/jolespin/walkthroughs/raw/main/data/microbiome__acute-malnutrition/Y.tsv.gz", sep="\t", index_col=0)
df_taxonomy = pd.read_csv("https://github.com/jolespin/walkthroughs/raw/main/data/microbiome__acute-malnutrition/taxonomy.tsv.gz", sep="\t", index_col=0)

# Terrible hack for fasta that I'm not proud of
seq_data = pd.read_csv("https://github.com/jolespin/walkthroughs/raw/main/data/microbiome__acute-malnutrition/sequences.fasta.gz", sep="\t", header=None, index_col=0).index
headers = seq_data[np.arange(0,seq_data.size, step=2)].map(lambda id:id[1:])
seqs = seq_data[np.arange(0,seq_data.size, step=2) + 1]
sequences = pd.Series(dict(zip(headers, seqs)))

# ETE3 tree
tree = ete3.Tree(
    newick=read_url("https://github.com/jolespin/walkthroughs/raw/main/data/microbiome__acute-malnutrition/phylogeny.nwk").decode(),
    quoted_node_names=True,
)

# Anndata object
adata = an.AnnData(
    X=X,
    uns={"phylogeny":tree},
    obs=df_meta_samples.loc[X.index],
    var=pd.concat([df_taxonomy, sequences.to_frame("sequences")], axis=1).loc[X.columns],
)
adata.write("../../GitHub/walkthroughs/data/microbiome__dental-caries/test.h5ad", compression="gzip")

Here's the error:

/var/folders/tl/d72ycs3s6fz_x99nddcb_lch0003c0/T/ipykernel_75172/1130599964.py:1: FutureWarning: X.dtype being converted to np.float32 from int64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  adata = an.AnnData(
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    174     else:
--> 175         _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)
    176 

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/specs/registry.py in get_writer(self, dest_type, typ, modifiers)
     63         if (dest_type, typ, modifiers) not in self.write:
---> 64             raise TypeError(
     65                 f"No method has been defined for writing {typ} elements to {dest_type}"

TypeError: No method has been defined for writing <class 'ete3.coretype.tree.TreeNode'> elements to <class 'h5py._hl.group.Group'>

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/var/folders/tl/d72ycs3s6fz_x99nddcb_lch0003c0/T/ipykernel_75172/1130599964.py in <module>
      5     var=pd.concat([df_taxonomy, sequences.to_frame("sequences")], axis=1).loc[X.columns],
      6 )
----> 7 adata.write("../../GitHub/walkthroughs/data/microbiome__dental-caries/test.h5ad", compression="gzip")

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1916             filename = self.filename
   1917 
-> 1918         _write_h5ad(
   1919             Path(filename),
   1920             self,

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
    103         write_elem(f, "varp", dict(adata.varp), dataset_kwargs=dataset_kwargs)
    104         write_elem(f, "layers", dict(adata.layers), dataset_kwargs=dataset_kwargs)
--> 105         write_elem(f, "uns", dict(adata.uns), dataset_kwargs=dataset_kwargs)
    106 
    107 

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    212     def func_wrapper(elem, key, val, *args, **kwargs):
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:
    216             if "Above error raised while writing key" in format(e):

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    173         )
    174     else:
--> 175         _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)
    176 
    177 

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs)
     22         @wraps(func)
     23         def wrapper(g, k, *args, **kwargs):
---> 24             result = func(g, k, *args, **kwargs)
     25             g[k].attrs.setdefault("encoding-type", spec.encoding_type)
     26             g[k].attrs.setdefault("encoding-version", spec.encoding_version)

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/specs/methods.py in write_mapping(f, k, v, dataset_kwargs)
    279     g = f.create_group(k)
    280     for sub_k, sub_v in v.items():
--> 281         write_elem(g, sub_k, sub_v, dataset_kwargs=dataset_kwargs)
    282 
    283 

~/anaconda3/envs/soothsayer_py3.9_env/lib/python3.9/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    218             else:
    219                 parent = _get_parent(elem)
--> 220                 raise type(e)(
    221                     f"{e}\n\n"
    222                     f"Above error raised while writing key {key!r} of {type(elem)} "

TypeError: No method has been defined for writing <class 'ete3.coretype.tree.TreeNode'> elements to <class 'h5py._hl.group.Group'>

Above error raised while writing key 'phylogeny' of <class 'h5py._hl.group.Group'> to /
ivirshup commented 2 years ago

What you're trying to store is a tree where the nodes are observations in your anndata, right?

Could you represent it as a directed graph in adjacency format in obsp?

jolespin commented 2 years ago

Yea, you totally can but it would defeat the purpose of having the convenience of everything in one object that's usable. You're anndata program would be EXTREMELY useful for the microbiome/metagenomics community. I created a Dataset object in my soothsayer package that is similar but I'm just going to trash it and use your anndata object b/c it's much more streamlined. Anndata could be akin to phyloseq objects in R or similar to qiime2 artifact objects if it can be flexible with holding phylogenetic trees. Lots of overlap between the needs of scRNA-seq and metagenomics.

ivirshup commented 2 years ago

So, I would be interested in having storage for tree like objects inside of anndata. This has also come up for genomics.

I'm not too familiar with metagenomic packages, but had been thinking of an analogy to TreeSummarizedExperiment from bioconductor.

Yea, you totally can but it would defeat the purpose of having the convenience of everything in one object that's usable.

We have to be pretty selective here about the types we support. One of the main goals of this project is maintaining interoperable data formats. We need to figure out how to encode each new type in a way other languages would be able to read an interpret it. In practice, this means supporting more basic, common types – which complicated ones are often composed of. We'd also end needing to support a huge number of methods, which isn't sustainable.

That said, we are looking to expose some internal methods for letting people encode custom types, though the cost here is whether other people would be able to read your data later.


One important question here, would you want to keep your tree "aligned" with the anndata object? E.g. if it could be stored in obsp would you want that?

jolespin commented 2 years ago

We have to be pretty selective here about the types we support. One of the main goals of this project is maintaining interoperable data formats. We need to figure out how to encode each new type in a way other languages would be able to read an interpret it. In practice, this means supporting more basic, common types – which complicated ones are often composed of. We'd also end needing to support a huge number of methods, which isn't sustainable.

I agree, one reason why I gravitated towards anndata was how few dependencies it has and the generalizability. What if the object is just stored as a newick string? This will be tricky because if you're working with ETE3 or Skbio trees it will have to automatically detect and store as newick during saving. This wouldn't be too crazy to implement but it would need to know that the objects are trees. I could image a separate slot called obst and vart that has phylogenetic trees?

During file writing, it could do something like the pseduo-code below:

# Add variable level tree 
adata.vart = tree_for_variables # or add them like layers adata.vart["some_treee_name"] = tree_for_variables

# Writing adata with tree
# http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#writing-newick-trees
# http://scikit-bio.org/docs/0.5.6/generated/skbio.tree.TreeNode.write.html#skbio.tree.TreeNode.write
if adata.vart is not None:
     adata.vart = adata.vart.write()
     store h5ad

# Reading adata with tree 
if skbio or ete3 installed:
    convert newick tree to one of them # or not do this at all 
else:
     just leave it as a newick string

Very rough thoughts above but just trying to brainstorm on how this would work with limited dependencies.

One important question here, would you want to keep your tree "aligned" with the anndata object? E.g. if it could be stored in obsp would you want that?

When you say aligned, do you mean coupled with multiple sequence alignment fasta? I feel like this could easily be stored in the .var object as a dataframe. I keep my fasta files loaded as pandas Series which can't be written as h5ad in the current implementation so to get around it I just added them either to .var or as a single column dataframe in .varm["sequences"].

Maybe I don't entirely understand the varp and obsp objects. These are square sparse arrays that are symmetric and the labels correspond to obs_names and var_names, respectively. Is that correct?

What tool do you use for your trees in Python? ETE3 was I started using first but thought about switching to Skbio as my main tree source.

ivirshup commented 2 years ago

What if the object is just stored as a newick string?

I'm not sure using a newick string would have advantages over a sparse adjacency matrix. Here's how you could go from an adjacency matrix to an ete3 tree:

def tree_from_adjacency(adj: sparse.sparse_matrix, node_names: np.ndarray = None):
    coo = adj.tocoo()
    if node_names is not None:
        parent_nodes = node_names[coo.row]
        child_nodes = node_names[coo.col]
    else:
        parent_nodes, child_nodes = coo.row, coo.col
    return ete3.Tree.from_parent_child_table(list(zip(parent_nodes, child_nodes, coo.data)))

Here you don't need to decode anything. In addition, many graph libraries will directly take this matrix, and will be using it as their internal representation anyways.

I think it would be nice to be able to tag these matrices with expected properties. E.g. knowing that this is a tree, a DAG, or undirected.

When you say aligned, do you mean coupled with multiple sequence alignment fasta?

I mean, what are the nodes in your tree? Would there be a one-to-one relationship with these nodes and either the observations or variables of your AnnData?

It sounds like each node is a sequence, and each sequence is also a variable.

Maybe I don't entirely understand the varp and obsp objects. These are square sparse arrays that are symmetric and the labels correspond to obs_names and var_names, respectively. Is that correct?

They are square matrices where the labels correspond to obs_names and var_names respectively. They don't have to be sparse, and they don't have to be symmetric. They typically are sparse adjacency matrices representing graphs.

What tool do you use for your trees in Python? ETE3 was I started using first but thought about switching to Skbio as my main tree source.

I haven't really worked much with genetic trees. When using a library to deal with trees I most frequently use networkx or igraph. I have been wanting to use python-graphblas.

jolespin commented 2 years ago

I'm not sure using a newick string would have advantages over a sparse adjacency matrix. Here's how you could go from an adjacency matrix to an ete3 tree:

Thank you, this will come in handy.

I mean, what are the nodes in your tree? Would there be a one-to-one relationship with these nodes and either the observations or variables of your AnnData?

Yes, typically either a 1-to-1 or a superset (includes all variables and more). The former would make more sense in the context of your package tho.

They are square matrices where the labels correspond to obs_names and var_names respectively. They don't have to be sparse, and they don't have to be symmetric. They typically are sparse adjacency matrices representing graphs.

Awesome, I'll be able to use this quite a bit then. Sometimes I have similarity matrices where the diagonal isn't 1.0 or 0.0 and not symmetric.

I found a work around. It works when I pickle and unpickle.

with open("./dataset.anndata.pkl", "wb") as f:
    pickle.dump(adata, f)

with open("./dataset.anndata.pkl", "rb") as f:
    adata2 = pickle.load(f)

AnnData object with n_obs × n_vars = 107 × 155
    obs: 'SampleID', 'Sex', 'Age[Months]', 'Age[Days]', 'Visit[Day]', 'Date_Collected', 'SubjectID', 'Weight', 'Height', 'Nutritional_Status', 'WHZ', 'SubjectID(Alias)', 'SampleID(Alias)'
    var: 'Taxonomy', 'Confidence', 'Life', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species', 'sequences'
    uns: 'phylogeny'
github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!