scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
577 stars 154 forks source link

Reading Anndata from only parts of h5ad file: Hack solution #1517

Open mtvector opened 5 months ago

mtvector commented 5 months ago

I came across a previous issue #436 and couldn't get the dask solution working with my application, so I came up with a somewhat hacky solution to reading only the desired fields from an h5ad into an anndata (not chunking). It works by making a tree of all the fields in the H5, searching the tree for fields matching the ones you want to load, then loading the ancestors and descendants of that field. (see code below). Useful if you want to keep all your data together on disk, but only need to load some fields into memory.

Basically you run:

read_h5ad_backed_selective(model_path / 'p3_adata.h5ad', mode='r', selected_keys=['spliced', 'S_score', 'batch_name', 'var', 'uns', 'X_antipode'])

and get back:

AnnData object with n_obs × n_vars = 493759 × 16738 backed at './1.9.1.8.3_Dev-DMR/p3_adata.h5ad'
    obs: 'S_score', 'batch_name'
    var: '_index', 'h_gene', 'h_gene_size', 'h_left_flank', 'h_right_flank', 'h_syntenic'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'antipode_cluster_colors'
    obsm: 'X_antipode'
    layers: 'spliced'

Just wanted to share in case it is useful to someone!

# Author: mtvector a.k.a. Matthew Schmitz
import h5py
import anndata
from anndata import AnnData
from pathlib import Path
from typing import Union, Literal, List, Dict
from ete3 import Tree
from scipy.sparse import csr_matrix
import pandas as pd

def h5_tree(val):
    tree = {}
    for key, item in val.items():
        if isinstance(item, h5py._hl.group.Group):
            tree[key] = h5_tree(item)
        else:
            try:
                tree[key] = len(item)
            except TypeError:
                tree[key] = "scalar"
    return tree

def dict_to_ete3_tree(d, parent=None):
    if parent is None:
        parent = Tree(name="root")
    for k, v in d.items():
        child = parent.add_child(name=k)
        if isinstance(v, dict):
            dict_to_ete3_tree(v, child)
    return parent

def ete3_tree_to_dict(tree):
    def helper(node):
        if node.is_leaf():
            return node.name
        d = {}
        for child in node.get_children():
            d[child.name] = helper(child)
        return d

    root_dict = {}
    for child in tree.get_children():
        root_dict[child.name] = helper(child)
    return root_dict

def prune_tree(tree, keys):
    t = dict_to_ete3_tree(tree)

    nodes_to_keep = set()

    # Find all nodes matching the keys and collect their ancestors and descendants
    for key in keys:
        for node in t.search_nodes(name=key):
            nodes_to_keep.update(node.iter_ancestors())
            nodes_to_keep.update(node.iter_descendants())
            nodes_to_keep.add(node)

    # Prune the original tree by removing nodes that are not in nodes_to_keep
    for node in t.traverse("postorder"):
        if node not in nodes_to_keep and node.up:
            node.detach()

    pruned_dict = ete3_tree_to_dict(t)
    return pruned_dict

def read_h5_to_dict(h5_group, pruned_tree):
    def helper(group, subtree):
        result = {}
        for key, value in subtree.items():
            if isinstance(value, dict):
                # Ensure key exists in the group before trying to access it
                if key in group:
                    result[key] = helper(group[key], value)
                else:
                    result[key] = None  # Handle missing keys gracefully
            else:
                # Ensure key exists in the group before trying to access it
                if key in group:
                    if isinstance(group[key], h5py.Dataset):
                        if group[key].shape == ():
                            result[key] = group[key][()]  # Read scalar dataset
                        else:
                            data = group[key][:]
                            # Decode binary strings to regular strings if necessary
                            if data.dtype.kind == "S":
                                data = data.astype(str)
                            result[key] = data  # Read non-scalar dataset
                    else:
                        result[key] = None  # Handle non-dataset values gracefully
                else:
                    result[key] = None  # Handle missing keys gracefully
        return result

    return helper(h5_group, pruned_tree)

def convert_to_dataframe(data):
    df_dict = {}
    for key, value in data.items():
        if isinstance(value, dict) and "categories" in value and "codes" in value:
            categories = [
                cat.decode("utf-8") if isinstance(cat, bytes) else cat
                for cat in value["categories"]
            ]
            codes = value["codes"]
            df_dict[key] = pd.Categorical.from_codes(codes, categories)
        elif (
            isinstance(value, dict)
            and "data" in value
            and "indices" in value
            and "indptr" in value
        ):
            df_dict[key] = csr_matrix(
                (value["data"], value["indices"], value["indptr"])
            )
        else:
            if value.dtype.kind == "S":
                value = [
                    v.decode("utf-8") if isinstance(v, bytes) else v for v in value
                ]
            df_dict[key] = value
    df = pd.DataFrame(df_dict)
    return df

def handle_special_keys(data):
    if "obs" in data:
        data["obs"] = convert_to_dataframe(data["obs"])
    if "var" in data:
        data["var"] = convert_to_dataframe(data["var"])
    if ("layers" in data) or (data == "X"):
        try:
            for layer in data["layers"]:
                data["layers"][layer] = csr_matrix(
                    (
                        data["layers"][layer]["data"],
                        data["layers"][layer]["indices"],
                        data["layers"][layer]["indptr"],
                    )
                )
        except:
            pass
    return data

def read_h5ad_backed_selective(
    filename: Union[str, Path],
    mode: Literal["r", "r+"],
    selected_keys: List[str] = [],
    return_dict: bool = False,
) -> AnnData:
    f = h5py.File(filename, mode)

    attributes = ["obsm", "varm", "obsp", "varp", "uns", "layers"]
    df_attributes = ["obs", "var"]

    tree = h5_tree(f)
    selected_tree = prune_tree(tree, selected_keys)
    with f:
        d = read_h5_to_dict(f, selected_tree)
        d = handle_special_keys(d)
    d["filename"] = filename
    d["filemode"] = mode
    if return_dict:
        return d
    else:
        adata = AnnData(**d)
        return adata
flying-sheep commented 5 months ago

This does seem useful, thank you!

As you mentioned, it doesn’t integrate well with our versioned IO yet, so it’s really more of a recipe than a PR for now.

This might be a good addition for our “How to” tutorials section.

Are you interested in writing a little notebook?

mtvector commented 5 months ago

@flying-sheep Yeah I can try to make a little notebook, I'm also working on a function to overwrite selected fields as well that I can include. Should I just commit it and link you here?

rm1113 commented 5 months ago

Perhaps our more general solution that based on anndata._io.specs import read_elem could address this use case as well https://pypi.org/project/cap-anndata/

flying-sheep commented 5 months ago

Interesting! You don’t have to use an internal API for it btw, we’re exporting e.g. anndata.experimental.read_elem by now.

flying-sheep commented 5 months ago

Should I just commit it and link you here?

yeah, I’ll check if it’s a candidate for inclusion in our tutorial notebooks with few changes and we can go from there.

mtvector commented 5 months ago

Wow I wasn't aware of cap-anndata, that seems like a much more robust solution, I think that should probably be an example notebook rather than my hack solution!

On Wed, Jun 12, 2024, 23:59 Philipp A. @.***> wrote:

Should I just commit it and link you here?

yeah, I’ll check if it’s a candidate for inclusion in our tutorial notebooks with few changes and we can go from there.

— Reply to this email directly, view it on GitHub https://github.com/scverse/anndata/issues/1517#issuecomment-2164714888, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGX7H2ZKM2DDYO4NOX47BLTZHE7N7AVCNFSM6AAAAABI6F3CROVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRUG4YTIOBYHA . You are receiving this because you authored the thread.Message ID: @.***>

gtca commented 4 months ago

Just stumbled upon this issue, and as it's pretty recent, wanted to give a pointer to yet another experimental approach to only load parts of the data — https://github.com/scverse/shadows. Please give us feedback in case you end up trying it out!

mtvector commented 4 months ago

I love it when you start a conversation with an ad hoc approach and end up with several robust, purpose-built solutions :)