single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[Bug] - Ingesting multiple H5ADs with `.from_h5ad` #2172

Closed srdsam closed 5 months ago

srdsam commented 7 months ago

Describe the bug

When I ingest multiple h5ad files with tiledbsoma.io.from_h5ad using a registration_mapping (generated with tiledbsoma.io.register_h5ads), the from_h5ad function doesn't throw any errors, but only the data from the first h5ad seems to populate the SOMA object. Downstream objects populate obs and var correctly, but not X (it is just filled with 0s).

I ended up circumventing this error by using the append functions (e.g. append_obs, append_var, and append_X). I'm unsure whether I was misusing the tiledbsoma.io.from_h5ad function or the registration mapping. Data ingestion did seem to work fine with fewer than 10 h5ads (I didn't profile exactly where it breaks).

To Reproduce

The ingest function took in a list of h5ads converted from GTEx's bulk tissue data. I'm happy to provide the full code or the list of h5ads if that helps; I just want to sanity-check my use of TileDB-SOMA first.

def ingest(
    data: Union[List[Union[os.PathLike, str]], Union[os.PathLike, str]],
    modality: str,
    base_path: str,
    data_source: str,
    registration_mapping: Optional[ExperimentAmbientLabelMapping] = None,
) -> str:
    """
    Ingest data into a TileDB-SOMA experiment.
    """
    # Validate AnnData Objects
    data = validate(data)
    data = [data] if isinstance(data, (str, os.PathLike)) else data

    # Generate registration mapping
    if registration_mapping is None:
        registration_mapping = register_h5ads(
            experiment_uri=(
                base_path if Experiment.exists(base_path, context=SOMA_CTX) else None
            ),
            h5ad_file_names=data,
            measurement_name=modality,
            obs_field_name="obs_id",
            var_field_name="var_id",
            context=SOMA_CTX,
        )

    # Ingest raw h5ad files
    soma_uri = _ingest_h5ads(data, base_path, modality, registration_mapping)

    logger.info(f"Size of SOMA: {get_size_of_dir(base_path):.2f} MB")
    return soma_uri

@profile
def _ingest_h5ads(
    file_paths: List[str],
    base_path: str,
    modality: str,
    registration_mapping: ExperimentAmbientLabelMapping,
) -> str:
    """
    Handles ingestion of data from a list of file paths.
    """
    total_size = 0
    uri = None
    num_files = len(file_paths)
    logger.debug(f"registration mapping: \n{registration_mapping}")
    for index, h5ad in enumerate(file_paths):
        logger.debug(f"ingesting file {index + 1}/{num_files}: {h5ad}")
        total_size += get_size_of_dir(h5ad)
        uri = upload(h5ad, base_path, modality, registration_mapping)
    logger.info(f"size of H5ADs: {total_size:.2f} MB")
    return uri

@retry(Exception)
def upload(
    h5ad: str,
    base_path: str,
    modality: str,
    registration_mapping: ExperimentAmbientLabelMapping,
    mode="write",
) -> str:
    """
    Uploads an AnnData object to a TileDB-SOMA experiment.
    """
    with warnings.catch_warnings():
        uri = from_h5ad(
            experiment_uri=base_path,
            input_path=h5ad,
            measurement_name=modality,
            X_layer_name="raw",
            ingest_mode=mode,
            context=SOMA_CTX,
            registration_mapping=registration_mapping,
        )
    return uri

Versions (please complete the following information):

Additional context

I also noticed a similar issue when ingesting multiple measurements into the same SOMA object: obs is not written correctly. The 'single-cell' data in obs always ends up overwriting the bulk data. I solved this by just creating separate SOMAs.

For reference this is the code that works fine:

@profile
def ingest(
    data: List[Union[os.PathLike, str]],
    modality: str,
    base_path: str,
    data_source: str,
    registration_mapping: Optional[ExperimentAmbientLabelMapping] = None,
) -> str:
    """
    Ingest data into a TileDB-SOMA experiment. Expectation is that data is a
    list of h5ad files, that have been formatted according to the schemas in
    soma_builder/builder/schemas. `adata.X` should be raw counts, and
    `adata.layer['normalized']` should be normalized counts.
    """
    # Validate AnnData Objects
    data = validate(data)

    # Ingest h5ad files (and normalize if missing)
    soma_uri = _ingest_h5ads(data, base_path, modality, registration_mapping)

    # Precompute stats
    precompute_stats(base_path, modality)

    # Add metadata
    add_metadata(base_path, data_source)
    logger.info(f"Size of SOMA: {get_size_of_dir(base_path):.2f} MB")
    return soma_uri

@profile
def _ingest_h5ads(
    file_paths: List[str],
    base_path: str,
    modality: str,
    registration_mapping: Optional[ExperimentAmbientLabelMapping],
) -> str:
    """
    Handles ingestion from a list of file paths.
    """
    total_size = 0
    uri = None
    num_files = len(file_paths)

    # Generate registration mapping
    if registration_mapping is None:
        uri = base_path if Experiment.exists(base_path, SOMA_CTX) else None
        logger.info(f"registration uri: {str(uri)}")
        registration_mapping: ExperimentAmbientLabelMapping = somaio.register_h5ads(
            experiment_uri=uri,
            h5ad_file_names=file_paths,
            measurement_name=modality,
            obs_field_name="obs_id",
            var_field_name="var_id",
            context=SOMA_CTX,
        )
        logger.info(f"registration_mapping:\n{registration_mapping}")

    # Creating SOMA Schema
    template = scanpy.read_h5ad(file_paths[0])
    if "normalized" not in template.layers.keys():
        template.layers["normalized"] = coo_matrix(
            (
                np.array([], dtype=np.float32),
                (np.array([], dtype=int), np.array([], dtype=int)),
            ),
            shape=template.X.shape,
            dtype=np.float32,
        )
    #TODO: Update to adapt X_kind based on bulk or single-cell
    somaio.from_anndata(
        experiment_uri=base_path,
        anndata=template,
        measurement_name=modality,
        X_layer_name="raw",
        ingest_mode="schema_only",
        context=SOMA_CTX,
        registration_mapping=registration_mapping,
    )

    # Ingesting Files
    for index, h5ad in enumerate(file_paths):
        logger.debug(f"ingesting file {index + 1}/{num_files}: {h5ad}")
        total_size += get_size_of_dir(h5ad)
        uri = _upload(
            h5ad,
            base_path,
            modality,
            registration_mapping,
        )
    logger.info(f"size of H5ADs: {total_size:.2f} MB")
    return uri

@retry(Exception)
def _upload(
    h5ad: str,
    base_path: str,
    modality: str,
    registration_mapping: ExperimentAmbientLabelMapping,
) -> str:
    """
    Uploads an AnnData object to a TileDB-SOMA experiment.

    Note:
        Using standard somaio.ingest function with registration mapping causes incorrect mapping between `obs` and `X`.
    """
    if Experiment.exists(base_path, SOMA_CTX):
        with Experiment.open(uri=base_path, mode="w", context=SOMA_CTX) as exp:
            adata = scanpy.read_h5ad(h5ad)
            # add to obs
            somaio.append_obs(
                exp,
                adata.obs,
                registration_mapping=registration_mapping,
                context=SOMA_CTX,
            )
            # add to var
            somaio.append_var(
                exp,
                adata.var,
                measurement_name=modality,
                registration_mapping=registration_mapping,
                context=SOMA_CTX,
            )
            # add to raw layer indexing along obs_id and var_id
            somaio.append_X(
                exp,
                adata.X,
                measurement_name=modality,
                X_layer_name="raw",
                obs_ids=list(adata.obs["obs_id"]),
                var_ids=list(adata.var["var_id"]),
                registration_mapping=registration_mapping,
                context=SOMA_CTX,
            )
            # handle creating/adding normalized layer indexing along obs_id and var_id
            if "normalized" in adata.layers.keys():
                somaio.append_X(
                    exp,
                    adata.layers["normalized"],
                    measurement_name=modality,
                    X_layer_name="normalized",
                    obs_ids=list(adata.obs["obs_id"]),
                    var_ids=list(adata.var["var_id"]),
                    registration_mapping=registration_mapping,
                    context=SOMA_CTX,
                )
            else:
                """
                From CXG Census:
                    > The normalized layer is built by dividing each value in the raw count matrix by
                      its corresponding row sum (i.e. size normalization).
                """
                norm_X = scanpy.pp.normalize_total(
                    adata,
                    target_sum=1,
                    exclude_highly_expressed=True,
                    max_fraction=0.2,
                    inplace=False,
                )["X"]
                somaio.append_X(
                    exp,
                    norm_X,
                    measurement_name=modality,
                    X_layer_name="normalized",
                    obs_ids=list(adata.obs["obs_id"]),
                    var_ids=list(adata.var["var_id"]),
                    registration_mapping=registration_mapping,
                    context=SOMA_CTX,
                )
            del adata
            gc.collect()
    else:
        raise RuntimeError("Schema does not exist")
    return base_path
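For clarity, the Census-style size normalization quoted in the comment above boils down to dividing each row of the raw count matrix by its row sum (the actual scanpy.pp.normalize_total call additionally excludes highly expressed genes from the sums via exclude_highly_expressed; this toy numpy sketch shows only the plain version):

```python
import numpy as np

# Toy sketch of size normalization with target_sum=1: divide each row of the
# raw count matrix by its row sum, so every normalized row sums to 1.
raw = np.array([[2.0, 2.0, 4.0],
                [1.0, 0.0, 3.0]])
row_sums = raw.sum(axis=1, keepdims=True)  # shape (n_obs, 1), broadcasts over columns
normalized = raw / row_sums
print(normalized)  # rows: [0.25, 0.25, 0.5] and [0.25, 0.0, 0.75]
```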

Thanks in advance!

johnkerl commented 7 months ago

@srdsam thanks for the report!

Indeed, if you can please share input-data files, that would be helpful. I tried your script above and was unable to reproduce the problem.

If you can share full data, great -- or, it would suffice at first to share the "obs_id" and "var_id" columns for adata.obs and adata.var for each of your input-data files. (My guess is that the obs_id fields are repeated across your multiple .h5ad files, but I'd love to validate this guess.)
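A quick way to check that guess without sharing the full data is to look for ids repeated across the files. A minimal sketch (find_duplicate_obs_ids is a hypothetical helper; it assumes you've pulled the "obs_id" column out of each file's adata.obs into a plain list):

```python
from collections import Counter
from typing import List

def find_duplicate_obs_ids(obs_ids_per_file: List[List[str]]) -> List[str]:
    """Given one list of obs_id values per .h5ad file, return the ids
    that appear more than once across all files combined."""
    counts = Counter(oid for ids in obs_ids_per_file for oid in ids)
    return sorted(oid for oid, n in counts.items() if n > 1)

# Example: the same barcode showing up in two files would be flagged.
dupes = find_duplicate_obs_ids([["AAAC-1", "AAAG-1"], ["AAAC-1", "TTTG-1"]])
print(dupes)  # ['AAAC-1']
```

If this returns a non-empty list, the registration mapping will be collapsing those rows rather than appending them.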

srdsam commented 7 months ago

Thanks for the prompt reply @johnkerl ! Sorry been a bit over-subscribed this week but I'll regenerate and try to share them this weekend. I'll also check uniqueness myself.

Also as a quick general question, is the best place to reach out for questions around TileDB-SOMA here or via Slack (CZI Science/TileDB's one)?

For context:

I'm working on leveraging TileDB-SOMA in a production environment to develop a data explorer for the Census, among other datasets. I've been taking notes from CZI's projects (single-cell-data-explorer, cellxgene, etc.), but I still have some questions (and will likely run into a few edge cases) and am hesitant to over-clutter the issue pages here with non-issues.

P.S. I've also been able to figure out a lot from your issues/PRs here so the discussions here have been super helpful :)

johnkerl commented 7 months ago

Thanks for the prompt reply @johnkerl ! Sorry been a bit over-subscribed this week but I'll regenerate and try to share them this weekend. I'll also check uniqueness myself.

All good! :)

Also as a quick general question, is the best place to reach out for questions around TileDB-SOMA here or via Slack (CZI Science/TileDB's one)?

Either is fine! :) I guess I'd lean slightly toward here as it's public by default

For context: ...

That is helpful, thanks!

P.S. I've also been able to figure out a lot from your issues/PRs here so the discussions here have been super helpful :)

Thanks for the positive feedback -- the goal indeed is to build up in realtime a transparent reference corpus 🙏

johnkerl commented 5 months ago

Hi @srdsam -- should we close this?

srdsam commented 5 months ago

Yep! Closed :)