katherine-benjamin / topact-paper

Code for the paper `Multiscale topology classifies and quantifies cell types in subcellular spatial transcriptomics"
2 stars 0 forks source link

Trying with Visium HD with 600k bins... spatial.csv becomes massive and memory intensive - Not usable for Visium HD data #3

Open Rafael-Silva-Oliveira opened 1 week ago

Rafael-Silva-Oliveira commented 1 week ago

I'm trying to create a function that extracts the necessary information directly from adata files


def annData_converter(
    sc_adata, st_adata, sc_annotation="ann_level_3_transferred_label"
):
    # Extract the count matrix from the single-cell data
    mtx = scipy.sparse.coo_matrix(sc_adata.X).astype("int64")

    # Extract the gene names
    genes = sc_adata.var_names.tolist()

    # Extract the labels
    labels = sc_adata.obs[sc_annotation].tolist()
    spatial_data = st_adata.obs[["array_row", "array_col"]]

    # Reshape the st_adata.X matrix and create a new DataFrame
    counts = pd.DataFrame(
        st_adata.X.toarray(), columns=st_adata.var_names, index=spatial_data.index
    )
    counts = counts.stack().reset_index()
    counts.columns = ["gene", "spot", "counts"]

    # Merge the counts DataFrame with the spatial_data DataFrame
    spatial_data = spatial_data.reset_index().rename(columns={"index": "spot"})
    counts = counts.merge(spatial_data, left_on="spot", right_on="spot")

    # Rename the columns
    counts = counts.rename(columns={"array_row": "x", "array_col": "y"})

    # Reorder the columns
    counts = counts[["x", "y", "gene", "counts"]]

    return mtx, genes, labels, spatial_data

But the df for spatial seems like an odd choice? If I have 17000 genes with 600k bins/coordinates, then we'll have a dataframe of 600.000 x 17000 (10,200,000,000 rows). This will naturally result in memory errors. Any other way to do this that is not as memory intensive? I was quite happy to see the mention of Visum HD, but it doesn't seem to be quite compatible

Thanks

katherine-benjamin commented 1 week ago

Hi Rafael,

The df format should be one row per transcript, with three columns for x,y coords and gene ID.

You can see an example input format here: https://gitlab.com/kfbenjamin/topact/-/blob/master/topact-demo/spatial.csv

So e.g. for Stereo-seq although we have on the order of millions of spots and tens of thousands of genes, a typical input file will only have millions of rows (1-2 transcripts per spot).

Hope this makes sense!

Rafael-Silva-Oliveira commented 1 week ago

Hi Rafael,

The df format should be one row per transcript, with three columns for x,y coords and gene ID.

You can see an example input format here: https://gitlab.com/kfbenjamin/topact/-/blob/master/topact-demo/spatial.csv

So e.g. for Stereo-seq although we have on the order of millions of spots and tens of thousands of genes, a typical input file will only have millions of rows (1-2 transcripts per spot).

Hope this makes sense!

Hi, thank you for the reply!

Any ideia on how to pass a dataframe like this one to the desired long format:

image

Trying to replace 0s with nans and then trying to drop those while doing the stacking for the long format, but it always gives a memory error.

The spot ID is pretty much the same as x and y coordinates (I just need to merge that info later on once I have the long format). Then I'd have the transcript per coordinate that has a count >= 1. But I can't just do that filtering on the wide format, I'd need to stack it and then do the filtering (or some other way in between), but it gets way too large of a dataframe to do anything with it

Rafael-Silva-Oliveira commented 1 week ago

I've updated the function so that it returns the necessary components to run topACT (mtx, genes, labels and the spatial information) from scanpy/anndata. I had to loop through each gene and stack it in long format and then concat in a single dataframe at the end removing the 0 counts. The concatenated dataframe has around 205 million rows... index gene x y count s_008um_00634_00647-1 Samd11 634 647 1.0 s_008um_00190_00220-1 Samd11 190 220 1.0 s_008um_00630_00180-1 Samd11 630 180 1.0 s_008um_00257_00478-1 Samd11 257 478 1.0 s_008um_00651_00616-1 Samd11 651 616 1.0 ... ... ... ... ... s_008um_00252_00421-1 Mt-cyb 252 421 1.0 s_008um_00384_00793-1 Mt-cyb 384 793 1.0 s_008um_00653_00166-1 Mt-cyb 653 166 1.0 s_008um_00565_00244-1 Mt-cyb 565 244 4.0 s_008um_00353_00477-1 Mt-cyb 353 477 4.0

It can be however a bit slow depending on the number of genes a user has image

This introduces new dependencies: tqdm, scanpy, anndata, and loguru

import pandas as pd

from scipy import io

from topact.countdata import CountMatrix
from topact.classifier import SVCClassifier, train_from_countmatrix
from topact import spatial
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import scipy.sparse
from loguru import logger
from tqdm import tqdm
import anndata as ad

def readfile(filename):
    with open(filename) as f:
        return [line.rstrip() for line in f]

adata_st = sc.read_h5ad(r"/path/to/spatial_anndata/raw_adata.h5ad")
adata_sc = sc.read_h5ad(r"/path/to/scRNA_anndata/raw_adata.h5ad")

def annData_converter(
    sc_adata: ad.AnnData,
    st_adata: ad.AnnData,
    sc_annotation: str = "ann_level_3_transferred_label",
):
    """
    Converts single-cell and spatial transcriptomics data into a format suitable for further analysis.

    This function makes gene names unique, extracts count matrix, gene names, and labels from single-cell data,
    creates a wide format count matrix from spatial data, stacks each individual gene in long format, and
    concatenates all stacked genes into a single long format dataframe.

    Args:
        sc_adata (anndata.AnnData): The single-cell RNA-seq data in an AnnData object.
        st_adata (anndata.AnnData): The spatial transcriptomics data in an AnnData object.
        sc_annotation (str, optional): The annotation level to use from the single-cell data.
            Defaults to "ann_level_3_transferred_label".

    Returns:
        mtx (scipy.sparse.coo_matrix): The count matrix from the single-cell data.
        genes (list): The list of gene names.
        labels (list): The list of labels.
        counts_stacked (pandas.DataFrame): The concatenated dataframe of all stacked genes in long format.
    """

    logger.info("Making gene names unique for both scRNA and ST dataset.")
    sc_adata.var_names_make_unique()
    st_adata.var_names_make_unique()
    # Extract the count matrix from the single-cell data
    mtx = scipy.sparse.coo_matrix(sc_adata.X).astype("int64")

    # Extract the gene names
    genes: list = sc_adata.var_names.tolist()

    # Extract the labels
    labels: list = sc_adata.obs[sc_annotation].tolist()

    # Extract the columns from spatial_adata.obs that contain the x,y coordinates (row = X, col = Y)
    spatial_data = st_adata.obs[["array_row", "array_col"]]

    logger.info(
        "Creating a wide format count matrix from spatial data with gene as columns and cell ID as rows."
    )
    # Create the DataFrame in wide format
    counts = pd.DataFrame.sparse.from_spmatrix(
        st_adata.X,
        columns=st_adata.var_names,
        index=spatial_data.index,
    )

    # Create an empty list to save each stacked gene in long format
    stacked_columns = []

    logger.info("Stacking each individual gene in long format")
    for column in tqdm(counts.columns):
        stacked_column = (
            counts[column]
            .sparse.to_dense()
            .replace(0, np.nan)
            .dropna()
            .rename("counts")
            .to_frame()
        )
        stacked_column["gene"] = column

        merged_stacked_column = (
            stacked_column.merge(spatial_data, left_index=True, right_index=True)
            .rename(columns={"array_row": "x", "array_col": "y"})
            .reindex(columns=["gene", "x", "y", "counts"])
        )

        # Concate the current gene to the list of stacked columns
        stacked_columns.append(merged_stacked_column)

    logger.info("Concatenating all stacked genes into a single long format dataframe")
    # Append all stacked genes into a single dataframe with columns gene, x,y, counts
    counts_stacked = pd.concat(stacked_columns)

    return mtx, genes, labels, counts_stacked

if __name__ == "__main__":
    # Convert AnnData to necessary formats
    mtx, genes, labels, spatial_data = annData_converter(adata_sc, adata_st)

    # Create TopACT object
    sc_obj = CountMatrix(mtx, genes=genes)
    sc_obj.add_metadata("celltype", labels)

    # Train local classifier
    clf = SVCClassifier()
    train_from_countmatrix(clf, sc_obj, "celltype")

    # Passing in genes will automatically filter out genes that are not in the
    # single cell reference
    sd = spatial.CountGrid.from_coord_table(
        spatial_data, genes=genes, count_col="counts", gene_col="gene"
    )

    # Classify
    sd.classify_parallel(
        clf, min_scale=3, max_scale=9, num_proc=1, outfile="outfile.npy"
    )

    confidence_mtx = np.load("outfile.npy")

    annotations = spatial.extract_image(confidence_mtx, 0.5)

    np.savetxt("demo-output.txt", annotations, fmt="%1.f")

    plt.imshow(annotations, interpolation="None")
    plt.savefig("demo-output.png")

Would also be quite nice to use Optuna for hyperparameter optimiziation on the SVM classifier. Would be cool to test different kernels to try and capture non-linear relations

Rafael-Silva-Oliveira commented 1 week ago

As an update on this, and while the concept is impressive in theory, I'm not seeing it how one can apply this in a big dataset such as Visium HD with over 600k bins and 18k genes for example (even using the 16 micron resolution would be a struggle). This combination results in over 200 million rows on the spatial.csv file and even with a subset of 30k bins, it took 2 minutes and 38 seconds to go from the spatial.CountGrid.from_coord_table step to the classify_parallel step which then generated a npy file of 300mb in size. It has been stuck here for a while now on the classify_parallel method. And this was using 5% of the original dataset, so I'm afraid this approach isn't scalable to large datasets (I have another script running since yesterday with the full dataset and it hasn't left the from_coord_table yet)

Example of the subset of 30k bins with 18k genes (results in 10 million rows for the spatial.csv):

                          gene    x    y  counts
s_008um_00000_00061-1     Jak1    0   61     1.0
s_008um_00000_00061-1    Tstd1    0   61     1.0
s_008um_00000_00061-1     Akt3    0   61     1.0
s_008um_00000_00061-1    Ypel5    0   61     1.0
s_008um_00000_00061-1   Cartpt    0   61     2.0
...                        ...  ...  ...     ...
s_008um_00824_00131-1     Fbrs  824  131     1.0
s_008um_00824_00131-1   Dynll2  824  131     1.0
s_008um_00824_00131-1   Mt-nd4  824  131     1.0
s_008um_00832_00701-1  C7orf26  832  701     1.0
s_008um_00832_00701-1     Mlec  832  701     1.0

[10244001 rows x 4 columns]

One solution to this could be to just keep the highly differentiated genes (say, choose only the top 5000 highly variable genes), make sure MT and ribo genes are removed, perform some extra QC steps, etc, but it should be able to run quickly with at least 10k genes; I used scanpy filter highly variable genes choosing 5.000 genes and this created a spatial.csv file with 3.5 million rows:

                          gene    x    y  counts
s_008um_00000_00061-1     Akt3    0   61     1.0
s_008um_00000_00061-1    Ypel5    0   61     1.0
s_008um_00000_00061-1     Cbx3    0   61     1.0
s_008um_00000_00061-1    Mtif3    0   61     1.0
s_008um_00000_00061-1   Kctd12    0   61     1.0
...                        ...  ...  ...     ...
s_008um_00823_00471-1  Atp6v1d  823  471     1.0
s_008um_00824_00131-1    Thbs3  824  131     1.0
s_008um_00824_00131-1   Psmd14  824  131     1.0
s_008um_00824_00131-1     Nsa2  824  131     1.0
s_008um_00824_00131-1   Pnpla2  824  131     1.0

[3541686 rows x 4 columns]

But in some use cases we might want to run this with all or most genes, so having to subset this much in terms of genes isn't ideal

Very neat paper however and really great attempt at trying to solve a very present problem specially with the latest visium HD technology (single-cell spatial transcriptomics cell type mapping using a scRNA reference dataset)

Feel free to close this issue or keeping it open for other users to discuss more on this

Edit: here's the output from 3.5 million rows after 5 hours running a script (it's a subset, so the result means nothing, but naturally not scalable as it takes a too long to get the desired output). Also the outputs show the cell type integers, but not the decoded labels (the actual cell type names). Since the model isn't being saved, and the label encoder isn't saved either, I'm not entirely sure what each integer is which cell type

image

3 days later of running with 580k bins and 5k highly variable genes (not meaningful):

image

hahia commented 5 days ago

Hello,

I am encountering the same problem:

  1. The analysis takes a very long time to process all data (stereo-seq).
  2. The final annotations only provide integers instead of cell type labels.
  3. The tool does not segment to the cell level, only grouping bin1 into different integers.

I have a few questions regarding these issues:

  1. Is there a way to speed up the data analysis process?
  2. How can I get the actual cell type labels instead of integers in the final annotations?
  3. How can we perform cell-level analysis, such as cell-cell communication, if the tool only groups bin1 into different integers?

Thank you for your assistance.

Rafael-Silva-Oliveira commented 5 days ago

Hello,

I am encountering the same problem:

  1. The analysis takes a very long time to process all data (stereo-seq).
  2. The final annotations only provide integers instead of cell type labels.
  3. The tool does not segment to the cell level, only grouping bin1 into different integers.

I have a few questions regarding these issues:

  1. Is there a way to speed up the data analysis process?
  2. How can I get the actual cell type labels instead of integers in the final annotations?
  3. How can we perform cell-level analysis, such as cell-cell communication, if the tool only groups bin1 into different integers?

Thank you for your assistance.

I believe this would require quite a bit of code restructuring or would be only useful for very small and specific regions of the data. Regarding the labels, the SVM that is being trained under the hood should retrieve the label encoding, perhaps it is possible to get the labels from the trained clf. The classifier seems to also only use a linear kernel, so it won't capture more complex non-linear interactions I'm afraid (hence why I suggested some form of hyperparameter optimization). I think this work was more of a theoretical approach rather than practical/for real world datasets unfortunately :)

katherine-benjamin commented 5 days ago

Would also be quite nice to use Optuna for hyperparameter optimiziation on the SVM classifier. Would be cool to test different kernels to try and capture non-linear relations

I agree, there is certainly scope for more sophisticated local classifiers.

As an update on this, and while the concept is impressive in theory, I'm not seeing it how one can apply this in a big dataset such as Visium HD with over 600k bins and 18k genes for example (even using the 16 micron resolution would be a struggle).

I would agree that there is significant scope for more performant implementations of the method. In the publication we had no issue running the method on very large Stereo-seq data sets at the 500-700nm level with >20k genes, but of course results may vary depending on your machine. As noted below things can be sped up significantly by batching the spatial data to limit the memory usage at any one time.

I have a few questions regarding these issues:

  1. Is there a way to speed up the data analysis process?
  2. How can I get the actual cell type labels instead of integers in the final annotations?
  3. How can we perform cell-level analysis, such as cell-cell communication, if the tool only groups bin1 into different integers?

Thank you for your assistance.

All very good questions.

  1. The most reliable way to speed up the process is to split the spatial data set into smaller regions. It is e.g. much quicker to process 100 10x10 chunks than a single 100x100 chunk. This is a limitation of the implementation (it holds the entire spatial data set in memory and access reads from within that array) which could certainly be improved with some further engineering.
  2. Cell type labels are stored in the local classifier clf as clf.classes.
  3. You're correct that TopACT only performs a bin-1 classification. If the bin-1 classification is suitable then you can go on to perform further image processing to identify distinct cells, and some of the code in this repository (e.g. https://github.com/katherine-benjamin/topact-paper/blob/main/Figure%203/Extract_cells.ipynb) can give you some ideas on that.
hahia commented 4 days ago

All very good questions.

  1. The most reliable way to speed up the process is to split the spatial data set into smaller regions. It is e.g. much quicker to process 100 10x10 chunks than a single 100x100 chunk. This is a limitation of the implementation (it holds the entire spatial data set in memory and access reads from within that array) which could certainly be improved with some further engineering.
  2. Cell type labels are stored in the local classifier clf as clf.classes.
  3. You're correct that TopACT only performs a bin-1 classification. If the bin-1 classification is suitable then you can go on to perform further image processing to identify distinct cells, and some of the code in this repository (e.g. https://github.com/katherine-benjamin/topact-paper/blob/main/Figure%203/Extract_cells.ipynb) can give you some ideas on that.

Thank you for your reply.

For point 2, if the clf.classes are ['B_cell', 'CD4_T', 'CD8_T', 'Endothelial', 'Epithelial', 'Fibroblast', 'Macrophage', 'Mast', 'Monocyte', 'Plasma', 'Treg'], the corresponding integers will be: 0: 'B_cell', 1: 'CD4_T', 2: 'CD8_T', 3:'Endothelial', 4:'Epithelial', 5:'Fibroblast', 6:'Macrophage', 7:'Mast', 8:'Monocyte', 9:'Plasma', 10:'Treg'

Is that correct?