Hi Rafael,
The df format should be one row per transcript, with three columns for x,y coords and gene ID.
You can see an example input format here: https://gitlab.com/kfbenjamin/topact/-/blob/master/topact-demo/spatial.csv
So e.g. for Stereo-seq, although we have on the order of millions of spots and tens of thousands of genes, a typical input file will only have millions of rows (1-2 transcripts per spot).
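Illustratively, the long format has one row per transcript (made-up rows here, just to show the shape; see the linked CSV for the real header names):

```python
import pandas as pd

# One row per transcript: x,y coordinates plus the gene ID.
# A spot with two transcripts of the same gene simply appears twice
# (or once with a count column, as in the examples later in this thread).
df = pd.DataFrame(
    {
        "x": [0, 0, 1],
        "y": [61, 61, 62],
        "gene": ["Jak1", "Akt3", "Ypel5"],
    }
)
```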
Hope this makes sense!
Hi, thank you for the reply!
Any idea on how to convert a dataframe like this one to the desired long format:
I'm trying to replace 0s with NaNs and then drop them while stacking into the long format, but it always gives a memory error.
The spot ID is pretty much the same as the x and y coordinates (I just need to merge that info later on once I have the long format). Then I'd have the transcripts per coordinate with a count >= 1. But I can't just do that filtering on the wide format; I'd need to stack it and then filter (or find some other way in between), and the dataframe gets way too large to do anything with.
I've updated the function so that it returns the necessary components to run TopACT (mtx, genes, labels and the spatial information) from scanpy/anndata. I had to loop through each gene, stack it in long format, and then concatenate everything into a single dataframe at the end, removing the 0 counts. The concatenated dataframe has around 205 million rows:

```
index                   gene      x    y  count
s_008um_00634_00647-1   Samd11  634  647    1.0
s_008um_00190_00220-1   Samd11  190  220    1.0
s_008um_00630_00180-1   Samd11  630  180    1.0
s_008um_00257_00478-1   Samd11  257  478    1.0
s_008um_00651_00616-1   Samd11  651  616    1.0
...                        ...  ...  ...    ...
s_008um_00252_00421-1   Mt-cyb  252  421    1.0
s_008um_00384_00793-1   Mt-cyb  384  793    1.0
s_008um_00653_00166-1   Mt-cyb  653  166    1.0
s_008um_00565_00244-1   Mt-cyb  565  244    4.0
s_008um_00353_00477-1   Mt-cyb  353  477    4.0
```
It can be a bit slow, however, depending on the number of genes a user has.
This introduces new dependencies: tqdm, scanpy, anndata, and loguru
```python
import pandas as pd
from scipy import io
from topact.countdata import CountMatrix
from topact.classifier import SVCClassifier, train_from_countmatrix
from topact import spatial
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import scipy.sparse
from loguru import logger
from tqdm import tqdm
import anndata as ad


def readfile(filename):
    with open(filename) as f:
        return [line.rstrip() for line in f]


adata_st = sc.read_h5ad(r"/path/to/spatial_anndata/raw_adata.h5ad")
adata_sc = sc.read_h5ad(r"/path/to/scRNA_anndata/raw_adata.h5ad")


def annData_converter(
    sc_adata: ad.AnnData,
    st_adata: ad.AnnData,
    sc_annotation: str = "ann_level_3_transferred_label",
):
    """
    Converts single-cell and spatial transcriptomics data into a format suitable for further analysis.

    This function makes gene names unique, extracts the count matrix, gene names, and labels from the
    single-cell data, creates a wide-format count matrix from the spatial data, stacks each individual
    gene in long format, and concatenates all stacked genes into a single long-format dataframe.

    Args:
        sc_adata (anndata.AnnData): The single-cell RNA-seq data in an AnnData object.
        st_adata (anndata.AnnData): The spatial transcriptomics data in an AnnData object.
        sc_annotation (str, optional): The annotation level to use from the single-cell data.
            Defaults to "ann_level_3_transferred_label".

    Returns:
        mtx (scipy.sparse.coo_matrix): The count matrix from the single-cell data.
        genes (list): The list of gene names.
        labels (list): The list of labels.
        counts_stacked (pandas.DataFrame): The concatenated dataframe of all stacked genes in long format.
    """
    logger.info("Making gene names unique for both the scRNA and ST datasets.")
    sc_adata.var_names_make_unique()
    st_adata.var_names_make_unique()

    # Extract the count matrix from the single-cell data
    mtx = scipy.sparse.coo_matrix(sc_adata.X).astype("int64")
    # Extract the gene names
    genes: list = sc_adata.var_names.tolist()
    # Extract the labels
    labels: list = sc_adata.obs[sc_annotation].tolist()

    # Extract the columns from st_adata.obs that contain the x,y coordinates (row = X, col = Y)
    spatial_data = st_adata.obs[["array_row", "array_col"]]

    logger.info(
        "Creating a wide-format count matrix from spatial data with genes as columns and cell IDs as rows."
    )
    # Create the DataFrame in wide format
    counts = pd.DataFrame.sparse.from_spmatrix(
        st_adata.X,
        columns=st_adata.var_names,
        index=spatial_data.index,
    )

    # Create an empty list to collect each stacked gene in long format
    stacked_columns = []
    logger.info("Stacking each individual gene in long format")
    for column in tqdm(counts.columns):
        stacked_column = (
            counts[column]
            .sparse.to_dense()
            .replace(0, np.nan)
            .dropna()
            .rename("counts")
            .to_frame()
        )
        stacked_column["gene"] = column
        merged_stacked_column = (
            stacked_column.merge(spatial_data, left_index=True, right_index=True)
            .rename(columns={"array_row": "x", "array_col": "y"})
            .reindex(columns=["gene", "x", "y", "counts"])
        )
        # Append the current gene to the list of stacked columns
        stacked_columns.append(merged_stacked_column)

    logger.info("Concatenating all stacked genes into a single long-format dataframe")
    # Concatenate all stacked genes into a single dataframe with columns gene, x, y, counts
    counts_stacked = pd.concat(stacked_columns)

    return mtx, genes, labels, counts_stacked


if __name__ == "__main__":
    # Convert AnnData to the necessary formats
    mtx, genes, labels, spatial_data = annData_converter(adata_sc, adata_st)

    # Create TopACT object
    sc_obj = CountMatrix(mtx, genes=genes)
    sc_obj.add_metadata("celltype", labels)

    # Train local classifier
    clf = SVCClassifier()
    train_from_countmatrix(clf, sc_obj, "celltype")

    # Passing in genes will automatically filter out genes that are not in the
    # single-cell reference
    sd = spatial.CountGrid.from_coord_table(
        spatial_data, genes=genes, count_col="counts", gene_col="gene"
    )

    # Classify
    sd.classify_parallel(
        clf, min_scale=3, max_scale=9, num_proc=1, outfile="outfile.npy"
    )
    confidence_mtx = np.load("outfile.npy")
    annotations = spatial.extract_image(confidence_mtx, 0.5)
    np.savetxt("demo-output.txt", annotations, fmt="%1.f")
    plt.imshow(annotations, interpolation="None")
    plt.savefig("demo-output.png")
```
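A possible speedup for the per-gene stacking loop above: since `st_adata.X` is already sparse, the long format can be read directly off the matrix's COO coordinates, so zero counts are never materialised and no loop over genes is needed. A minimal sketch (the `wide_to_long_sparse` helper is hypothetical, and it assumes `array_row`/`array_col` hold the x/y coordinates as in the function above):

```python
import pandas as pd
import scipy.sparse


def wide_to_long_sparse(st_adata):
    # The nonzero entries of the sparse matrix are exactly the long-format rows
    coo = scipy.sparse.coo_matrix(st_adata.X)
    genes = st_adata.var_names.to_numpy()
    xs = st_adata.obs["array_row"].to_numpy()
    ys = st_adata.obs["array_col"].to_numpy()
    return pd.DataFrame(
        {
            "gene": genes[coo.col],  # gene name for each nonzero entry
            "x": xs[coo.row],        # spot/bin x coordinate
            "y": ys[coo.row],        # spot/bin y coordinate
            "counts": coo.data,      # the nonzero count itself
        },
        index=st_adata.obs_names[coo.row],  # keep the bin ID as the index
    )
```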
Would also be quite nice to use Optuna for hyperparameter optimization on the SVM classifier. Would be cool to test different kernels to try and capture non-linear relations.
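For what it's worth, a minimal sketch of what that could look like with Optuna and scikit-learn's SVC directly (synthetic stand-in data here; in practice X would be the single-cell counts and y the cell type labels, and this is not TopACT's own API):

```python
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in data; replace with the single-cell count matrix and labels.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 50))
y = rng.integers(0, 3, size=200)


def objective(trial):
    # Search over the kernel and the regularisation strength C.
    kernel = trial.suggest_categorical("kernel", ["linear", "rbf", "poly"])
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    clf = SVC(kernel=kernel, C=C)
    return cross_val_score(clf, X, y, cv=3).mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```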
As an update on this, and while the concept is impressive in theory, I'm not seeing how one can apply this to a big dataset such as Visium HD with over 600k bins and 18k genes, for example (even using the 16 micron resolution would be a struggle). This combination results in over 200 million rows in the spatial.csv file, and even with a subset of 30k bins it took 2 minutes and 38 seconds to go from the `spatial.CountGrid.from_coord_table` step to the `classify_parallel` step, which then generated a npy file of 300 MB in size. It has been stuck on the `classify_parallel` method for a while now. And this was using 5% of the original dataset, so I'm afraid this approach isn't scalable to large datasets (I have another script running since yesterday with the full dataset and it hasn't left `from_coord_table` yet).
Example of the subset of 30k bins with 18k genes (results in 10 million rows for the spatial.csv):
```
index                   gene        x    y  counts
s_008um_00000_00061-1   Jak1        0   61     1.0
s_008um_00000_00061-1   Tstd1       0   61     1.0
s_008um_00000_00061-1   Akt3        0   61     1.0
s_008um_00000_00061-1   Ypel5       0   61     1.0
s_008um_00000_00061-1   Cartpt      0   61     2.0
...                     ...       ...  ...     ...
s_008um_00824_00131-1   Fbrs      824  131     1.0
s_008um_00824_00131-1   Dynll2    824  131     1.0
s_008um_00824_00131-1   Mt-nd4    824  131     1.0
s_008um_00832_00701-1   C7orf26   832  701     1.0
s_008um_00832_00701-1   Mlec      832  701     1.0

[10244001 rows x 4 columns]
```
One solution could be to keep only the highly differentiated genes (say, the top 5,000 highly variable genes), make sure MT and ribosomal genes are removed, perform some extra QC steps, etc., but it should be able to run quickly with at least 10k genes. I used scanpy to filter to the top 5,000 highly variable genes (see the sketch after the table below), and this created a spatial.csv file with 3.5 million rows:
```
index                   gene        x    y  counts
s_008um_00000_00061-1   Akt3        0   61     1.0
s_008um_00000_00061-1   Ypel5       0   61     1.0
s_008um_00000_00061-1   Cbx3        0   61     1.0
s_008um_00000_00061-1   Mtif3       0   61     1.0
s_008um_00000_00061-1   Kctd12      0   61     1.0
...                     ...       ...  ...     ...
s_008um_00823_00471-1   Atp6v1d   823  471     1.0
s_008um_00824_00131-1   Thbs3     824  131     1.0
s_008um_00824_00131-1   Psmd14    824  131     1.0
s_008um_00824_00131-1   Nsa2      824  131     1.0
s_008um_00824_00131-1   Pnpla2    824  131     1.0

[3541686 rows x 4 columns]
```
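The HVG subsetting above was roughly the following (a minimal sketch; the `seurat_v3` flavor is my assumption here, since it works on raw counts):

```python
import scanpy as sc

# Keep only the top 5,000 highly variable genes before stacking to long format.
# flavor="seurat_v3" expects raw counts; other flavors expect log-normalised data.
sc.pp.highly_variable_genes(adata_st, n_top_genes=5000, flavor="seurat_v3")
adata_st = adata_st[:, adata_st.var["highly_variable"]].copy()
```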
But in some use cases we might want to run this with all or most genes, so having to subset the genes this much isn't ideal.
Very neat paper, however, and a really great attempt at solving a very current problem, especially with the latest Visium HD technology (single-cell spatial transcriptomics cell type mapping using a scRNA reference dataset).
Feel free to close this issue or keep it open for other users to discuss further.
Edit: here's the output from the 3.5 million rows after running a script for 5 hours (it's a subset, so the result means nothing, but it's naturally not scalable as it takes too long to get the desired output). Also, the outputs show the cell type integers but not the decoded labels (the actual cell type names). Since neither the model nor the label encoder is saved, I'm not entirely sure which integer corresponds to which cell type.
Three days later, still running with 580k bins and 5k highly variable genes (not meaningful):
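(One way to keep the integer-to-label mapping around would be to persist the trained classifier and its class ordering; a sketch, assuming `clf` pickles cleanly and exposes `clf.classes` as mentioned further down in this thread:)

```python
import joblib
import numpy as np

# Persist the trained classifier and its class ordering so the integer
# annotations can be decoded in a later session.
joblib.dump(clf, "topact_clf.joblib")
np.save("clf_classes.npy", np.asarray(clf.classes))
```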
Hello,
I am encountering the same problem:
- The analysis takes a very long time to process all data (stereo-seq).
- The final annotations only provide integers instead of cell type labels.
- The tool does not segment to the cell level, only grouping bin1 into different integers.
I have a few questions regarding these issues:
- Is there a way to speed up the data analysis process?
- How can I get the actual cell type labels instead of integers in the final annotations?
- How can we perform cell-level analysis, such as cell-cell communication, if the tool only groups bin1 into different integers?
Thank you for your assistance.
I believe this would require quite a bit of code restructuring, or it would only be useful for very small and specific regions of the data. Regarding the labels, the SVM being trained under the hood should retain the label encoding, so it may be possible to get the labels from the trained clf. The classifier also seems to use only a linear kernel, so it won't capture more complex non-linear interactions, I'm afraid (hence why I suggested some form of hyperparameter optimization). I think this work was more of a theoretical approach rather than a practical one for real-world datasets, unfortunately :)
> Would also be quite nice to use Optuna for hyperparameter optimization on the SVM classifier. Would be cool to test different kernels to try and capture non-linear relations.
I agree, there is certainly scope for more sophisticated local classifiers.
> As an update on this, and while the concept is impressive in theory, I'm not seeing how one can apply this to a big dataset such as Visium HD with over 600k bins and 18k genes, for example (even using the 16 micron resolution would be a struggle).
I would agree that there is significant scope for more performant implementations of the method. In the publication we had no issue running the method on very large Stereo-seq datasets at the 500-700nm level with >20k genes, but of course results may vary depending on your machine. As noted below, things can be sped up significantly by batching the spatial data to limit the memory usage at any one time.
> I have a few questions regarding these issues:
> - Is there a way to speed up the data analysis process?
> - How can I get the actual cell type labels instead of integers in the final annotations?
> - How can we perform cell-level analysis, such as cell-cell communication, if the tool only groups bin1 into different integers?
> Thank you for your assistance.
All very good questions.
- The most reliable way to speed up the process is to split the spatial data set into smaller regions: it is e.g. much quicker to process 100 10x10 chunks than a single 100x100 chunk (see the sketch after this list). This is a limitation of the implementation (it holds the entire spatial data set in memory and reads from within that array) which could certainly be improved with some further engineering.
- Cell type labels are stored in the local classifier `clf` as `clf.classes`.
- You're correct that TopACT only performs a bin-1 classification. If the bin-1 classification is suitable then you can go on to perform further image processing to identify distinct cells, and some of the code in this repository (e.g. https://github.com/katherine-benjamin/topact-paper/blob/main/Figure%203/Extract_cells.ipynb) can give you some ideas on that.
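As a rough illustration of the chunking idea in the first point (a sketch only; it writes one output file per window and does not handle stitching the windows back together or boundary effects):

```python
from topact import spatial


def classify_in_chunks(df, clf, genes, chunk=100):
    # df is the long-format table with columns gene, x, y, counts
    for x0 in range(0, int(df["x"].max()) + 1, chunk):
        for y0 in range(0, int(df["y"].max()) + 1, chunk):
            # Select the transcripts falling in this coordinate window
            sub = df[
                df["x"].between(x0, x0 + chunk - 1)
                & df["y"].between(y0, y0 + chunk - 1)
            ]
            if sub.empty:
                continue
            sd = spatial.CountGrid.from_coord_table(
                sub, genes=genes, count_col="counts", gene_col="gene"
            )
            # One confidence matrix per window
            sd.classify_parallel(
                clf, min_scale=3, max_scale=9, num_proc=1,
                outfile=f"chunk_{x0}_{y0}.npy",
            )
```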
Thank you for your reply.
For point 2, if the clf.classes are ['B_cell', 'CD4_T', 'CD8_T', 'Endothelial', 'Epithelial', 'Fibroblast', 'Macrophage', 'Mast', 'Monocyte', 'Plasma', 'Treg'], the corresponding integers will be: 0: 'B_cell', 1: 'CD4_T', 2: 'CD8_T', 3:'Endothelial', 4:'Epithelial', 5:'Fibroblast', 6:'Macrophage', 7:'Mast', 8:'Monocyte', 9:'Plasma', 10:'Treg'
Is that correct?
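If that ordering holds, decoding the annotation image would be something like this (a sketch, assuming unclassified pixels come out as NaN):

```python
import numpy as np

# Map integer annotations (e.g. from spatial.extract_image) back to cell
# type names, assuming the integers index clf.classes in order.
classes = np.asarray(clf.classes)
annotations = np.loadtxt("demo-output.txt")
mask = np.isfinite(annotations)
decoded = np.full(annotations.shape, "unclassified", dtype=object)
decoded[mask] = classes[annotations[mask].astype(int)]
```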
I'm trying to create a function that extracts the necessary information directly from AnnData files.
But the df for spatial seems like an odd choice? If I have 17,000 genes with 600k bins/coordinates, then we'll have a 600,000 x 17,000 dataframe (10,200,000,000 rows once stacked). This will naturally result in memory errors. Is there any other way to do this that is not as memory intensive? I was quite happy to see the mention of Visium HD, but it doesn't seem to be quite compatible.
Thanks