rajewsky-lab / novosparc


novosparc exhausts memory when anndata.X in sparse matrix format (CSC or CSR) #54

Open · sztankatt opened this issue 2 years ago

sztankatt commented 2 years ago

Hi, I've been trying to run novosparc with an anndata object whose matrix '.X' is stored in a sparse format. I tried both 'scipy.sparse.csc_matrix' and 'scipy.sparse.csr_matrix', and with either one the 'tissue.reconstruct' function hangs and requests enormous amounts of memory (I killed the process after it passed 250GB). If I first convert the sparse matrix into a regular 'numpy' array, there is no problem and novosparc runs.
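
(As a quick sanity check that the reproducer below really exercises the sparse path, 'scipy.sparse.issparse' reports how '.X' is stored. This check is my addition, not part of the original script:)

from scipy.sparse import issparse
print(issparse(dataset.X))  # True for csr/csc matrices, False once .X is a dense numpy array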

Below is a code example:


import numpy as np
import pandas as pd
import scanpy as sc

import novosparc

data_dir = '/data/local/rajewsky/home/tsztank/projects/spatial/repos/sts-paper/'
data_dir += 'projects/slideseq_v2/processed_data/slideseq_v2_mouse_hippo_new/'
data_dir += 'illumina/complete_data/automated_analysis/slideseq/umi_cutoff_50'

dataset = sc.read(f'{data_dir}/results.h5ad')
num_cells = 2000

# sc.pp.subsample(dataset, n_obs=num_cells)
# keep only barcodes with more than 1000 total counts
dataset = dataset[dataset.obs.total_counts > 1000, :]

gene_names = dataset.var.index.tolist()

num_cells, num_genes = dataset.shape

is_var_gene = dataset.var['highly_variable']
# keep only the first 100 highly variable genes
var_genes = list(is_var_gene.index[is_var_gene])[:100]

dge_rep = dataset.to_df()[var_genes]

from scipy.sparse import csc_matrix, csr_matrix

# if you uncomment this, dataset.X will be stored as a regular numpy array
# instead of a sparse matrix. with this in place, no error is thrown and
# reconstruction runs fine
# M = dataset.X.toarray()
# del dataset.X
# dataset.X = M

num_locations = 1000
locations_circle = novosparc.gm.construct_circle(num_locations=num_locations)

# build the Tissue object; dataset.X is still a scipy sparse matrix at this point
tissue = novosparc.cm.Tissue(dataset=dataset, locations=locations_circle)

num_neighbors_s = num_neighbors_t = 5

tissue.setup_smooth_costs(dge_rep=dge_rep, num_neighbors_s=num_neighbors_s,
                          num_neighbors_t=num_neighbors_t)

# this is the call that hangs and exhausts memory when dataset.X is sparse
tissue.reconstruct(alpha_linear=0, epsilon=5e-3)
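
For completeness, the workaround I'm using in the meantime is to densify '.X' up front, so novosparc never sees a sparse matrix. A minimal sketch (my own code, not part of the original script; it assumes the same 'dataset', 'locations_circle' and 'dge_rep' as above):

from scipy.sparse import issparse

# convert .X to a dense numpy array before building the Tissue
if issparse(dataset.X):
    dataset.X = dataset.X.toarray()

tissue = novosparc.cm.Tissue(dataset=dataset, locations=locations_circle)
tissue.setup_smooth_costs(dge_rep=dge_rep, num_neighbors_s=num_neighbors_s,
                          num_neighbors_t=num_neighbors_t)
tissue.reconstruct(alpha_linear=0, epsilon=5e-3)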