jadexq / ELLA

https://jadexq.github.io/ELLA/
2 stars 0 forks source link

Slow NHPP fitting #2

Open pakiessling opened 1 week ago

pakiessling commented 1 week ago

Hi Again, I tried out ELLA on our data and faced some issues while the demo data ran without problems. I subset my data to 50 cells (500 different features) but in 24h run time the fitting only managed to reach the 8% mark. Maybe I did something wrong?

Here is a snapshot of my input: grafik grafik

sc_total was not specified in the documentation but I assumed this was the total counts per cell.

I then ran ELLA like this:

import sys

sys.path.append(
    "../ELLA"
)
from ELLA.ELLA import ELLA
import pandas as pd
import time

def make_data_dict(expr, cell_seg):
    types = expr.type.unique()
    cells = expr.groupby("type")["cell"].agg(list).to_dict()
    cells_all = expr.cell.unique()
    genes = expr.groupby("type")["gene"].agg(list).to_dict()
    data = {
        "types": types,
        "cells": cells,
        "cells_all": cells_all,
        "genes": genes,
        "expr": expr,
        "cell_seg": cell_seg,
    }
    return data

expr = pd.read_parquet("ella_expr.parquet")
cell_seg = pd.read_parquet("ella_shapes.parquet")

# Only use CM cells
expr = expr[expr.type.isin(["CM"])]
print("Analyzing only CM cells")
cell_seg = cell_seg[cell_seg.cell.isin(expr.cell)]

# Subsample from expr.cell values
cell_nr = 50
print(f"Subsampling {cell_nr} cells")
random_values = cell_seg["cell"].drop_duplicates().sample(n=cell_nr, random_state=42)
cell_seg = cell_seg[cell_seg["cell"].isin(random_values)]
expr = expr[expr["cell"].isin(random_values)]

assert len(set(expr.cell) - set(cell_seg.cell)) == 0, "Not all cells have a shape"

print(
    f"Number of cells: {cell_seg.cell.nunique()}, number of genes: {len(expr.gene.unique())}"
)

start = time.time()

ella = ELLA(dataset="test_run")
ella.load_data(data_dict=make_data_dict(expr, cell_seg))
print("Data loaded")
ella.register_cells()
print("Cells registered")
ella.nhpp_prepare()
ella.nhpp_fit()
print("NHPP fitted")
ella.weighted_density_est()
ella.compute_pv()

ella.pattern_clustering()
ella.pattern_labeling(K=5)

stop = time.time()
print(f"Run time: {stop - start}")

Interestingly, ELLA printed

Average number of genes 24286.0 Average number of cells 24286.0

which I dont really understand.

Do you know what could be going wrong? And would be possible to run ELLA on the GPU?

jadexq commented 6 days ago

Thank you very much for your interest, and for sharing the details! The input looks very correct (Yes, sc_total is the total count per cell). And you code looks good. But, for the ELLA print -- Average number of cells 24286.0, I would expect this to be 50 as you included 50 cells, for example, in the make_data_dict, the length of cells['CM'] (a list of cell ids of CM cells) should be 50. If it is okay for you to share a portion of your data, I'd be very happy to test it out! As for running ELLA on GPU, it is possible, but we don't have a demo for that yet. We are working on a CPU paralleled version of ELLA that will be ready soon. And we are benchmarking CPU vs GPU parallel run times across settings, to see how much we can benefit from GPU parallel. Thank you!

pakiessling commented 5 days ago

The problem was with my prep code. Needed to take the set of the genes and cell ids.

def make_data_dict(expr, cell_seg):
    types = expr.type.unique()
    cells = expr.groupby("type")["cell"].agg(lambda x: list(set(x))).to_dict()
    cells_all = expr.cell.unique()
    genes = expr.groupby("type")["gene"].agg(lambda x: list(set(x))).to_dict()
    data = {
        "types": types,
        "cells": cells,
        "cells_all": cells_all,
        "genes": genes,
        "expr": expr,
        "cell_seg": cell_seg,
    }
    return data

Sorry about that 😅 Maybe consider this as a feature request of accepting just the two dataframes as input and calculating the dict internally, or maybe even cosider Spatialdata as allowed input.

jadexq commented 5 days ago

Got it! Thank you very much for the suggestions! We'll update the input style to improve this, and we'll work on incorporate Spatialdata. Thank you!