ttgump / spaVAE

Dependency-aware deep generative models for multitasking analysis of spatial genomics data
Apache License 2.0

How to run on a custom dataset? #1

Open PSSUN opened 2 days ago

PSSUN commented 2 days ago

I ran spaVAE from the command line:

python run_spaVAE.py --data_file output_file.h5 --device cpu --inducing_point_steps 6

and I got two output files: denoised_counts.txt and final_latent.txt

The documentation doesn't explain how to work with this command-line output. How can I view my results? I checked the code in the .ipynb notebook provided in the tutorial, but it doesn't match the command-line output. For example, part of the tutorial code for the DLPFC151673 dataset is as follows:

y = np.array(data_mat['Y']).astype('U26') # ground-truth labels

The # ground-truth labels line requires labels to be present in the original data file, but an unanalyzed dataset has no such labels. How do I view my results in that case?

How can I deal with denoised_counts.txt and final_latent.txt?

PSSUN commented 1 day ago

I changed the code to remove the part that calculates the ARI (adjusted Rand index), which needs ground-truth labels, and now I can get results. If you have a similar question, you can refer to the following code:

import numpy as np
import pandas as pd
import h5py
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics import pairwise_distances
### refine clustering labels by the majority vote of spatial neighbors
def refine(sample_id, pred, dis, shape="square"):
    refined_pred = []
    pred = pd.DataFrame({"pred": pred}, index=sample_id)
    dis_df = pd.DataFrame(dis, index=sample_id, columns=sample_id)
    if shape == "hexagon":
        num_nbs = 6   # Visium arrays have hexagonal neighborhoods
    elif shape == "square":
        num_nbs = 4   # legacy ST arrays have square neighborhoods
    else:
        raise ValueError("Shape not recognized: use 'hexagon' for Visium data, 'square' for ST data.")
    for i in range(len(sample_id)):
        index = sample_id[i]
        # the nearest num_nbs spots (plus the spot itself at distance 0)
        dis_tmp = dis_df.loc[index, :].sort_values()
        nbs = dis_tmp.iloc[0:(num_nbs + 1)]
        nbs_pred = pred.loc[nbs.index, "pred"]
        self_pred = pred.loc[index, "pred"]
        v_c = nbs_pred.value_counts()
        # reassign a spot when its own label is a local minority and
        # another label holds a clear majority among its neighbors
        if (v_c.loc[self_pred] < num_nbs / 2) and (np.max(v_c) > num_nbs / 2):
            refined_pred.append(v_c.idxmax())
        else:
            refined_pred.append(self_pred)
        if (i + 1) % 1000 == 0:
            print("Processed", i + 1, "spots")
    return np.array(refined_pred)
# load the spatial coordinates from the original input file
data_mat = h5py.File('output_file.h5', 'r')
pos = np.array(data_mat['pos']).astype('float64')  # spot coordinates, needed for refinement
data_mat.close()

# cluster the spaVAE latent embedding; n_clusters is dataset-dependent
final_latent = np.loadtxt("./final_latent.txt", delimiter=",")
pred = KMeans(n_clusters=7, n_init=100).fit_predict(final_latent)
np.savetxt("clustering_labels.txt", pred, delimiter=",", fmt="%i")

# refine labels by neighborhood majority vote; use shape="hexagon" for Visium
dis = pairwise_distances(pos, metric="euclidean", n_jobs=-1).astype(np.double)
pred_refined = refine(np.arange(pred.shape[0]), pred, dis, shape="hexagon")
np.savetxt("refined_clustering_labels.txt", pred_refined, delimiter=",", fmt="%i")