immunogenomics / cna-sim

Simulation code for CNA evaluation
GNU General Public License v3.0

About sampleXmeta.C #1

Open mugpeng opened 2 years ago

mugpeng commented 2 years ago

Hi,

I recently read your work and am also interested in perturbation-related research. I was trying to reproduce your procedure for creating the simulation data and ground truth.

However, I have run into a problem with sim_null.

Your MultiAnnData object, which holds the single-cell data from tbru, has an element named C, but I could not find a column named C in the metadata from GEO (it only has batch information). What should I do to the GEO data and its metadata to build an object like yours that can run the sim_null script?

My simple process is below.

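from os.path import expanduser

import pandas as pd
import scanpy as sc
import multianndata as mad

# expand "~" explicitly so the file readers receive an absolute path
path = expanduser("~/2.data/1-cell_state_single_cell/4-TBRU500k_tuberculosis")
# read the normalized expression table from GEO
harmcca20 = sc.read_csv(f"{path}/GSE158769_exprs_norm.tsv", delimiter="\t")
# read the GEO metadata table
metadata = pd.read_csv(f"{path}/GSE158769_meta_data.txt.gz", sep="\t")
# create the MultiAnnData object, attaching the metadata as samplem
harmcca20 = mad.MultiAnnData(X=harmcca20, samplem=metadata)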

Thanks.

rumker commented 2 years ago

Hi there! Thank you for your interest in our work and for your query. We use C to store the total number of cells in each sample.
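For readers who only have per-cell metadata, here is a minimal sketch of how such a per-sample cell count could be derived with pandas. It is not taken from the cna-sim code: the toy table, the `donor` sample-ID column, and the helper `cells_per_sample` are hypothetical names chosen for illustration, and the real GEO metadata may label samples differently.

```python
import pandas as pd

# Toy per-cell metadata. "donor" is a hypothetical sample-ID column name;
# substitute whichever column identifies the sample each cell came from.
cell_meta = pd.DataFrame(
    {"donor": ["s1", "s1", "s2", "s1", "s3", "s2"]},
    index=[f"cell{i}" for i in range(6)],
)


def cells_per_sample(obs: pd.DataFrame, sample_col: str) -> pd.DataFrame:
    """Return a one-row-per-sample table whose column C counts the cells per sample."""
    counts = obs.groupby(sample_col).size().rename("C")
    return counts.to_frame()


samplem = cells_per_sample(cell_meta, "donor")
print(samplem)
```

The resulting table is indexed by sample ID, so it could be merged with any other sample-level metadata before being attached to the data object.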

mugpeng commented 2 years ago

Thanks for your reply! Could you also share the scripts you used to process the tbru, fibroblast, and pbmc datasets?

For example, the GEO data only gives me a count file, which needs to be split into RNA and protein modalities that are later integrated (CCA), as you described in the paper. Besides that, there are also some quantities you calculate, such as C and possibly others.

I am asking because it is hard for me to find clues in cna-display and cna-sim, where I could only find the steps you use to create simulation data for the tbru dataset.

rumker commented 2 years ago

Hi @mugpeng! We received these data objects from the study authors with substantial processing from their source publications (e.g. a multimodal CCA embedding for the TB dataset, cluster assignments for all three datasets). I believe the primary pre-processing you'll want in a data object you feed to one of our sim.py scripts includes: 1) total cells per sample stored in d.samplem.C, 2) cluster assignments stored in data.obs under a label of your choice that you feed as the causal_clustering argument, 3) pre-processing with scanpy to construct a PCA embedding and nearest-neighbor graph of the cells, and 4) pre-processing with CNA to construct a neighborhood abundance matrix.
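Items 2 and 3 of the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the processing used for the published objects: the helper names `cluster_sizes` and `embed_and_graph`, the `leiden` label, and the parameter values are hypothetical, and the scanpy step assumes the object holds a normalized expression matrix rather than a precomputed embedding. Item 4 is left to the CNA package and is not shown.

```python
import pandas as pd


def cluster_sizes(obs: pd.DataFrame, causal_clustering: str) -> pd.Series:
    """Check that a cluster-label column is usable and return cells per cluster."""
    if causal_clustering not in obs.columns:
        raise KeyError(f"no column named {causal_clustering!r} in obs")
    labels = obs[causal_clustering]
    if labels.isna().any():
        raise ValueError("some cells have no cluster assignment")
    return labels.value_counts().sort_index()


def embed_and_graph(adata, n_pcs: int = 20, n_neighbors: int = 15):
    """Add a PCA embedding and a nearest-neighbor graph to an AnnData-like object."""
    import scanpy as sc  # imported lazily so the check above works without scanpy

    sc.pp.pca(adata, n_comps=n_pcs)  # stores the embedding in adata.obsm["X_pca"]
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, use_rep="X_pca")
    return adata


# Toy per-cell table; "leiden" is a hypothetical label passed as causal_clustering.
obs = pd.DataFrame({"leiden": ["0", "1", "0", "2", "1", "0"]})
sizes = cluster_sizes(obs, "leiden")
print(sizes)
```

Running the label check first is a cheap way to catch a missing or partially filled cluster column before the more expensive embedding and graph steps.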