morris-lab / CellOracle

This is the alpha version of the CellOracle package
Other
306 stars 53 forks source link

TFs ranked by negative PS sum #107

Open sneddonucsf opened 1 year ago

sneddonucsf commented 1 year ago

Hello,

Fantastic analysis package and documentation - I am very interested in delving deeper into its capabilities!

I would like to rank TFs based on their negative PS Sum values, like you presented in Figure 5a of the Nature publication. Would you be able to provide code for how you did this with the data in those figures? I would like to perform something similar for my dataset. Thank you very much and congrats on the publication!

sneddonucsf commented 1 year ago

I think also a tutorial for something like Figure 1i would be greatly appreciated!

KenjiKamimoto-ac commented 1 year ago

@sneddonucsf

I'm Kenji, author/developer of CellOracle. Thank you very much! Yes, I'm happy to share codes for the systematic perturbation and PS sum calculation. I have not included the code for this part in the tutorial so far because it is a bit complicated and messy. Please let me organize the codes. Also, I'd like to add some comments and explanations before sharing them. I will share them in a week.

sneddonucsf commented 1 year ago

@KenjiKamimoto-wustl122 that sounds great, thanks!

KenjiKamimoto-ac commented 1 year ago

@sneddonucsf

I shared my notebooks for the systematic simulation analysis. https://github.com/morris-lab/CellOracle/tree/master/docs/notebooks/05_simulation/misc/scale_up_simulation

The original notebooks were messy, so I put some functions into CellOracle package to organize them nicely. Can you please update celloracle to version 0.12.0 and try the notebook in the folder above?

Kenji

sneddonucsf commented 1 year ago

@KenjiKamimoto-wustl122 thank you so much, these worked great! One question - you mention that finding the hyperparameters before doing the systematic perturbations is key. Which parameter are those, just the ones that we set in when analyzing a single gene (like in the tutorial)? Does the vm parameter matter, or is that more for visualization for the inner product scores?

KenjiKamimoto-ac commented 1 year ago

@sneddonucsf

I'm sorry for the lack of explanation about that. And thank you for the great question!

1) The most critical parameter is dimensional reduction/trajectory data. CellOracle plots vectors on the 2D trajectory embedding space, and we need to ensure the structure represents the biologically appropriate structure. There are a lot of hyper-parameters when we create the trajectory embedding, and it requires biological knowledge to evaluate it. For example, our tutorial used a force-directed graph and got a branching tree-like structure that represents a hematopoiesis hierarchical differentiation. I recommend using a force-directed graph, but it may not be effective for some scRNA-seq data. Instead, you may use UMAP or TSNE. When you want to use PS score metrics, a tree-like trajectory graph structure will give robust results because the differentiation flow can be simpler and clearer on the tree-like structure. If the trajectory structure is an uncertain shape or scatter shape, the vector of differentiation will not make sense on the trajectory; thus, the PS score will be messy.

2) Regarding the point above, please evaluate your vector map for the developmental graph. If the development vector map does not make sense, we cannot interpret the PS score.

3) The next important parameter is n_neighbors parameter in this step. https://morris-lab.github.io/CellOracle.documentation/notebooks/05_simulation/Gata1_KO_simulation_with_Paul_etal_2015_data.html#3.-In-silico-TF-perturbation-analysis The parameter controls the number of local neighbor cells used in the calculation of vectors. Please start with n_neighbors =200, and increase the number. In general, I recommend using a higher number if you have a large data.

4) vm parameter controls scales of color map, it will change visualization, but it does not affect TF ranking based on PS values.

sneddonucsf commented 1 year ago

@KenjiKamimoto-wustl122 that all makes sense, thank you very much! Sorry, last question. In your Nature publication, you use the randomized simulation to calculate a FDR cutoff for the systemic perturbations (i.e. Extended Data Figure 3G). Is that required for all systemic perturbation analyses, or just something extra you included?

KenjiKamimoto-ac commented 1 year ago

@sneddonucsf

It was just an extra thing. It is not required for the all systematic perturbation analysis.

Best, Kenji

jfischerwue commented 1 year ago

Hello @KenjiKamimoto-wustl122,

first of all, thank you for this amazing tool! I am currently trying to analyze a T cell dataset of synovial fluid and it worked great so far! I tried KO simulation for several TFs. I am only having trouble with the systematic KO simulation analysis. when I run the pipeline function as in your Notebook, I get a ps_sum of -0.0 for all analyzed TFs. I get the same result with the tutorial data from Paul et al. Did I do something wrong or is anyone else having the same issue?

Thanks a lot for your help!

Best Jonas

As an example i analyzed only Gata1 with the pipeline (tutorial data from Paul et al.):

import copy import glob import importlib import time import os import shutil import sys from importlib import reload import matplotlib.colors as colors import matplotlib.pyplot as plt import numpy as np import pandas as pd import scanpy as sc import seaborn as sns from tqdm.notebook import tqdm

import celloracle as co from celloracle.applications import Oracle_development_module, Oracle_systematic_analysis_helper co.__version__

oracle = co.data.load_tutorial_oracle_object() links = co.data.load_tutorial_links_object()

gradient = co.load_hdf5("./Paul_etal.celloracle.gradient")

print(oracle.adata.shape, gradient.embedding.shape) assert((oracle.adata.obsm[oracle.embedding_name] == gradient.embedding).all())

--> (2671, 1999) (2671, 2)

links.filter_links() oracle.get_cluster_specific_TFdict_from_Links(links_object=links) oracle.fit_GRN_for_simulation(alpha=10, use_cluster_specific_TFdict=True)

# Get cell_id cell_idx_Lineage_GMP = np.where(oracle.adata.obs["louvain_annot"].isin([ 'GMP_0', 'GMP_1', 'GMP_2', 'GMPl_0', 'GMPl_1', 'Gran_0', 'Gran_1', 'Gran_2', 'Gran_3', 'Mo_0', 'Mo_1', 'Mo_2']))[0] cell_idx_Lineage_MEP = np.where(oracle.adata.obs["louvain_annot"].isin([ 'Ery_0', 'Ery_1', 'Ery_2', 'Ery_3', 'Ery_4', 'Ery_5', 'Ery_6', 'Ery_7', 'Ery_8', 'Ery_9', 'MEP_0', 'Mk_0']))[0] cell_idx_granulocytes = np.where(oracle.adata.obs["louvain_annot"].isin(['GMPl_1', 'Gran_0', 'Gran_1', 'Gran_2', 'Gran_3']))[0] cell_idx_monocytes = np.where(oracle.adata.obs["louvain_annot"].isin(['GMPl_0', 'Mo_0', 'Mo_1', 'Mo_2']))[0]

# Make dictionary to store the cell index list index_dictionary = {"Whole_cells": None, "Lineage_GM": cell_idx_Lineage_GMP, "Lineage_ME": cell_idx_Lineage_MEP, "Granulocytes": cell_idx_granulocytes, "Monocytes": cell_idx_monocytes}

Pipeline Function # 0. Define parameters n_propagation = 3 n_neighbors=200

file_path = "Systematic_simulation_results_Paul_data.celloracle.hdf5" # Please use .hdf5 for extension.

def pipeline(gene_for_KO):

# 1. Simulate KO oracle.simulate_shift(perturb_condition={gene_for_KO: 0}, ignore_warning=True, n_propagation=3) oracle.estimate_transition_prob(n_neighbors=n_neighbors, knn_random=True, sampled_fraction=1) oracle.calculate_embedding_shift(sigma_corr=0.05)

# Do simulation for all conditions. for lineage_name, cell_idx in index_dictionary.items():

dev = Oracle_development_module() # Load development flow dev.load_differentiation_reference_data(gradient_object=gradient) # Load simulation result dev.load_perturb_simulation_data(oracle_object=oracle, cell_idx_use=cell_idx, name=lineage_name) # Calculate inner product dev.calculate_inner_product() dev.calculate_digitized_ip(n_bins=10)

# Save results in a hdf5 file. dev.set_hdf_path(path=file_path) dev.dump_hdf5(gene=gene_for_KO, misc=lineage_name)

Running pipeline only with Gata1 as example %%time # Test pipeline with Gata1 gene pipeline(gene_for_KO="Gata1")

Load data with Oracle_systematic_analysis_helper. helper = Oracle_systematic_analysis_helper(hdf5_file_path=file_path) ps = helper.calculate_negative_ps_p_value(misc="Lineage_ME") ps

-->

  gene ps_sum
0 Gata1 -0.0
lpuche commented 1 year ago

Hi! As mentioned @jfischerwue in the comment: https://github.com/morris-lab/CellOracle/issues/107#issuecomment-1468227091 I have had the same issue. If you're going to a bit deeper and find the deep part of this function, you could get how it is managing the data to get 0.0 PS.

In my case, Instead of using:
ps_sum = self.get_negative_PS_p_value(pseudotime=pseudotime, return_ps_sum=True, plot=False)[1] I've used: ps_sum = self.get_negative_PS_p_value(return_ps_sum=True)[1]

After running the modified function, the output is: Out[140]: gene ps_sum_me 0 Onecut2 1.845766 1 Myt1l 1.827973 2 Hivep3 1.826264 3 Pou2f2 1.777027 4 Zbtb7c 1.758463 .. ... ... 211 Tead4 0.000991

Finally, we could get a scatter plot like this: Screenshot 2023-03-29 at 13 54 12

Best, Lorenzo