Open sneddonucsf opened 1 year ago
I think also a tutorial for something like Figure 1i would be greatly appreciated!
@sneddonucsf
I'm Kenji, author/developer of CellOracle. Thank you very much! Yes, I'm happy to share codes for the systematic perturbation and PS sum calculation. I have not included the code for this part in the tutorial so far because it is a bit complicated and messy. Please let me organize the codes. Also, I'd like to add some comments and explanations before sharing them. I will share them in a week.
@KenjiKamimoto-wustl122 that sounds great, thanks!
@sneddonucsf
I shared my notebooks for the systematic simulation analysis. https://github.com/morris-lab/CellOracle/tree/master/docs/notebooks/05_simulation/misc/scale_up_simulation
The original notebooks were messy, so I put some functions into CellOracle package to organize them nicely. Can you please update celloracle to version 0.12.0 and try the notebook in the folder above?
Kenji
@KenjiKamimoto-wustl122 thank you so much, these worked great! One question - you mention that finding the hyperparameters before doing the systematic perturbations is key. Which parameter are those, just the ones that we set in when analyzing a single gene (like in the tutorial)? Does the vm
parameter matter, or is that more for visualization for the inner product scores?
@sneddonucsf
I'm sorry for the lack of explanation about that. And thank you for the great question!
1) The most critical parameter is dimensional reduction/trajectory data. CellOracle plots vectors on the 2D trajectory embedding space, and we need to ensure the structure represents the biologically appropriate structure. There are a lot of hyper-parameters when we create the trajectory embedding, and it requires biological knowledge to evaluate it. For example, our tutorial used a force-directed graph and got a branching tree-like structure that represents a hematopoiesis hierarchical differentiation. I recommend using a force-directed graph, but it may not be effective for some scRNA-seq data. Instead, you may use UMAP or TSNE. When you want to use PS score metrics, a tree-like trajectory graph structure will give robust results because the differentiation flow can be simpler and clearer on the tree-like structure. If the trajectory structure is an uncertain shape or scatter shape, the vector of differentiation will not make sense on the trajectory; thus, the PS score will be messy.
2) Regarding the point above, please evaluate your vector map for the developmental graph. If the development vector map does not make sense, we cannot interpret the PS score.
3) The next important parameter is n_neighbors
parameter in this step. https://morris-lab.github.io/CellOracle.documentation/notebooks/05_simulation/Gata1_KO_simulation_with_Paul_etal_2015_data.html#3.-In-silico-TF-perturbation-analysis
The parameter controls the number of local neighbor cells used in the calculation of vectors.
Please start with n_neighbors
=200, and increase the number. In general, I recommend using a higher number if you have a large data.
4) vm parameter controls scales of color map, it will change visualization, but it does not affect TF ranking based on PS values.
@KenjiKamimoto-wustl122 that all makes sense, thank you very much! Sorry, last question. In your Nature publication, you use the randomized simulation to calculate a FDR cutoff for the systemic perturbations (i.e. Extended Data Figure 3G). Is that required for all systemic perturbation analyses, or just something extra you included?
@sneddonucsf
It was just an extra thing. It is not required for the all systematic perturbation analysis.
Best, Kenji
Hello @KenjiKamimoto-wustl122,
first of all, thank you for this amazing tool! I am currently trying to analyze a T cell dataset of synovial fluid and it worked great so far! I tried KO simulation for several TFs. I am only having trouble with the systematic KO simulation analysis. when I run the pipeline function as in your Notebook, I get a ps_sum of -0.0 for all analyzed TFs. I get the same result with the tutorial data from Paul et al. Did I do something wrong or is anyone else having the same issue?
Thanks a lot for your help!
Best Jonas
As an example i analyzed only Gata1 with the pipeline (tutorial data from Paul et al.):
import copy
import glob
import importlib
import time
import os
import shutil
import sys
from importlib import reload
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
from tqdm.notebook import tqdm
import celloracle as co
from celloracle.applications import Oracle_development_module, Oracle_systematic_analysis_helper
co.__version__
oracle = co.data.load_tutorial_oracle_object()
links = co.data.load_tutorial_links_object()
gradient = co.load_hdf5("./Paul_etal.celloracle.gradient")
print(oracle.adata.shape, gradient.embedding.shape)
assert((oracle.adata.obsm[oracle.embedding_name] == gradient.embedding).all())
--> (2671, 1999) (2671, 2)
links.filter_links()
oracle.get_cluster_specific_TFdict_from_Links(links_object=links)
oracle.fit_GRN_for_simulation(alpha=10, use_cluster_specific_TFdict=True)
# Get cell_id
cell_idx_Lineage_GMP = np.where(oracle.adata.obs["louvain_annot"].isin([ 'GMP_0', 'GMP_1', 'GMP_2', 'GMPl_0', 'GMPl_1', 'Gran_0', 'Gran_1', 'Gran_2', 'Gran_3', 'Mo_0', 'Mo_1', 'Mo_2']))[0]
cell_idx_Lineage_MEP = np.where(oracle.adata.obs["louvain_annot"].isin([ 'Ery_0', 'Ery_1', 'Ery_2', 'Ery_3', 'Ery_4', 'Ery_5', 'Ery_6', 'Ery_7', 'Ery_8', 'Ery_9', 'MEP_0', 'Mk_0']))[0]
cell_idx_granulocytes = np.where(oracle.adata.obs["louvain_annot"].isin(['GMPl_1', 'Gran_0', 'Gran_1', 'Gran_2', 'Gran_3']))[0]
cell_idx_monocytes = np.where(oracle.adata.obs["louvain_annot"].isin(['GMPl_0', 'Mo_0', 'Mo_1', 'Mo_2']))[0]
# Make dictionary to store the cell index list
index_dictionary = {"Whole_cells": None,
"Lineage_GM": cell_idx_Lineage_GMP,
"Lineage_ME": cell_idx_Lineage_MEP,
"Granulocytes": cell_idx_granulocytes,
"Monocytes": cell_idx_monocytes}
Pipeline Function
# 0. Define parameters
n_propagation = 3
n_neighbors=200
file_path = "Systematic_simulation_results_Paul_data.celloracle.hdf5" # Please use .hdf5 for extension.
def pipeline(gene_for_KO):
# 1. Simulate KO
oracle.simulate_shift(perturb_condition={gene_for_KO: 0},
ignore_warning=True,
n_propagation=3)
oracle.estimate_transition_prob(n_neighbors=n_neighbors, knn_random=True, sampled_fraction=1)
oracle.calculate_embedding_shift(sigma_corr=0.05)
# Do simulation for all conditions.
for lineage_name, cell_idx in index_dictionary.items():
dev = Oracle_development_module()
# Load development flow
dev.load_differentiation_reference_data(gradient_object=gradient)
# Load simulation result
dev.load_perturb_simulation_data(oracle_object=oracle, cell_idx_use=cell_idx, name=lineage_name)
# Calculate inner product
dev.calculate_inner_product()
dev.calculate_digitized_ip(n_bins=10)
# Save results in a hdf5 file.
dev.set_hdf_path(path=file_path)
dev.dump_hdf5(gene=gene_for_KO, misc=lineage_name)
Running pipeline only with Gata1 as example
%%time
# Test pipeline with Gata1 gene
pipeline(gene_for_KO="Gata1")
Load data with Oracle_systematic_analysis_helper.
helper = Oracle_systematic_analysis_helper(hdf5_file_path=file_path)
ps = helper.calculate_negative_ps_p_value(misc="Lineage_ME")
ps
-->
gene | ps_sum | |
---|---|---|
0 | Gata1 | -0.0 |
Hi! As mentioned @jfischerwue in the comment: https://github.com/morris-lab/CellOracle/issues/107#issuecomment-1468227091 I have had the same issue. If you're going to a bit deeper and find the deep part of this function, you could get how it is managing the data to get 0.0 PS.
In my case, Instead of using:
ps_sum = self.get_negative_PS_p_value(pseudotime=pseudotime, return_ps_sum=True, plot=False)[1]
I've used: ps_sum = self.get_negative_PS_p_value(return_ps_sum=True)[1]
After running the modified function, the output is: Out[140]: gene ps_sum_me 0 Onecut2 1.845766 1 Myt1l 1.827973 2 Hivep3 1.826264 3 Pou2f2 1.777027 4 Zbtb7c 1.758463 .. ... ... 211 Tead4 0.000991
Finally, we could get a scatter plot like this:
Best, Lorenzo
Hello,
Fantastic analysis package and documentation - I am very interested in delving deeper into its capabilities!
I would like to rank TFs based on their negative PS Sum values, like you presented in Figure 5a of the Nature publication. Would you be able to provide code for how you did this with the data in those figures? I would like to perform something similar for my dataset. Thank you very much and congrats on the publication!