Reproducing the data analysis?

timothy-barry commented 2 years ago

Thanks for this great work.

We are looking to replicate the portion of your computational pipeline enclosed by the red rectangle (below).

mimosca

Our starting point is (i) a binary gRNA-by-cell matrix of gRNA assignments and (ii) a gene-by-cell expression matrix of UMI counts. We want to compute a p-value to test the association between each gene and gRNA, as reported in Figure 4c of the paper (below).

p_vals

How can we use the code this repository to carry out this analysis, going from input (i.e., gRNA-by-cell indicator matrix, gene-by-cell expression matrix) to output (gene-by-gRNA matrix of p-values)? The linear_model.py script contains seven (non-plotting) functions: fit_lm, calc_corr_matrix, calculate_inertia, cluster_by_kmeans, shuffle_and_fit, calc_p_vals, and design_from_arrs. Can we use these seven functions to carry out the analysis? If so, could you please indicate the order in which we should call these functions to get the desired output from the input?

CFrangieh commented 2 years ago

Hey Tim

The function you want to start with is shuffle_and_fit followed by calc_p_vals. Given an equation of the form Y=XB it should look something like this:

null_distribution = shuffle_and_fit(X_mat, Y_mat, cov_ind, num_iters=10000)

Where cov_ind specifies the covariate you would like to calculate a null distribution for (calculate a separate null distribution for each covariate you are interested in). I used a bash script that calls this function in a loop for each covariate and then saves the null distributions, but this is not necessary if you are only interested in a few covariates.

You will then call calc_p_vals to calculate your vector of empirical p-values:

p_vals = calc_p_vals(B_mat, null_distribution, cov_ind)

Where null_distribution is the output of shuffle_and_fit and cov_ind should be the same cov_ind you use to call shuffle_and_fit. If you want to loop through all covariates, you could call each of these functions in each iteration of the loop and build your p-value matrix column-wise

timothy-barry commented 2 years ago

Thanks Chris! I'll try this and get back to you within a couple days.

timothy-barry commented 2 years ago

Hey Chris,

Thanks for your tips. They are very helpful. I cooked up an example script in which I generate synthetic gene and gRNA data and compute p-values to test the association between a single gRNA and all genes. I am importing your linear_model.py functions in a module called mimosca (see below).

import numpy as np
import mimosca

# define some constants
np.random.seed(10)
N_CELLS = 200
N_GRNAS = 10
N_GENES = 50

# create the synthetic gene expression matrix
gene_mat = np.random.poisson(lam = 1, size = (N_CELLS, N_GENES))

# define a vector of grna names ("grna_1", ..., "grna_N_GRNAS")
sgRNA_names_l = []
for i in range(0, N_GRNAS):
    sgRNA_names_l.append("grna_" + str(i))
sgRNA_names = np.array(sgRNA_names_l)

# define an array of gRNA assignments
sgRNA_assignments_int = np.random.randint(0, high=N_GRNAS, size=N_CELLS)
sgRNA_assignments_l = []
for i in range(0, N_CELLS):
    sgRNA_assignments_l.append(sgRNA_names[sgRNA_assignments_int[i]])
sgRNA_assignments = np.array(sgRNA_assignments_l)

# compute the design matrix from sgRNA_names and sgRNA_assignments
design_mat = mimosca.design_from_arrs(sgRNA_names, sgRNA_assignments)

# estimate the beta matrix using `mimosca.fit_lm`
B_mat = mimosca.fit_lm(design_mat, gene_mat)

# call `shuffle_and_fit` to compute the null distribution for "grna_0"
cov_ind = 0
null_distribution = mimosca.shuffle_and_fit(design_mat, gene_mat, cov_ind)

# calculate the p-values between "gRNA_0" and all genes
p_vals = mimosca.calc_p_vals(B_mat, null_distribution, cov_ind)

Did I get this right?

I also have a few remaining questions.

I assume that B_mat (in the final line) is supposed to be the output of mimosca.fit_lm. Is this correct?
MIMOSCA does not require me to label which gRNAs are negative controls and which are targeting. Is this correct?
The figure that I pasted from your paper indicates that there are EM algorithm and gRNA pairwise correlation steps in the pipeline. Should I implement these steps as well? Or have I accurately implemented MIMOSCA above?

CFrangieh commented 2 years ago

Hi Tim

This looks good to me.

Yes that is correct. B_mat is the regulatory coefficient matrix that is the output of fitting the linear model.
Yes also correct. The model does not discern between targeting and non-targeting sgRNAs. Because the linear regression fits an intercept (see documentation here), all cells collectively serve as the background or "control" distribution. The control sgRNAs can then be examined to get an idea of the noise in your coefficients.
You could implement EM separately. This function is from Dixit et al. - see the Github here. It is the function bayes_cov_col and should be implemented as: X_mat_EM = bayes_cov_col(Y_mat_df,X_mat_df,sgRNAs_for_EM,lm_object) Where Y_mat_df and X_mat_df are Pandas DataFrame versions of your y and x matrices (can convert with Y_mat_df = pd.DataFrame(Y_mat, columns=feature_names) and X_mat_df = pd.DataFrame(X_mat, columns=covariate_names), sgRNAs for EM is a list of sgRNA names for which you would like to run the EM algorithm, and the lm_object is the second output of the fit_lm function (I realized the version of fit_lm that was posted only returned coefficients, so I just updated it so that this function returns coefficients as the first output and the fit object itself as the second output). So you can adjust your code above to be: B_mat, lm_object = mimosca.fit_lm(design_mat, gene_mat)

timothy-barry commented 2 years ago

Thanks Chris. I'll get back to you shortly.

timothy-barry commented 2 years ago

Thanks again, Chris. I think this is all we need. I'm closing the issue now.

klarman-cell-observatory / Perturb-CITE-seq

Reproducing the data analysis? #4