theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License
147 stars 24 forks source link

Looping over cell types to use as reference: discrepancy? #53

Open koenvandenberge opened 2 years ago

koenvandenberge commented 2 years ago

Hi All,

I am using scCODA for differential abundance analysis of a scRNA-seq dataset. If I perform a manual reference cell type selection using e.g. B-cells as reference, and another time using dendritic cells as reference, I find that in both cases, for example, T-cells are differentially abundant between conditions, at a permissive FDR of 0.4.

However, when looping over the different cell types to use each as a reference sequentially to check the stability of the results, I notice that the, e.g., T-cells are only significant once. How can this discrepancy be explained?

Here is the code for automatic/manual selection of these cell types (automatic selection selects another cell type than B-cells).

## automatic
cellCountData = dat.from_pandas(cellCounts, covariate_columns=["treatment"])
model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="automatic")
res = model.sample_hmc()
# no credible effect has been found for any cell population
res.summary()
# Some effect at higher FDR
res.set_fdr(est_fdr=0.4)
res.summary()

## manual
model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="B-cell")
resBCell = model.sample_hmc()
# no credible effect has been found for any cell population
resBCell.summary()
# Some effect at higher FDR
resBCell.set_fdr(est_fdr=0.4)
resBCell.summary()

To loop over the cell types, I am using the code from the vignette (slightly adapted to allow for permissive FDR).

cell_types = cellCountData.var.index
results_cycle = pd.DataFrame(index=cell_types, columns=["times_credible"]).fillna(0)

for ct in cell_types:
    print(f"Reference: {ct}")

    # Run inference
    model_temp = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type=ct)
    temp_results = model_temp.sample_hmc()

    # Select credible effects
    temp_results.set_fdr(est_fdr=0.4)
    cred_eff = temp_results.credible_effects()
    cred_eff.index = cred_eff.index.droplevel(level=0)

    # add up credible effects
    results_cycle["times_credible"] += cred_eff.astype("int")
johannesostner commented 2 years ago

Hi Koen,

scCODA uses a thresholding on the posterior inclusion probability of each cell type to decide whether they are differentially abundant. The nominal FDR approximately determines, where this decision boundary is placed.

Since inference in scCODA is done via HMC sampling, small differences in the posterior inclusion probability between runs can occur. I suspect that in your data, the posterior inclusion probability of T-cells fluctuates right around the decision boundary. In that case, two runs (regardless of looping over different references) can differ slightly.

I hope that this explains your observation!

johannesostner commented 2 years ago

One additional remark: A FDR of 0.4 is quite high - meaning that you expect 40% of your discovered effects to be false positives. If this was just to test the method, it's ok to do so. Please be aware of this though when analyzing a new dataset!

koenvandenberge commented 2 years ago

Thanks @johannesostner. Am I right to think that, as I increase the number of posterior samples, the between-run variability should decrease since the posterior is estimated with higher precision? I just re-ran an scCODA analysis using one particular cell type as reference, and across three runs I am observing quite different results every time (each time a different combination of significant cell types).

koenvandenberge commented 2 years ago

As another sanity check, when looping over all cell types to use as a reference, I've set est_fdr=1 in order to try and understand the scCODA results. I had expected all cell types to become significant, but this doesn't happen. In fact, each cell type is only significant once. Could this be expected?

johannesostner commented 2 years ago

Thanks @koenvandenberge for pointing these things out!

Yes, a larger number of posterior samples can decrease the between-run variability, but only up to a certain point. We observed some fluctuation in the significance of some cell types even after running longer chains before, but not to the extent that you describe here.

As for your second comment, this is a result that surprises me. An FDR of 1 should definitely lead to all cell types (except the reference) being significant.

I'm really curious what went wrong there. Could you please give me some info about your data (number of samples and cell types, the chain lengths you run, version of scCODA, tensorflow and tensorflow-probability)? Also, it would be very helpful to look at the acceptance rate of each run (printed at the end of the sampling or saved in result.sampling_stats["acc_rate"]), and the inclusion probabilities of the cell types (available in result.effect_df or as a print-out through result.summary_extended())

Thanks in advance!

koenvandenberge commented 2 years ago

Thanks @johannesostner, this is incredibly useful. Unfortunately, this is sensitive data and I cannot share it as such. But I will try to make a reproducible example for this.

I am using scCODA 0.1.7, tensorflow 2.8.0 and tensorflow_probability 0.16.0. The data consists of 10 samples (five samples in each of two groups), and 16 cell types. The number of cells per sample ranges from ~230 to ~4.5K.

Acceptance rates of the runs mentioned above are quite similar each time, ranging from 52-60%.

Hope to get back to you soon with a reproducible example.

koenvandenberge commented 2 years ago

Hi @johannesostner

Apologies for the delay in getting back to you. Below, I show some (RMarkdown) code using artificial data whose properties mimic my real data somewhat. The data is also attached as an rds (R) object. While I find celltype12 to be credible changing at 40% FDR (using this just for didactic purposes) while using either celltype6 or celltype11 as reference, when I loop over all cell types each cell type is credible once (except for one cell type), even at FDR of 1. This seems to conflict with the manual selection of celltype6 and celltype11, so I must be missing something.

Further, when re-evaluating the code when using celltype6 as reference, we see that celltype12 is consistently identfied as credibly changing. Usually at least one other cell type is also changing. However, which one is variable across re-evaluations of the code.

Let me know if anything is unclear or something else would be required. Thanks in advance for taking a look.

Import in R


# In R
# RMarkdown: ```{r}
cellCountsArtificial <- readRDS("cellCountsArtificialAll.rds")
reticulate::use_python("/opt/conda/bin/python")
# RMarkdown: ```

Analyze in Python

Set-up


# All following is in Python (using ```{python})

# RMarkdown: ```{python}

import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt

from sccoda.util import comp_ana as mod
from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz

import sccoda.datasets as scd
import sccoda as scc
import arviz as az

import tensorflow as tf
import tensorflow_probability as tfp

Automatic selection of reference population

Here, we find no significant cell types, except for high FDR values.


# get object from R
cellCounts = r.cellCountsArtificial
# convert to anndata
cellCountData = dat.from_pandas(cellCounts, covariate_columns=["treatment"])
print(cellCountData)

viz.boxplots(cellCountData, feature_name="treatment")
plt.show()

model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="automatic")
res = model.sample_hmc()
# no credible effect has been found for any cell population
res.summary()
# Some effect at higher FDR
res.set_fdr(est_fdr=0.4)
res.summary()
res.set_fdr(est_fdr=1)
res.summary()

Manual selection of reference cell type: celltype6

celltype12 consistently identified at 40% FDR. Other significant cell types variable upon re-running code.


# get object from R
cellCounts = r.cellCountsArtificial
# convert to anndata
cellCountData = dat.from_pandas(cellCounts, covariate_columns=["treatment"])
print(cellCountData)

viz.boxplots(cellCountData, feature_name="treatment")
plt.show()

model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="celltype6")
res = model.sample_hmc()
res.summary()
# Some effect at higher FDR
res.set_fdr(est_fdr=0.4)
res.summary()

Manual selection of reference cell type: celltype11

celltype12 again consistently identified at 40% FDR. Other significant cell types variable upon re-running code.


# get object from R
cellCounts = r.cellCountsArtificial
# convert to anndata
cellCountData = dat.from_pandas(cellCounts, covariate_columns=["treatment"])
print(cellCountData)

viz.boxplots(cellCountData, feature_name="treatment")
plt.show()

model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="celltype11")
res = model.sample_hmc()
res.summary()
# Some effect at higher FDR
res.set_fdr(est_fdr=0.4)
res.summary()

Looping over all populations as reference

However, here, each cell type is found to be credibly changing only once. This seems to conflict with the previous observations.


cell_types = cellCountData.var.index
results_cycle = pd.DataFrame(index=cell_types, columns=["times_credible"]).fillna(0)

for ct in cell_types:
    print(f"Reference: {ct}")

    # Run inference
    model_temp = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type=ct)
    temp_results = model_temp.sample_hmc()

    # Select credible effects
    temp_results.set_fdr(est_fdr=1)
    cred_eff = temp_results.credible_effects()
    cred_eff.index = cred_eff.index.droplevel(level=0)

    # add up credible effects
    results_cycle["times_credible"] += cred_eff.astype("int")

# Calculate percentages
results_cycle["pct_credible"] = results_cycle["times_credible"]/len(cell_types)
results_cycle["is_credible"] = results_cycle["pct_credible"] > 0.5
print(results_cycle)
# RMarkdown: ```

cellCountsArtificialAll.rds.zip

johannesostner commented 2 years ago

Hi @koenvandenberge!

Thanks for taking the time to prepare a mock dataset. I've looked at the data and your analysis - here's what's going on: Looking at the relative abundances in the data, we can see that most cell types do not change much, except maybe for celltype12: image

Setting FDR to 0.4

As I mentioned earlier in the discussion, scCODA uses the posterior inclusion probabailities (PIPs) of the cell types to calculate a decision boundary. This boundary is placed such that the average rejection probability (1-PIP) of all selected cell types is less than the desired FDR.

Taking 10 runs with celltype6 as the reference, all cell types but celltype12 (~0.75) and celltype6 (reference - cannot be differentially abundant) have a PIP around 0.5 in all runs, with some fluctuation due to the MCMC sampling (dots in the plot below). Therefore, when taking FDR=0.1, nothing will be DA. When taking FDR=0.3, only celltype12 will show up. An FDR of 0.4 will then place the decision boundary (dashed lines) exactly inside the bunch of plots, resulting in different cell types being selected (i.e. above the decision boundary) depending on the random variations in the PIP: image When using celltype11 as the reference, the same happens.

Looping over different references

Here, I feel like you misinterpreted the results. I ran your code (using each cell type as reference with FDR=1) and get this result, telling me that every cell type was selected in all runs but one: image This is the expected result, since a cell type cannot be selected if it is the reference, but the FDR of 1 forces all other cell types to be selected each time.

I hope that this answers your questions. For reference, you can find my code in the attached notebook as well. If you have more questions, don't hesitate to continue this discussion!

test_references_May22.ipynb.zip

koenvandenberge commented 2 years ago

Thanks so much. I agree with all your points, and the figure on inclusion probabilities is incredibly insightful!

It does look like we are having a discrepancy when looping over the different cell types. I agree that if I would have had the result you have, my interpretation would have been wrong. However, these are the results I am getting, and this is what actually sparked my concerns initially... So, in the end, could all of this be due to a legacy version issue? Or is something wrong with my code of looping over cell types?

sccLoop
johannesostner commented 2 years ago

I tried running the code you posted again and still get the same result as before. In the screenshot you just posted, it looks like only the result of the very last run (reference celltype16) is taken into account in the column times_credible instead of the sum of all runs.

I have a few ideas why this might have happened, none of them directly related to scCODA, and admittedly all very unlikely.

  1. This could be the result of a simple indentation bug - un-indenting the line results_cycle["times_credible"] += cred_eff.astype("int") gives me the exact same result that you have (although I assume that you just copy-pasted your code, which has the line indented correctly, into this issue)
  2. Some bug caused by old versions of numpy or pandas in either my or your environment. Could you tell me your version numbers of these packages?
  3. I am running the code in a jupyter notebook instead of RMarkdown and reticulate. Highly unlikely, but this might also be the cause here

Could you please check at least points 1 and 2? Thanks!

koenvandenberge commented 2 years ago

Hi @johannesostner

Just already letting you know that I have now run your shared code in a Jupyter Notebook and I indeed get similar results. I am working in a container with fixed software versions, so it definitely is not a version issue. Will investigate further.