theislab / scCODA

A Bayesian model for compositional single-cell data analysis
BSD 3-Clause "New" or "Revised" License

Understanding inclusion probability #67

auesro closed 1 year ago

auesro commented 1 year ago

Dear scCODA team,

I have been using scCODA on my snRNA-seq data to find differentially distributed cell types. I have read most of the documentation, but given that I am not a mathematician (or related), I have one question:

[Image: Corrected]

I have 51 cell types in my dataset. According to the previous plot, I would say that only clusters 29, 31 and 43 are differentially distributed most of the time. Accordingly, when quantifying the analysis using all cell types as reference, I obtain:

[Image: Credible]

This seems to confirm the previous conclusion. Following the results, I would argue that cluster 29 is very likely (at FDR 0.05) to be differentially distributed between my 2 groups (Control vs Experimental). Am I correct so far?

Now, the main question: Would it be correct to assume that, given that the other clusters of cells (all except 29, 31 and 43) show no differential distribution (all are at 0%), they could be used as the reference type? Consequently, shouldn't I expect to see clusters 29, 31 and 43 above the threshold lines in the first plot when the majority of clusters are used as reference? Why is that not the case?

Thanks a lot!

Cheers,

A

johannesostner commented 1 year ago

Hi @auesro! The conclusions you draw from your applications of scCODA are correct - the inclusion probabilities of clusters 29, 31 and 43 are above the run-specific threshold for a majority of reference clusters. This is what you see in the second plot. For example, cluster 31 is above the run-specific threshold in more than 70% of runs.

I'm not 100% sure if I understood your question correctly, but I'll try to explain what's going on in the first plot: The visualization in your first plot shows the inclusion probability (called IP from here on) of each cluster in all runs as dots, as well as the IP threshold for each run as a dashed line. This threshold is not fixed, but changes in each run depending on which clusters are found to be differentially abundant (for example, at an FDR of 5%, the average IP of all selected clusters must be above 95%). Each threshold line is therefore only relevant for the dots from the same run!

As an example, you cannot compare the IP of cluster 31 in the run with reference cluster 1 with the IP threshold of the run with reference cluster 17. If you plot each run separately, you will see that the IP threshold will be below the IP for cluster 31 about 75% of the time (the value in the second plot).
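To make the run-specific threshold more concrete, here is a minimal sketch of the rule described above: add clusters in order of decreasing IP for as long as the average IP of the selected set stays at or above 1 - FDR. This is an illustration only, not scCODA's actual code; the function name `select_by_fdr` and the IP values are made up for the example.

```python
import numpy as np

def select_by_fdr(inclusion_probs, fdr=0.05):
    """Hypothetical helper: pick the largest set of clusters whose
    average inclusion probability is at least 1 - fdr.

    Returns (indices of selected clusters, IP threshold for this run).
    The threshold is the smallest IP among the selected clusters,
    or None if nothing is selected.
    """
    ips = np.asarray(inclusion_probs, dtype=float)
    order = np.argsort(-ips)                      # highest IP first
    sorted_ips = ips[order]
    # Mean IP of the top-k clusters, for k = 1..n (non-increasing in k)
    running_mean = np.cumsum(sorted_ips) / np.arange(1, len(ips) + 1)
    n_selected = int((running_mean >= 1.0 - fdr).sum())
    if n_selected == 0:
        return order[:0], None
    return order[:n_selected], float(sorted_ips[n_selected - 1])

# Two made-up runs (different reference clusters), same five clusters
run_a = [1.0, 0.98, 0.90, 0.40, 0.05]
run_b = [1.0, 0.92, 0.90, 0.40, 0.05]

sel_a, thr_a = select_by_fdr(run_a)
sel_b, thr_b = select_by_fdr(run_b)
print(sel_a, thr_a)
print(sel_b, thr_b)
```

In this toy data, the third cluster has an IP of 0.90 in both runs, but it is only selected in the first one, because the IPs of the other selected clusters determine where the cutoff falls. That is why a dot can only be compared with the threshold line from its own run.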

Regarding the reference cluster: We know that a clear reference (a cluster that is known to not be affected by the condition) is not always readily available. To work around that, there are two solutions: either using the automatic reference cluster selection, or doing runs with all references and aggregating the results, as you did. In the second case, you can assume each cluster that is selected in more than 50% of the runs (29, 31 and 43 in your case) to be differentially abundant. All clusters except 29, 31 and 43 would be good references and would give almost the same result (due to the natural variation in the data there can be small differences, so e.g. cluster 31 might not be selected for every reference).
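The aggregation over all reference runs can be sketched as a simple majority vote. The snippet below is a rough illustration with made-up results, not output from scCODA; `selected_per_reference` and `majority_vote` are hypothetical names, and in practice you would fill the dictionary from the credible effects of each run.

```python
from collections import Counter

# Made-up example: reference cluster -> clusters flagged as credible in that run
selected_per_reference = {
    "1": {"29", "31", "43"},
    "2": {"29", "43", "5"},
    "17": {"29", "31", "43"},
    "30": {"29", "31"},
}

def majority_vote(selected_per_reference, min_fraction=0.5):
    """Hypothetical helper: keep clusters selected in more than
    min_fraction of the runs, with the fraction of runs for each."""
    n_runs = len(selected_per_reference)
    counts = Counter(
        cluster
        for selected in selected_per_reference.values()
        for cluster in selected
    )
    fractions = {cluster: n / n_runs for cluster, n in counts.items()}
    return {c: f for c, f in fractions.items() if f > min_fraction}

result = majority_vote(selected_per_reference)
print(result)
```

Here cluster 5 is flagged in only one of four runs and is dropped, while 29, 31 and 43 pass the 50% mark even though 31 and 43 are each missed once.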

I hope that this answers your question!

auesro commented 1 year ago

Hi @johannesostner, thanks a lot for the explanation! I was misunderstanding the first plot because of the color palette used. I thought each "column" of dots represented the IP of all clusters when the cluster on the x axis was used as the reference type; however, what each "column" actually represents is the IP of the cluster on the x axis when each cluster in turn is used as reference! Would it be possible to plot exactly what I thought the first plot was? I mean a plot where you can see the IP of each cluster (dots) when the cluster on the x axis is used as the reference type.

johannesostner commented 1 year ago

We don't have a dedicated function for producing this plot. Simply switching the hue and x parameters in your first plot might give you the plot you want, though. If this does not work, could you show me the code you use to generate this plot?
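As a rough sketch of what switching hue and x could look like: the original plotting code is not shown in this thread, so the DataFrame layout and the column names (`reference`, `cluster`, `inclusion_prob`) below are assumptions, and the values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumed long format: one row per (reference run, cluster) pair
df = pd.DataFrame({
    "reference": ["1", "1", "1", "17", "17", "17"],
    "cluster": ["29", "31", "43", "29", "31", "43"],
    "inclusion_prob": [1.0, 0.98, 0.90, 1.0, 0.92, 0.90],
})

fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Before: one column per cluster, colored by the reference used in each run
sns.scatterplot(data=df, x="cluster", y="inclusion_prob",
                hue="reference", ax=ax_before)

# After: one column per reference run, colored by cluster
sns.scatterplot(data=df, x="reference", y="inclusion_prob",
                hue="cluster", ax=ax_after)

fig.tight_layout()
```

In the second panel, each column holds the dots of a single run, so the threshold line of that run can be compared directly with everything in its column.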

auesro commented 1 year ago

Of course, you were right. I just needed to switch the hue and x parameters in the seaborn scatterplot. Thanks a lot! Completely solved!