markrobinsonuzh / cytofWorkflow

MIT License
14 stars 3 forks source link

Cell populations in topTable() wrong #30

Closed LazDaria closed 1 year ago

LazDaria commented 1 year ago

Hi,

first I want to thank you for this very useful package! I am working with spatial experiment objects comprising IMC data and was struggling a bit to transform my SPE into a catalyst compatible SCE object. I managed by creating a new SCE object based on my count matrix, colData, clusters and reducedDims and adding experiment_info to the metadata slot. Using this newly created SCE, I worked through the CyTOF workflow and everything worked fine until the differential abundance section. For some reason, I obtain NA values for my p-values and the proportions returned by

topTable(da_res1, show_props = TRUE, format_vals = TRUE, digits = 2) 

differ from the true proportions which I obtained by:

p <- plotAbundances(sce, k = "celltype", by = "cluster_id", shape_by = "sample_id")
p$data$Freq

Here is how I performed the DA testing:

ei <- metadata(sce)$experiment_info
design <- createDesignMatrix(ei, cols_design = "condition")
contrast <- createContrast(c(0,1))

da_res1 <- diffcyt(sce, 
    design=design, 
    contrast = contrast,
    analysis_type = "DA", method_DA = "diffcyt-DA-edgeR",
    clustering_to_use = "celltype", 
     #min_cells=1, min_samples=1, 
    verbose = T)

I initially thought that the reason for this might be that I only have very few cells for some samples per cluster and conditon, as I am comparing clusters within a metaclusters. However, setting min_cells and min_samples to a lower value did not fix the problem. I am new to DA testing, but I compared the outputs using HDCytoData to my data and found no difference. I also sent you the generated SCE object per email. Thanks so much already!

markrobinsonuzh commented 1 year ago

Really hard to tell without more information. Probably there is a difference in the proportions from plotAbundances and in topTable because a normalization is applied to the latter. I would advise to take the (cluster/celltype-sample) count table and run the DA analyses direct on that, but more or less following the steps of diffcyt does. At least, that way, you can troubleshoot all the intermediate steps.

LazDaria commented 1 year ago

Thank you for your quick reply!

I will try to provide more information. In the figure below, which I generated according to the cytof workflow, you can see the cluster proportions per sample for two conditions (i.e. timepoints):

plotAbundances(sce, k = "celltype", by = "sample")

Cluster proportions per samples collected at two different timepoints

Same thing shown in boxplots:

plotAbundances(sce, k = "celltype", by = "cluster_id", shape_by = "sample_id")

Cluster abundance per sample and condition

It looks like there is a problem with the design or contrast matrix or with the experiment_info. Because everything is correct until I get to the DA testing:

metadata(sce)$experiment_info <- as.data.frame(colData(sce)) %>% group_by(sample_id,condition) %>% 
    summarise(n_cells=n(),.groups = 'drop') %>%
    as.data.frame()

ei <- metadata(sce)$experiment_info
design <- createDesignMatrix(ei, cols_design = "condition")
contrast <- createContrast(c(0,1))

da_res1 <- diffcyt(sce, 
    design=design, 
    contrast = contrast,
    analysis_type = "DA", method_DA = "diffcyt-DA-edgeR",
    clustering_to_use = "celltype", 
    min_cells=1, min_samples=1, 
    verbose = T)

rowData(da_res1$res) 

This returns:

output_diffcyt

For some reason, the p value for clusters 7 to 13 is NA. And interestingly, the function topTable() return zero for the proportions of exactly those clusters across all samples, which, as can be seen in the figures above, is not correct.

topTable(da_res1, show_props = TRUE, format_vals = TRUE, digits = 2)

output_topTable

Here is a link to the SCE object.

Maybe I am overseeing something obvious...

Hope this was informative enough. Thank you so much for your help!

markrobinsonuzh commented 1 year ago

Thanks for the extra info; quite a lot of variability there!

If it were me, I would want to step through the testDA_edgeR function and just double check that everything going in is ok .. design matrix, count table, etc. .. check the intermediate objects.

I'm going on vacation now for 2 weeks, but could check when I get back.

LazDaria commented 1 year ago

I now tested it with edgeR:

library(edgeR)

abundances <- table(metadata(sce)$cluster_codes$celltype, sce$sample_id) 
abundances <- unclass(abundances) 
extra.info <- colData(sce)[match(colnames(abundances), sce$sample_id),]
y.ab <- DGEList(abundances, samples=extra.info)

design <- model.matrix(~factor(condition), y.ab$samples)
y.ab <- estimateDisp(y.ab, design, trend="none")

fit.ab <- glmQLFit(y.ab, design, robust=TRUE, abundance.trend=FALSE)
res <- glmQLFTest(fit.ab, coef=ncol(design))

topTags(res)

With edgeR I get reasonable results: edgeR_results

So, it seems to be related to diffcyt or how I input my data to diffcyt. I will just use edgeR directly for now and maybe ask in the diffcyt github repo in parallel.

Have a nice vacation! Thank you for your support!

SamGG commented 10 months ago

The count per cellType is incorrect, which impacts the next calculations.

# From initial object
> table(sce$cluster_id)
 celltype_1 celltype_10 celltype_11 celltype_12 celltype_13  celltype_2  celltype_3 
       8684         632         670         129        7292        3193        1765 
 celltype_4  celltype_5  celltype_6  celltype_7  celltype_8  celltype_9 
      19786        3071        1065        6426        2505        1188
# From diffcyt output
> rowSums(da_res1$d_counts@assays@data$counts)
 celltype_1 celltype_10 celltype_11 celltype_12 celltype_13  celltype_2  celltype_3 
       9316           0           0           0           0         670       18154 
 celltype_4  celltype_5  celltype_6  celltype_7  celltype_8  celltype_9 
       7292       19786        1188           0           0           0 
SamGG commented 10 months ago

Found results similar to @LazDaria edgeR results when running code line by line but excluding the following line. I excluded it because I did not catch what it is for. Maybe very specific to the use of FlowSOM. Excluding it results in no zero count per celltype (table above) and ensures that code_id == cluster_id. HTH https://github.com/lmweber/diffcyt/blob/39348fd4ebba6a2ca5c38325e587d5c9047f3fda/R/diffcyt_wrapper.R#L373

> rowData(res)
DataFrame with 13 rows and 6 columns
             cluster_id     logFC    logCPM        LR       p_val       p_adj
               <factor> <numeric> <numeric> <numeric>   <numeric>   <numeric>
celltype_1  celltype_1  -0.993162   18.5858  4.343090 3.71594e-02 1.55842e-01
celltype_10 celltype_10 -2.377343   14.3412  3.777920 5.19332e-02 1.55842e-01
celltype_11 celltype_11 -0.203035   15.6122  0.061779 8.03706e-01 8.56278e-01
celltype_12 celltype_12  4.067186   16.2602 24.631513 6.94074e-07 9.02296e-06
celltype_13 celltype_13 -1.527985   14.4500  1.195690 2.74185e-01 4.45550e-01
...                 ...       ...       ...       ...         ...         ...
celltype_5   celltype_5  0.246238   16.4777 0.1939437    0.659654    0.856278
celltype_6   celltype_6  0.835572   16.0875 2.4257423    0.119357    0.258606
celltype_7   celltype_7 -1.807111   14.9109 1.8789839    0.170450    0.316549
celltype_8   celltype_8 -0.172904   15.6040 0.0671972    0.795462    0.856278
celltype_9   celltype_9 -0.523879   14.6839 0.3293972    0.566014    0.817576
> rowSums(d_counts@assays@data$counts)
 celltype_1 celltype_10 celltype_11 celltype_12 celltype_13  celltype_2  celltype_3 
       8684         632         670         129        7292        3193        1765 
 celltype_4  celltype_5  celltype_6  celltype_7  celltype_8  celltype_9 
      19786        3071        1065        6426        2505        1188