statOmics / satuRn

satuRn is a highly performant and scalable method for performing differential transcript usage analyses.
https://statomics.github.io/satuRn/

FDR control and use of raw counts #19

Closed. ChiaraCaprioli closed this issue 1 year ago.

ChiaraCaprioli commented 2 years ago

Hi,

I am using satuRn on a single-cell dataset to score DTU among groups of interest. Everything seems to work fine, except that I obtain very few significant empirical FDR values in all the contrasts I tried. I understand this might be due to both biological and technical reasons, especially zero-inflation and shallow sequencing, so I am checking potential ways to minimize this behavior.

Did I understand correctly that in your model it is OK to use raw count estimates as input data (i.e., not quantified by tools such as Alevin)? Should I expect any influence from this?

Thanks for your help

jgilis commented 2 years ago

Hi Chiara,

Yes, we indeed suggest using "raw" counts as input for satuRn, by which we simply mean non-normalized counts. If you import counts using the tximport package, this basically means importing with the default setting for the countsFromAbundance argument, which is "no" (resulting in raw counts). Since setting other options for this argument, like "scaledTPM", didn't have a pronounced effect on performance in our benchmarks, we prefer working with the raw counts.

Such raw counts can be obtained from all quantifiers; alevin should be no exception.
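
For reference, a minimal sketch of such an import with tximport (the quantification paths and the type argument are placeholders for your own setup):

library(tximport)

# hypothetical paths to the quantification files, one per sample
files <- file.path(quant_dirs, "quant.sf")

txi <- tximport(files,
                type = "salmon",              # or another supported quantifier type
                txOut = TRUE,                 # keep transcript-level counts
                countsFromAbundance = "no")   # the default: raw, non-normalized counts

counts <- txi$counts  # transcript-by-sample matrix of raw counts to feed into satuRn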

Your reasoning is correct; this simply means that for the moment you are not able to pick up biological signal, which may have (at least) three reasons: (1) there isn't much biological signal altogether, (2) the biological signal is simply hidden by the technical noise, or (3) the empirical correction by satuRn is too stringent.

If (1) is the case, there obviously is not much you can do.

To alleviate (2), you could consider a more stringent transcript-level filtering criterion, for instance with the edgeR::filterByExpr and DRIMSeq::dmFilter functions. To give you an idea, for our benchmarks I used

filterByExpr(min.count = 1, min.total.count = 0, large.n = 0, min.prop = 0.5) 
# large.n set to zero to force using min.prop

dmFilter(min_samps_feature_expr = ns/2, 
         min_feature_expr = 10, 
         min_samps_feature_prop = ns/2, 
         min_feature_prop = 0.1,
         min_samps_gene_expr = ns, 
         min_gene_expr = 10) 
# with ns the size of the smallest group in the comparison

as the lenient and stringent filtering criteria, respectively, as described in the Methods section of the paper.

Finally, issue (3). In all our benchmarks, empirical correction was vital to keep control of the type I error, and the correction also did a decent job of not overcorrecting. However, we only benchmarked this on three real single-cell datasets, so it may very well be that for your data something is going wrong. Therefore, I would really look at the two diagnostic plots produced by the testDTU function. For diagplot1, does the blue dashed curve nicely follow your test statistics, and are your test statistics nicely bell-shaped? For diagplot2, does the dark green curve follow the bulk/mid portion of your test statistics? Note that more stringent filtering could alleviate such issues as well.
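
A minimal sketch of requesting those plots, assuming a fitted object sumExp and a contrast matrix L as in the vignette:

sumExp <- satuRn::testDTU(object = sumExp, 
                          contrasts = L, 
                          diagplot1 = TRUE, # the blue dashed curve should follow the test statistics
                          diagplot2 = TRUE, # the dark green curve should follow the bulk of the test statistics
                          sort = TRUE)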

Kind regards,

Jeroen

ChiaraCaprioli commented 1 year ago

Hi Jeroen,

Thank you for your quick reply and explanation. I basically followed the same list of 'issues' to troubleshoot what's going on in my dataset (human bone marrow, full-length cDNA), so I am reporting the workflow here with some thoughts/questions.

1) 'there isn't much biological signal altogether'. Since for now I am only interested in technical bias, I tried to minimize the issue by contrasting distant cell lineages, which should differ a lot in terms of gene expression and isoform usage. For example, hematopoietic stem cells (HSC) vs T cells.

2) 'the biological signal is simply hidden by the technical noise'. For this, I tried both functions you suggested, with both lenient and stringent criteria:

  1. edgeR, criteria as in satuRn paper
    
    filter_edgeR.1 <- filterByExpr(counts1,
                              design = NULL,
                              group = AML$aggr_lineage,
                              lib.size = NULL,
                              min.count = 1, 
                              min.total.count = 0,
                              large.n = 0, 
                              min.prop = 0.5) 

table(filter_edgeR.1) #FALSE 5541, TRUE 15608

counts1 <- counts1[filter_edgeR.1, ]

Update txInfo according to the filtering procedure

txInfo <- txInfo[which(txInfo$isoform_id %in% rownames(counts1)), ]

remove txs that are the only isoform expressed within a gene (after filtering)

txInfo <- subset(txInfo, duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE)) 
counts1 <- counts1[which(rownames(counts1) %in% txInfo$isoform_id), ] 
txInfo <- txInfo[match(rownames(counts1), txInfo$isoform_id), ]

create group factor

AML$group <- factor(AML$aggr_lineage, levels = unique(AML$aggr_lineage))

generate SummarizedExperiment

sumExp <- SummarizedExperiment::SummarizedExperiment(assays = list(counts = counts1), 
                                                     colData = AML, 
                                                     rowData = txInfo)

specify design formula from colData

metadata(sumExp)$formula <- ~ 0 + as.factor(colData(sumExp)$group)

fit quasi-binomial generalized linear models

system.time({
  sumExp <- satuRn::fitDTU(object = sumExp, 
                           formula = ~ 0 + group, 
                           parallel = FALSE, 
                           BPPARAM = BiocParallel::bpparam(), 
                           verbose = TRUE)
})

Set up contrast matrix

group <- as.factor(AML$group) 
design <- model.matrix(~ 0 + group) 
colnames(design) <- levels(group)

L <- limma::makeContrasts(Contrast1 = HSC_progenitors - T_NK, 
                          Contrast2 = HSC_progenitors - myeloid, 
                          levels = design)

perform the test

sumExp <- satuRn::testDTU(object = sumExp, 
                          contrasts = L, 
                          diagplot1 = TRUE, 
                          diagplot2 = TRUE, 
                          sort = TRUE) # sort according to the empirical p-values

![Screenshot 2022-07-08 at 10 26 09](https://user-images.githubusercontent.com/73837526/177951695-9e908af6-9bb1-4902-bf56-3a882cb67a6e.png)
![Screenshot 2022-07-08 at 10 26 16](https://user-images.githubusercontent.com/73837526/177951713-314b0f72-d229-40d7-8d6b-7a47fcf24b4f.png)

inspect result in Contrast 1

DTU_result <- rowData(sumExp)[["fitDTUResult_Contrast1"]] %>% 
  rownames_to_column(var = 'isoform_id')
length(which(DTU_result$empirical_FDR < 0.1)) # 2

visualize DTU

group1 <- rownames(colData(sumExp))[colData(sumExp)$group == "HSC_progenitors"] 
group2 <- rownames(colData(sumExp))[colData(sumExp)$group == "T_NK"]

satuRn::plotDTU(object = sumExp, 
                contrast = "Contrast1", 
                groups = list(group1, group2), 
                coefficients = list(c(1,0,0,0,0), c(0,0,1,0,0)), 
                summaryStat = "model", 
                transcripts = NULL, 
                top.n = 3)

![Screenshot 2022-07-08 at 11 02 50](https://user-images.githubusercontent.com/73837526/177957655-e8988217-db13-43ee-bbc8-f00c4af2e68d.png)
![Screenshot 2022-07-08 at 11 02 57](https://user-images.githubusercontent.com/73837526/177957622-507a5600-711f-4cd5-929e-173be5d8ce53.png)
![Screenshot 2022-07-08 at 11 03 04](https://user-images.githubusercontent.com/73837526/177957691-f680062a-d30d-4973-b44d-a9324cdfd767.png)

2.  edgeR, custom criteria (more stringent) 

filter_edgeR.2 <- filterByExpr(counts1,
                               design = NULL,
                               group = AML$aggr_lineage,
                               lib.size = NULL,
                               min.count = 5, 
                               min.total.count = 0,
                               large.n = 0, 
                               min.prop = 0.5)

table(filter_edgeR.2) #FALSE 20725, TRUE 424

which, following the same steps as above, yielded:

sumExp <- satuRn::testDTU(object = sumExp, 
                          contrasts = L, 
                          diagplot1 = TRUE, 
                          diagplot2 = TRUE, 
                          sort = TRUE) # sort according to the empirical p-values

![Screenshot 2022-07-08 at 10 37 07](https://user-images.githubusercontent.com/73837526/177952902-1a9fc910-4049-48c5-87f2-70eaf08e5fe3.png)
![Screenshot 2022-07-08 at 10 37 14](https://user-images.githubusercontent.com/73837526/177952940-afc2e048-a237-4c3b-9cdb-23e9c14c23e2.png)

DTU_result <- rowData(sumExp)[["fitDTUResult_Contrast1"]] %>% 
  rownames_to_column(var = 'isoform_id')
length(which(DTU_result$empirical_FDR < 0.1)) # 0


3.  DRIMSeq, criteria as in satuRn paper

filter_dm.1 <- dmFilter(d, 
                        min_samps_feature_expr = ns/2, 
                        min_feature_expr = 10, 
                        min_samps_feature_prop = ns/2, 
                        min_feature_prop = 0.1,
                        min_samps_gene_expr = ns, 
                        min_gene_expr = 10)

filter_dm_names.1 <- names(filter_dm.1@counts) # genes
counts2 <- counts2[which(counts2$gene_id %in% filter_dm_names.1), ]

length(counts2$feature_id) # 279

perform the test

sumExp <- satuRn::testDTU(object = sumExp, 
                          contrasts = L, 
                          diagplot1 = TRUE, 
                          diagplot2 = TRUE, 
                          sort = TRUE) # sort according to the empirical p-values

![Screenshot 2022-07-08 at 10 46 25](https://user-images.githubusercontent.com/73837526/177954537-b6f78648-10e1-4773-a0cb-0f9ba46e7f56.png)
![Screenshot 2022-07-08 at 10 46 31](https://user-images.githubusercontent.com/73837526/177954568-363990ef-52dc-42d2-a5cf-3b2c97713ecb.png)

DTU_result <- rowData(sumExp)[["fitDTUResult_Contrast1"]] %>% 
  rownames_to_column(var = 'isoform_id')
length(which(DTU_result$empirical_FDR < 0.1)) # 1


4.  DRIMSeq, custom criteria (lenient)

filter_dm.2 <- dmFilter(d, 
                        min_samps_gene_expr = 0, 
                        min_samps_feature_expr = 0, 
                        min_samps_feature_prop = ns/20, # 5%
                        min_gene_expr = 0, 
                        min_feature_expr = 0, 
                        min_feature_prop = 0.05)

filter_dm_names.2 <- names(filter_dm.2@counts) # genes
counts2 <- counts2[which(counts2$gene_id %in% filter_dm_names.2), ]
length(counts2$feature_id) # 21128

perform the test

sumExp <- satuRn::testDTU(object = sumExp, 
                          contrasts = L, 
                          diagplot1 = TRUE, 
                          diagplot2 = TRUE, 
                          sort = TRUE) # sort according to the empirical p-values

![Screenshot 2022-07-08 at 10 53 10](https://user-images.githubusercontent.com/73837526/177955798-f0bf7856-f8b7-48de-aff1-e7e68faa27fd.png)
![Screenshot 2022-07-08 at 10 53 16](https://user-images.githubusercontent.com/73837526/177955842-2b500767-afd1-4594-ad7c-070f23611840.png)

DTU_result <- rowData(sumExp)[["fitDTUResult_Contrast1"]] %>% 
  rownames_to_column(var = 'isoform_id')
length(which(DTU_result$empirical_FDR < 0.1)) # 10

satuRn::plotDTU(object = sumExp, 
                contrast = "Contrast1", 
                groups = list(group1, group2), 
                coefficients = list(c(1,0,0,0,0), c(0,0,1,0,0)), 
                summaryStat = "model", 
                transcripts = NULL, 
                top.n = 3)


![Screenshot 2022-07-08 at 10 55 20](https://user-images.githubusercontent.com/73837526/177956337-092b11ba-3538-4d58-af10-3f43fbff8f73.png)
![Screenshot 2022-07-08 at 10 55 31](https://user-images.githubusercontent.com/73837526/177956370-d1fcb50a-6771-4dbd-a9f5-33728d39b6e6.png)
![Screenshot 2022-07-08 at 10 55 38](https://user-images.githubusercontent.com/73837526/177956387-b2297811-bf43-42d6-8ca7-48a18546d39d.png)

3) 'the empirical correction by `satuRn` is too stringent'.
I don't think this is actually a problem, as in my data the empirical FDR is generally quite similar to the regular FDR, so there doesn't seem to be overcorrection.

Based on what I shared, my understanding is that keeping lenient filtering criteria ensures that a high number of isoforms enter the analysis, which is somewhat beneficial at the testing stage (as far as I can tell from the diagnostic plots, which indeed show a nice bell shape and little difference between the real and the empirically adjusted test statistics). However, this also implies that the count matrix contains many more cells with shallow or no coverage; I wonder if this is the reason why I see such behavior in the violin plots of the most significant DTUs, which I interpret as binary usage of the transcript.

Sorry for the long post, and thanks a lot for any clarification / suggestion you could provide

Chiara

jgilis commented 1 year ago

Before I reply in more detail, just a quick question: what kind of sequencing protocol was used to generate this data? I am asking because the data looks very sparse. When going from filter_edgeR.1 to filter_edgeR.2, which only differ in min.count going from 1 to 5, you go from retaining 15.6K to 424 transcripts, which essentially means that most of your counts above 1 are still below 5. In addition, looking at the dot sizes in the DTU plots, it is clear that the expression of the target gene is very low in almost all cells, with usually only one or two cells showing high expression. This is a bit suspicious... Also, could you tell me approximately how many cells you have for the different values of the group variable?

Finally, it seems something went wrong in the DTU plots: they only show usage in one of the two groups. I have to admit that this is not the most user-friendly function, but I also don't really see where the mistake could be; the code looks correct...

ChiaraCaprioli commented 1 year ago

Hi again,

These counts were generated by Nanopore sequencing of full-length cDNA obtained from 10x Chromium, and I am using raw counts for this analysis, so the sparsity seems quite expected to me.

For contrast 1 I showed, I have HSC = 2124 and T cells = 202.

And yes, the DTU plots are really weird... Since the 'coefficients' parameter is the only one I can actively mess up, below I report the contrast matrix I used to set coefficients = list(c(1,0,0,0,0), c(0,0,1,0,0)) for Contrast1:

[screenshot: contrast matrix for Contrast1 and Contrast2]

Thanks for your availability!

jgilis commented 1 year ago

Just one more quick question: what is the size of the smallest group of cells?

ChiaraCaprioli commented 1 year ago
table(sAML1$aggr_lineage)

[screenshot: table of cell counts per aggr_lineage group]

jgilis commented 1 year ago

First of all, satuRn has not been formally benchmarked on data from 10X or other microfluidics platforms, nor on long-read data (and, as far as I know, neither have the other DTU methods). For instance, all benchmarks in our paper are either on bulk data or on SMART-seq2 single-cell data.

Since your data will be much sparser, the signal-to-noise ratio will obviously be much lower, and you will have low power to pick up any signal. Making the filtering more stringent will help, but the downside is that the empirical correction needs sufficient features to work properly; I'd say roughly 500 features is on the lower edge.

The problem induced by the low signal-to-noise ratio is actually a bit subtle. I can explain it using this figure:

[figure: DTU plot where usage is zero in all cells except one high-count cell at about 20%]

The usage is zero in all cells but one, where it is about 20%. However, that cell has a high gene count, which means this observation will be very influential when fitting the quasi-binomial GLM in satuRn. The estimated average usage of the transcript will therefore probably be close to 20%, but with a very large standard error. If you have huge standard errors in both groups (which we can't see yet on the plot, but which is probably the case), you won't have power to infer differential usage.
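
To make this concrete, here is a toy illustration with a plain quasi-binomial glm() (not the exact satuRn fitting code): one high-count cell at ~20% usage among otherwise zero-usage cells dominates the fit.

# toy data, one row per cell: counts of the transcript vs. the other transcripts of its gene
tx_count    <- c(rep(0, 19), 20)   # the transcript is only observed in one cell
other_count <- c(rep(1, 19), 80)   # and that cell has by far the highest gene count

fit <- glm(cbind(tx_count, other_count) ~ 1, family = quasibinomial)

plogis(coef(fit))  # estimated average usage ~0.17, pulled towards that single cell
weights(fit)       # prior weights are the per-cell gene totals: the high-count cell
                   # carries 100x the weight of every other cell in the fit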

In an attempt to alleviate the issue, I would first try to only analyse the two most abundant cell types. Then, for filtering, I'd try to tailor the parameters in such a way that you retain 1K-2K transcripts: a middle ground between filtering stringently and retaining enough features for the empirical correction.
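
As a sketch of that tuning, assuming the same filterByExpr-based filtering as above (the scanned min.count values are arbitrary):

library(edgeR)

# scan a few count thresholds and check how many transcripts survive,
# aiming for roughly 1,000-2,000 retained features
for (mc in 1:5) {
  keep <- filterByExpr(counts1,
                       group = AML$aggr_lineage,
                       min.count = mc,
                       min.total.count = 0,
                       large.n = 0,
                       min.prop = 0.5)
  cat("min.count =", mc, "->", sum(keep), "transcripts retained\n")
}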

W.r.t. the plotting, the coefficients argument is indeed the trickiest to specify, but it does look correct here 💯. It is also a bit suspicious that even though you specify summaryStat = "model", the expected value for the usage is not displayed on the plot. One thought: is group2 a vector of column names that are all present in the column names/colData of the SummarizedExperiment object?
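
A quick sanity check for that, plus one way to build the coefficient vectors directly from the design matrix so they cannot drift out of sync with the column order (a sketch, using the objects from your code above):

# are all requested cells present in the SummarizedExperiment?
all(group1 %in% colnames(sumExp))
all(group2 %in% colnames(sumExp))

# derive the coefficient vectors from the design columns instead of hard-coding them
coef_HSC <- as.numeric(colnames(design) == "HSC_progenitors")  # here: c(1,0,0,0,0)
coef_TNK <- as.numeric(colnames(design) == "T_NK")             # here: c(0,0,1,0,0)
coefficients <- list(coef_HSC, coef_TNK)  # then pass this to plotDTU as before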

I'll try to reproduce the plotting issue in the next few working days!

jgilis commented 1 year ago

Hi @ChiaraCaprioli,

I think I found the issue with the plotDTU function. It seems to happen when there are no counts for the transcript (or, in fact, for any of the transcripts of the corresponding gene) in one of the groups, but the transcript (gene) still passes filtering. Could you check if that is the case, i.e., if all transcripts of the gene (of the transcript you want to visualize) have zero expression in the second group?
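
Something along these lines should let you check that (a sketch; tx_of_interest is a placeholder for one of the transcripts you tried to plot):

library(SummarizedExperiment)

tx_of_interest <- "ENST00000XXXXXX"  # placeholder: a transcript from the affected gene

# all transcripts belonging to the same gene
gene <- rowData(sumExp)$gene_id[rowData(sumExp)$isoform_id == tx_of_interest]
tx_same_gene <- rowData(sumExp)$isoform_id[rowData(sumExp)$gene_id == gene]

# total counts of those transcripts in the cells of the second group
sum(assay(sumExp, "counts")[tx_same_gene, group2])  # 0 would confirm the suspicion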

Kind regards,

Jeroen

jgilis commented 1 year ago

And if this is the case, can you try to uninstall your Bioconductor version of satuRn and install it again from GitHub with

remove.packages("satuRn")
# open new session
devtools::install_github("statOmics/satuRn")
packageVersion("satuRn") # should then be 1.5.1 [edit: 1.5.2 now]

and see if the problem is solved (or if new problems arise). If all went well, you should now get an error for such transcripts when running plotDTU, stating "The requested transcripts/genes could not be retrieved from the provided data or had NA results.", which is the desired behaviour in this case.

jgilis commented 1 year ago

Hi Chiara,

I have now investigated and resolved the root cause of the issue that you raised: a bug in the testDTU function, where NA usages were handled incorrectly. I explain this in more detail in the NEWS file of our package, which I will copy-paste here as well:


We report a bug in satuRn 1.4.0 (Bioconductor release 3.15). The bug was inadvertently introduced in satuRn 1.3.1 (from the former Bioconductor devel). Note that the bug was thus not present in any of the older Bioconductor releases 3.13 and 3.14 (satuRn 1.0.x, 1.1.x and 1.2.x).

Bug details:

Imagine a gene with three isoforms and two cell types. The goal is to assess DTU between cell types. All isoforms are expressed in all cells of cell type 1. However, none of the isoforms are expressed in any of the cells in cell type 2 (i.e., the gene is not expressed in cell type 2).

satuRn computes the log-odds of picking a certain isoform from the pool of isoforms in each cell type, and then compares these log-odds estimates between the cell types. However, in this example, the log-odds of picking a certain isoform from the pool of isoforms in cell type 2 cannot be computed, as there are no data. Hence, the DTU test statistic should be NA. However, due to erroneous handling of NA estimates, which was inadvertently introduced in satuRn 1.3.1 while aiming to resolve GitHub issue 16, the log-odds in cell type 1 will be compared to zero. Hence, (erroneous) results can be obtained for this contrast, even when there are no data in cell type 2.
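
As a small numerical illustration (toy numbers, not the actual package code):

# cell type 1: the isoform of interest has 60 out of 100 gene-level counts
usage_ct1 <- 60 / 100
log(usage_ct1 / (1 - usage_ct1))   # log-odds ~0.41

# cell type 2: the gene is not expressed at all, so the usage is 0 / 0
usage_ct2 <- 0 / 0                 # NaN
log(usage_ct2 / (1 - usage_ct2))   # NaN: the log-odds cannot be estimated, so the
                                   # DTU statistic should be NA rather than a
                                   # comparison of ~0.41 against zero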

Note that in many cases such isoforms may not pass filtering and would not be evaluated at all. However, when analyzing sparse scRNA-seq datasets with a lenient filtering criterion, this problem will apply and will result in mistakes in the inference.


The bug has been resolved in the newer versions of satuRn (1.4.1 and up). Therefore, satuRn 1.4.0 should no longer be used and should be updated to the newer version.

Thank you for pointing me to the issue; I had not noticed it since I mainly work with bulk and plate-based data, which is much less sparse, and I usually filter quite stringently.

Feel free to reach out in case of any other issues, suggestions or questions.

Best regards,

Jeroen

ChiaraCaprioli commented 1 year ago

Thank you very much! This is really helpful. I will give v1.4.1 a try and see what happens.

Regards

Chiara