NA in proportion estimats

kvittingseerup commented 7 years ago

Hi

I'm experimenting a bit with filtering because I would the the fitted proportions to be the "true" proportions of the gene expression - not proportion of only transcripts passing filtering since that would make the downstream interpretation of the proportions easier. Have you any thoughts on this?

During this filtering process I used the example data:

### Set up R session
library('DRIMSeq')
library('PasillaTranscriptExpr')
data_dir  <- system.file("extdata", package = "PasillaTranscriptExpr")
pasilla_metadata <- read.table(file.path(data_dir, "metadata.txt"), header = TRUE, as.is = TRUE)
pasilla_counts <- read.table(file.path(data_dir, "counts.txt"), header = TRUE, as.is = TRUE)

pasilla_samples <- data.frame(sample_id = pasilla_metadata$SampleName, group = pasilla_metadata$condition)
levels(pasilla_samples$group)
d <- dmDSdata(counts = pasilla_counts, samples = pasilla_samples)

gene_id_subset <- readLines(file.path(data_dir, "gene_id_subset.txt"))
d <- d[names(d) %in% gene_id_subset, ]

And extracted proportions both with and without filtering:

### With filtering
d1 <- dmFilter(d, min_samps_gene_expr = 7, min_samps_feature_expr = 3, min_gene_expr = 10, min_feature_expr = 10)
d1 <- dmPrecision(d1, design = design_full)
d1 <- dmFit(d1, design = design_full, bb_model=TRUE)
t1 <- dmTest(d1, coef = "groupKD")

### Without filtering
d2 <- dmPrecision(d, design = design_full)
d2 <- dmFit(d2, design = design_full, bb_model=TRUE)
t2 <- dmTest(d2, coef = "groupKD")

## Get fitted proportions
p1 <- proportions(d1)
p2 <- proportions(d2)

And then I noticed NAs:

> sum(is.na(p2))
[1] 273

With a specific example bing the transcript: FBtr0100282

### Counts
> d2@counts@unlistData[which(rownames(d2@counts@unlistData) == 'FBtr0100282'),1:4]
GSM461176 GSM461177 GSM461178 GSM461179 
1352.950   593.985   404.469  2118.479 
>         d1@counts@unlistData[which(rownames(d1@counts@unlistData) == 'FBtr0100282'),1:4]
GSM461176 GSM461177 GSM461178 GSM461179 
1352.950   593.985   404.469  2118.479 

### Proportions
> subset(p1, p1$feature_id == 'FBtr0100282')[,1:4]
gene_id  feature_id GSM461176 GSM461177
50 FBgn0034420 FBtr0100282 0.2058465 0.2058465

> subset(p2, p2$feature_id == 'FBtr0100282')[,1:4]
gene_id  feature_id GSM461176 GSM461177
68 FBgn0034420 FBtr0100282        NA        NA

And in fact all proportions of the parrent gene FBgn0034420 are NA:

> all(is.na( subset(p2, p2$gene_id == 'FBgn0034420')[,3:ncol(p2)] ))
[1] TRUE

And the test result is also NA:

> subset( results(t1, level = "feature"), feature_id=='FBtr0100282')
       gene_id  feature_id        lr df    pvalue adj_pvalue
50 FBgn0034420 FBtr0100282 0.3580878  1 0.5495702  0.8267506

> subset( results(t2, level = "feature"), feature_id=='FBtr0100282')
       gene_id  feature_id lr df pvalue adj_pvalue
68 FBgn0034420 FBtr0100282 NA  1     NA         NA

Could you elaborate why it is a problem not filtering?

In advance - thanks. /Kristoffer

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.4                  DRIMSeq_1.3.4              
[3] PasillaTranscriptExpr_1.4.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10            magrittr_1.5            XVector_0.16.0         
 [4] edgeR_3.18.0            GenomicRanges_1.28.0    BiocGenerics_0.22.0    
 [7] devtools_1.12.0         zlibbioc_1.22.0         IRanges_2.10.0         
[10] munsell_0.4.3           BiocParallel_1.10.0     colorspace_1.3-2       
[13] lattice_0.20-35         stringr_1.2.0           GenomeInfoDb_1.12.0    
[16] tools_3.4.0             parallel_3.4.0          grid_3.4.0             
[19] gtable_0.2.0            withr_1.0.2             lazyeval_0.2.0         
[22] digest_0.6.12           tibble_1.3.0            GenomeInfoDbData_0.99.0
[25] reshape2_1.4.2          ggplot2_2.2.1           S4Vectors_0.14.0       
[28] bitops_1.0-6            RCurl_1.95-4.8          memoise_1.1.0          
[31] stringi_1.1.5           limma_3.32.1            compiler_3.4.0         
[34] scales_0.4.1            stats4_3.4.0            locfit_1.5-9.1

gosianow commented 7 years ago

Hello,

I had a look to the code.

In the case of no filtering, many of the NAs are generated because there are genes with only one transcript or with transcripts that have zeros in all the samples.

Such cases should be filtered out, and such minimal filtering can be done by runningdmFilter(d)with no parameters.

However, even after this filtering, there are still few genes for which NAs are generated.

Usually, those genes have transcripts with many zeros, and I think this a reason why the precision (dispersion) can not be estimated (there are NAs in genewise_precision(d2)) and then the proportions can not be calculated.

Currently, this can not be avoided and to stabilize the estimation some more filtering must be done to reduce the amount of zeros. But it could be done based on transcripts only, for example:

dmFilter(d, min_samps_feature_expr = 3, min_feature_expr = 1)

I can have a deeper look to that later. I think it will be necessary to improve the optimization strategy. Currently, we use some R optimization functions, but maybe a more robust implementation of Newton-Raphson algorithm will be needed.

Best,

Gosia

### Set up R session
library('DRIMSeq')
library('PasillaTranscriptExpr')
data_dir  <- system.file("extdata", package = "PasillaTranscriptExpr")
pasilla_metadata <- read.table(file.path(data_dir, "metadata.txt"), header = TRUE, as.is = TRUE)
pasilla_counts <- read.table(file.path(data_dir, "counts.txt"), header = TRUE, as.is = TRUE)

pasilla_samples <- data.frame(sample_id = pasilla_metadata$SampleName, group = pasilla_metadata$condition)
levels(pasilla_samples$group)
d <- dmDSdata(counts = pasilla_counts, samples = pasilla_samples)

gene_id_subset <- readLines(file.path(data_dir, "gene_id_subset.txt"))
d <- d[names(d) %in% gene_id_subset, ]

design_full <- model.matrix(~ group, data = samples(d))
design_full

### With filtering
d1 <- dmFilter(d, min_samps_gene_expr = 7, min_samps_feature_expr = 3, min_gene_expr = 10, min_feature_expr = 10)

d1 <- dmPrecision(d1, design = design_full)

d1 <- dmFit(d1, design = design_full, bb_model=TRUE)

### Using one way approach

d2 <- dmFilter(d)

d2 <- dmPrecision(d2, design = design_full)

genewise_precision(d2)

d2 <- dmFit(d2, design = design_full, bb_model=TRUE)

## Get fitted proportions

p2 <- proportions(d2)

c2 <- counts(d2)

c2[c2$gene_id %in% p2[is.na(p2[, 3]), "gene_id"], ]

### Using regression

d3 <- dmFilter(d)

d3 <- dmPrecision(d3, design = design_full, one_way = FALSE)

genewise_precision(d3)

d3 <- dmFit(d3, design = design_full, bb_model = TRUE, one_way = FALSE)

## Get fitted proportions

p3 <- proportions(d3)

c3 <- counts(d3)

c3[c3$gene_id %in% p3[is.na(p3[, 3]), "gene_id"], ]

### Using one way approach + some filtering (but with min_samps_feature_expr = 1:3, there is still one gene with NAs)

d2 <- dmFilter(d, min_samps_feature_expr = 4, min_feature_expr = 1)

d2 <- dmPrecision(d2, design = design_full)

genewise_precision(d2)

d2 <- dmFit(d2, design = design_full, bb_model=TRUE)

## Get fitted proportions

p2 <- proportions(d2)

c2 <- counts(d2)

c2[c2$gene_id %in% p2[is.na(p2[, 3]), "gene_id"], ]

### Using regression + some filtering (but with min_samps_feature_expr = 1:2, there is still one gene with NAs)

d3 <- dmFilter(d, min_samps_feature_expr = 3, min_feature_expr = 1)

d3 <- dmPrecision(d3, design = design_full, one_way = FALSE)

genewise_precision(d3)

d3 <- dmFit(d3, design = design_full, bb_model = TRUE, one_way = FALSE)

## Get fitted proportions

p3 <- proportions(d3)

c3 <- counts(d3)

c3[c3$gene_id %in% p3[is.na(p3[, 3]), "gene_id"], ]

kvittingseerup commented 7 years ago

Tanks for clarifying that - it really shows the nessesarity of filtering the data :-)

gosianow / DRIMSeq_before_BioC3.6

NA in proportion estimats #2