gosianow / DRIMSeq_before_BioC3.6

Package for differential analyses using the Dirichlet Multinomial distribution
3 stars 0 forks source link

How to analyze experiment with 3 groups #3

Closed kvittingseerup closed 7 years ago

kvittingseerup commented 7 years ago

Hi

My problem is that I would like to analyze a experimental setup with 3 groups where I'm interested in each of the 3 pairwise comparison. I the optimal thing would be to make one model fit (with dmFit) and then afterwards test for differences between the 3 groups. But how do I do that?

An illiustration of the problem can easily be made with the Pasilla data:

### Load libraries
library('DRIMSeq')
library('PasillaTranscriptExpr')

### Lets used the pasilla data
data_dir  <- system.file("extdata", package = "PasillaTranscriptExpr")
pasilla_metadata <- read.table(file.path(data_dir, "metadata.txt"), header = TRUE, as.is = TRUE)
pasilla_counts <- read.table(file.path(data_dir, "counts.txt"), header = TRUE, as.is = TRUE)
gene_id_subset <- readLines(file.path(data_dir, "gene_id_subset.txt"))

### pasilla_samples design matrix with 3 groups
# basicly splitting ctrl into two
pasilla_samples <- data.frame(
    sample_id = pasilla_metadata$SampleName,
    group = c('Nr1','Nr1','Nr2','Nr3','Nr3','Nr3','Nr2')
)

Normally I would in this case use a intercept free design - but it does not seem like that is support (yet) ? (I get this error from dmTest when trying:

Error in solve.default(design_unique, t(logit_prop)) : 
  'a' (3 x 2) must be square

But still using contrasts making all 3 pairwise comparisons should be doable using contrasts:

## Create the design matrix without intercept due to no clear 'ground state'
design_full <- model.matrix(~ group, data = pasilla_samples)
colnames(design_full) <- gsub('group', '', colnames(design_full))
> head(design_full,7)
  (Intercept) Nr2 Nr3
1           1   0   0
2           1   0   0
3           1   1   0
4           1   0   1
5           1   0   1
6           1   0   1
7           1   1   0

### Create and subset dmData
d <- dmDSdata(counts = pasilla_counts, samples = pasilla_samples)
d <- d[names(d) %in% gene_id_subset, ]

### Calculate precision and fit
d <- dmPrecision(d, design = design_full)
d <- dmFit(d, design = design_full, bb_model=TRUE)

Then I can easily compare group 1 and 3:

t1 <- dmTest(d, coef = 'Nr2')
> design(t1)
  (Intercept) Nr3
1           1   0
2           1   0
3           1   0
4           1   1
5           1   1
6           1   1
7           1   0

But problem comes when I want to compare group 2 and 3:

contrast <- c(0,-1, 1)
t2 <- dmTest(d, contrast = contrast)
> design(t2)
       [,1]       [,2]
1 0.7071068 -0.7071068
2 0.7071068 -0.7071068
3 1.2071068 -0.2071068
4 1.2071068 -0.2071068
5 1.2071068 -0.2071068
6 1.2071068 -0.2071068
7 1.2071068 -0.2071068

Which looks wrong right? (I also get the same problem when trying to test group 1 vs group 2 using contrasts with (c(-1,1,0) )

Have I completely misunderstood how to do the contrasts? Is it possible to only build 1 model (with dmFit) and then do the 3 tests?

In advance - thanks /Kristoffer

gosianow commented 7 years ago

Hello,

I am sorry for my late response but I was very busy with submitting my thesis.

Yes, it is possible to build 1 model and fit 3 contrasts for all the two-group comparisons.

One has to use one_way = FALSE in the dmTest function. Like this the regression framework is used for the estimation. By default, one_way = TRUE and since the null design matrices do not correspond to multiple-group-fitting, there is an error.

I will fix that so the one_way parameter is adjusted automatically depending on the design matrix.

The code below tests the 3 contrasts.

### Load libraries
library('DRIMSeq')
library('PasillaTranscriptExpr')

### Lets used the pasilla data
data_dir  <- system.file("extdata", package = "PasillaTranscriptExpr")
pasilla_metadata <- read.table(file.path(data_dir, "metadata.txt"), header = TRUE, as.is = TRUE)
pasilla_counts <- read.table(file.path(data_dir, "counts.txt"), header = TRUE, as.is = TRUE)
gene_id_subset <- readLines(file.path(data_dir, "gene_id_subset.txt"))

### pasilla_samples design matrix with 3 groups
# basicly splitting ctrl into two

pasilla_samples <- data.frame(sample_id = pasilla_metadata$SampleName, group = pasilla_metadata$condition)

pasilla_samples <- data.frame(sample_id = pasilla_metadata$SampleName, group = c('Nr1','Nr1','Nr2','Nr3','Nr3','Nr3','Nr2'))

levels(pasilla_samples$group)

### Create and subset dmData
d <- dmDSdata(counts = pasilla_counts, samples = pasilla_samples)
d <- d[names(d) %in% gene_id_subset, ]
### Filtering
d <- dmFilter(d, min_samps_gene_expr = 7, min_samps_feature_expr = 3, min_gene_expr = 10, min_feature_expr = 10)

## Create the design matrix without intercept due to no clear 'ground state'

design_full2 <- model.matrix(~ -1 + group, data = pasilla_samples)
design_full2

### Calculate precision and fit
d2 <- dmPrecision(d, design = design_full2)

genewise_precision(d2)

d2 <- dmFit(d2, design = design_full2)

head(proportions(d2))

contrast1 <- c(-1, 1, 0)

t21 <- dmTest(d2, contrast = contrast1, one_way = FALSE, verbose = TRUE)

t21@design_fit_null

results(t21)

contrast2 <- c(-1, 0, 1)

t22 <- dmTest(d2, contrast = contrast2, one_way = FALSE, verbose = TRUE)

t22@design_fit_null

results(t22)

contrast3 <- c(0, -1, 1)

t23 <- dmTest(d2, contrast = contrast3, one_way = FALSE, verbose = TRUE)

t23@design_fit_null

results(t23)
gosianow commented 7 years ago

Hello,

the bug is fixed in the new 3.5 Bioconductor release. DRIMSeq version 1.4.1 https://www.bioconductor.org/packages/release/bioc/html/DRIMSeq.html

Now, one does not have to set one_way = FALSE in dmTest, unless wanted.

kvittingseerup commented 7 years ago

That is more than understandable - I hope everything went as it should :-).

Thanks for the thorough reply!

It does however lead me to 1 more question which ties into my previous question about the filtering ( https://github.com/markrobinsonuzh/DRIMSeq/issues/2 ). If i have a dataset with 3 comparisons that are all quite different (such as different developmental stages) there are two possible ways of analyzing the data: 1) Analyze each pairwise comparison individual (resulting in 3 separate DRIMSeq workflows (calculating precision and fitting model 3 times, 3 tests). 2) Build 1 model and test 3 times (1 precision, 1 fit and 3 tests).

The challenge is then that option 2 utilizes all the data to build a model whereby it potentially is more robust (and at lest the one I prefer from a theoretical viewpoint) but if some of the genes are lowly expressed in 1 condition (quite likely due to them being very different) the precition calculation will fail for that gene ( as we discussed in https://github.com/markrobinsonuzh/DRIMSeq/issues/2 ) which again leads to all pariwise comparisons failing meaning one would miss potential changes in that gene.

Do you have any recommendation or thought on what to do in such a situation?

/Kristoffer

gosianow commented 7 years ago

Yes, clearly the situation as you describe may take place.

As far as I know there is no one consensus strategy to follow in such a situation, as it strongly depends on the characteristics of the data and the biological interest.

As you described, both approaches have advantages and disadvantages, and I think one has to choose which aspects are more important for their biological question.

I would say that if there are indeed many genes that have such strong differential expression (in one condition the gene is almost silent) then the approach number 1 should be followed. One does not risk to loose some interesting genes during the filtering step.

In the DRIMSeq approach one assumes that the dispersion is equal in different conditions. If that is not the case, I think the pairwise comparisons should be more robust to that, though so far I have not tested/investigated such a scenario deeply.

For sure, it would be useful to run both approaches and see how many genes are eliminated during the filtering step, investigate the characteristics of genes that are detected to be DTU and also see how much the results overlap. Maybe in the end there is no substantial difference between the detected DTU genes for your dataset of interest.

Overall, to detect a meaningful transcript usage a gene must be expressed at least at some minimal level. This also speaks rather for the approach one where for each comparison this could be guarantied for the compared samples.