hartleys / JunctionSeq

The JunctionSeq R package is a powerful tool for testing and visualizing differential usage of exons and splice junctions in next-generation, high-throughput RNA-Seq experiments.
28 stars 16 forks source link

Can't successfully use numeric covariate with runJunctionSeqAnalyses() #15

Closed davemcg closed 7 years ago

davemcg commented 7 years ago

I'm fairly certain I'm doing something wrong, but I can't figure out what.

I'm trying to run the 'simple' analyses, but add one numeric covariate, which was very useful in differential gene expression analysis with DESeq2. I've tried following the guide here http://hartleys.github.io/JunctionSeq/doc/JunctionSeq.pdf, but I'm getting an error with when DESeq2 is creating the DESeqDataSet.

library(JunctionSeq)
library(dplyr)
load('~/git/pax2_rna-seq/data/SampleTable_qorts.Rdata')
junc_seq_info <- SampleTable_qorts %>% 
  filter(age=='E12_5',genotype!='Het') %>% 
  group_by(sample.ID, genotype, age) %>% 
  summarise(ReadPairs_UniqueGene_Ratio=mean(ReadPairs_UniqueGene_Ratio)) %>% 
  mutate(condition=factor(genotype)) %>% ungroup() %>% 
  select(sample.ID, condition, ReadPairs_UniqueGene_Ratio)

junc_seq_info <- junc_seq_info %>% mutate_if(is.character,as.factor) %>% data.frame()
junc_seq_info
row sample.ID condition ReadPairs_UniqueGene_Ratio
1 E12_5HomoR1V Homo 0.4110526
2 E12_5HomoR2V Homo 0.6600969
3 E12_5HomoR3V Homo 0.3238447
4 E12_5WTR1V WT 0.4906356
5 E12_5WTR2V WT 0.6612606
6 E12_5WTR3V WT 0.3957412
countFiles <- paste0(junc_seq_info$sample.ID,"/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz")
jscs_cov <- runJunctionSeqAnalyses(
  sample.files=countFiles,
  sample.names=junc_seq_info$sample.ID,
  condition=junc_seq_info$condition,
  use.covars = junc_seq_info[,"ReadPairs_UniqueGene_Ratio",drop=F],
  flat.gff.file='withNovel.forJunctionSeq.gff.gz',
  test.formula0 = ~ sample + countBin + ReadPairs_UniqueGene_Ratio : countbin,
  test.formula1 = ~ sample + countBin + ReadPairs_UniqueGene_Ratio : countbin + condition : countbin, 
  effect.formula = ~ condition + countBin + ReadPairs_UniqueGene_Ratio : countbin + condition : countbin,
  geneLevel.formula = ~ ReadPairs_UniqueGene_Ratio + condition,
  nCores=10,verbose=TRUE)

I then get this error.

> STARTING runJunctionSeqAnalyses (v1.2.4) (Wed Oct  5 11:16:48 2016)
> rJSA: sample.files:  E12_5HomoR1V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5HomoR2V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5HomoR3V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5WTR1V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5WTR2V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5WTR3V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz
> rJSA: sample.names:  E12_5HomoR1V, E12_5HomoR2V, E12_5HomoR3V, E12_5WTR1V, E12_5WTR2V, E12_5WTR3V
> rJSA: condition:  Homo, Homo, Homo, WT, WT, WT
> rJSA: analysis.type:  junctionsAndExons
> rJSA: use.junctions:  TRUE
> rJSA: use.novel.junctions:  TRUE
> rJSA: use.exons:  TRUE
> rJSA: nCores:  10
> rJSA: use.covars:  c(0.411052625173834, 0.660096920418859, 0.323844651491804, 0.490635633557832, 0.661260596477357, 0.395741156898811)
> rJSA: test.formula0:  ~ sample + countBin + ReadPairs_UniqueGene_Ratio:countbin
> rJSA: test.formula1:  ~ sample + countBin + ReadPairs_UniqueGene_Ratio:countbin + condition:countbin
> rJSA: use.multigene.aggregates:  FALSE
> rJSA: using covars: Wed Oct  5 11:16:48 2016
      covar: ReadPairs_UniqueGene_Ratio
      0.411052625173834, 0.660096920418859, 0.323844651491804, 0.490635633557832, 0.661260596477357, 0.395741156898811
> rJSA: Reading Count files... Wed Oct  5 11:16:48 2016.
-> STARTING readJunctionSeqCounts (Wed Oct  5 11:16:48 2016)
---> RJSC; (v1.2.4)
---> RJSC: samplenames: E12_5HomoR1V,E12_5HomoR2V,E12_5HomoR3V,E12_5WTR1V,E12_5WTR2V,E12_5WTR3V
---> RJSC: flat.gff.file: withNovel.forJunctionSeq.gff.gz
---> RJSC: use.exons:TRUE
---> RJSC: use.junctions:TRUE
---> RJSC: use.novel.junctions:TRUE
---> File read complete.
---> Extracted counts. Found 575963 features so far.
---> Extracted gene-level counts. Found: 43526 genes and aggregate-genes.
---> Removed gene features. Found: 532437 features to be included so far.
---> Note: 143519 counting bins from overlapping genes
--->          There are 3697 multigene aggregates.
--->          There are 8611 genes that are part of an aggregate.
---> Removed multigene-aggregate features. Found: 388918 features to be included so far.
---> Final feature count: 388918 features to be included in the analysis.
---> Extracted feature counts.
---> counts complete.
-----> reading annotation...
-----> formatting annotation...
-----> initial generation...
-----> creating jscs...
-----> generating count vectors... (Wed Oct  5 11:18:14 2016)
    [[Using package "BiocParallel" for parallelization. (Using 10 cores)]]
    getAllJunctionSeqCountVectors: dim(counts) = 388918,6 (Wed Oct  5 11:18:22 2016)
    getAllJunctionSeqCountVectors: dim(gct) = 43526,6
    getAllJunctionSeqCountVectors: out generated. dim = 388918,12 (Wed Oct  5 11:18:36 2016)
-----> count vectors generated (Wed Oct  5 11:18:36 2016)
-----> generating DESeqDataSet... (Wed Oct  5 11:18:36 2016)
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  all variables in design formula must be columns in colData
hartleys commented 7 years ago

So there are two separate problems.

First off: several of your formulas reference "countBin", when it should be "countbin" (note the lower-case 'b'). That's what's giving you the error you are seeing.

Once you fix that typo, you're going to immediately run into a tougher issue: JunctionSeq isn't really designed to deal with continuous data. That problem is going to be much tougher to work around...

You might be able to split it into high/low?

davemcg commented 7 years ago
  1. Ugh. I'm an idiot.
  2. Well, I'll try high/low then. It kind of fits that. I saw the error messages about having to be factors.