Can't successfully use numeric covariate with runJunctionSeqAnalyses()

davemcg commented 7 years ago

I'm fairly certain I'm doing something wrong, but I can't figure out what.

I'm trying to run the 'simple' analyses, but add one numeric covariate, which was very useful in differential gene expression analysis with DESeq2. I've tried following the guide here http://hartleys.github.io/JunctionSeq/doc/JunctionSeq.pdf, but I'm getting an error with when DESeq2 is creating the DESeqDataSet.

library(JunctionSeq)
library(dplyr)
load('~/git/pax2_rna-seq/data/SampleTable_qorts.Rdata')
junc_seq_info <- SampleTable_qorts %>% 
  filter(age=='E12_5',genotype!='Het') %>% 
  group_by(sample.ID, genotype, age) %>% 
  summarise(ReadPairs_UniqueGene_Ratio=mean(ReadPairs_UniqueGene_Ratio)) %>% 
  mutate(condition=factor(genotype)) %>% ungroup() %>% 
  select(sample.ID, condition, ReadPairs_UniqueGene_Ratio)

junc_seq_info <- junc_seq_info %>% mutate_if(is.character,as.factor) %>% data.frame()
junc_seq_info

row	sample.ID	condition	ReadPairs_UniqueGene_Ratio
1	E12_5HomoR1V	Homo	0.4110526
2	E12_5HomoR2V	Homo	0.6600969
3	E12_5HomoR3V	Homo	0.3238447
4	E12_5WTR1V	WT	0.4906356
5	E12_5WTR2V	WT	0.6612606
6	E12_5WTR3V	WT	0.3957412

countFiles <- paste0(junc_seq_info$sample.ID,"/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz")
jscs_cov <- runJunctionSeqAnalyses(
  sample.files=countFiles,
  sample.names=junc_seq_info$sample.ID,
  condition=junc_seq_info$condition,
  use.covars = junc_seq_info[,"ReadPairs_UniqueGene_Ratio",drop=F],
  flat.gff.file='withNovel.forJunctionSeq.gff.gz',
  test.formula0 = ~ sample + countBin + ReadPairs_UniqueGene_Ratio : countbin,
  test.formula1 = ~ sample + countBin + ReadPairs_UniqueGene_Ratio : countbin + condition : countbin, 
  effect.formula = ~ condition + countBin + ReadPairs_UniqueGene_Ratio : countbin + condition : countbin,
  geneLevel.formula = ~ ReadPairs_UniqueGene_Ratio + condition,
  nCores=10,verbose=TRUE)

I then get this error.

> STARTING runJunctionSeqAnalyses (v1.2.4) (Wed Oct  5 11:16:48 2016)
> rJSA: sample.files:  E12_5HomoR1V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5HomoR2V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5HomoR3V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5WTR1V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5WTR2V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz, E12_5WTR3V/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz
> rJSA: sample.names:  E12_5HomoR1V, E12_5HomoR2V, E12_5HomoR3V, E12_5WTR1V, E12_5WTR2V, E12_5WTR3V
> rJSA: condition:  Homo, Homo, Homo, WT, WT, WT
> rJSA: analysis.type:  junctionsAndExons
> rJSA: use.junctions:  TRUE
> rJSA: use.novel.junctions:  TRUE
> rJSA: use.exons:  TRUE
> rJSA: nCores:  10
> rJSA: use.covars:  c(0.411052625173834, 0.660096920418859, 0.323844651491804, 0.490635633557832, 0.661260596477357, 0.395741156898811)
> rJSA: test.formula0:  ~ sample + countBin + ReadPairs_UniqueGene_Ratio:countbin
> rJSA: test.formula1:  ~ sample + countBin + ReadPairs_UniqueGene_Ratio:countbin + condition:countbin
> rJSA: use.multigene.aggregates:  FALSE
> rJSA: using covars: Wed Oct  5 11:16:48 2016
      covar: ReadPairs_UniqueGene_Ratio
      0.411052625173834, 0.660096920418859, 0.323844651491804, 0.490635633557832, 0.661260596477357, 0.395741156898811
> rJSA: Reading Count files... Wed Oct  5 11:16:48 2016.
-> STARTING readJunctionSeqCounts (Wed Oct  5 11:16:48 2016)
---> RJSC; (v1.2.4)
---> RJSC: samplenames: E12_5HomoR1V,E12_5HomoR2V,E12_5HomoR3V,E12_5WTR1V,E12_5WTR2V,E12_5WTR3V
---> RJSC: flat.gff.file: withNovel.forJunctionSeq.gff.gz
---> RJSC: use.exons:TRUE
---> RJSC: use.junctions:TRUE
---> RJSC: use.novel.junctions:TRUE
---> File read complete.
---> Extracted counts. Found 575963 features so far.
---> Extracted gene-level counts. Found: 43526 genes and aggregate-genes.
---> Removed gene features. Found: 532437 features to be included so far.
---> Note: 143519 counting bins from overlapping genes
--->          There are 3697 multigene aggregates.
--->          There are 8611 genes that are part of an aggregate.
---> Removed multigene-aggregate features. Found: 388918 features to be included so far.
---> Final feature count: 388918 features to be included in the analysis.
---> Extracted feature counts.
---> counts complete.
-----> reading annotation...
-----> formatting annotation...
-----> initial generation...
-----> creating jscs...
-----> generating count vectors... (Wed Oct  5 11:18:14 2016)
    [[Using package "BiocParallel" for parallelization. (Using 10 cores)]]
    getAllJunctionSeqCountVectors: dim(counts) = 388918,6 (Wed Oct  5 11:18:22 2016)
    getAllJunctionSeqCountVectors: dim(gct) = 43526,6
    getAllJunctionSeqCountVectors: out generated. dim = 388918,12 (Wed Oct  5 11:18:36 2016)
-----> count vectors generated (Wed Oct  5 11:18:36 2016)
-----> generating DESeqDataSet... (Wed Oct  5 11:18:36 2016)
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  all variables in design formula must be columns in colData

hartleys commented 7 years ago

So there are two separate problems.

First off: several of your formulas reference "countBin", when it should be "countbin" (note the lower-case 'b'). That's what's giving you the error you are seeing.

Once you fix that typo, you're going to immediately run into a tougher issue: JunctionSeq isn't really designed to deal with continuous data. That problem is going to be much tougher to work around...

You might be able to split it into high/low?

davemcg commented 7 years ago

Ugh. I'm an idiot.
Well, I'll try high/low then. It kind of fits that. I saw the error messages about having to be factors.

hartleys / JunctionSeq

Can't successfully use numeric covariate with runJunctionSeqAnalyses() #15