Differential Usage Numbers only

FallenAzazel commented 8 years ago

Hi,

I am currently going through some RNA-seq data using QoRTs and JunctionSeq. I have been very impressed with JunctionSeq and the amount of visualisation it produces to help interpret the data.

This can often be time consuming on a dataset, I have gone through the literature as much as possible but wanted to ask if there is a shortcut to getting the actual counts for higher in Case 1 and higher in Case 2 events without having to build all the plots and check the testDU.html file. I would like to run a Monte Carlo type simulation on my control dataset and having to build all plots just to get those two figures per test would be very time consuming.

Apologies if there is a quick way to those numbers that I havent seen,

Appreciate your assistance :)

hartleys commented 8 years ago

I think what you want is the writeCompleteResults function.

This function can be run after all the analysis, but before (or instead of) the plotting and visualization. It produces a pile of gzip-compressed tab-delimited text tables with the information you're looking for. By default it prints a whole bunch of different files, but it has options to turn them off individually to save time and disk space.

For example, if you just want the big expression tables containing all the genes, you can use the command:

 writeCompleteResults(jscs, outfile.prefix, 
                    save.sigGenes = FALSE,
                    save.bedTracks = FALSE)

or if you just want the genes with statistically significant results in them:

 writeCompleteResults(jscs, outfile.prefix, 
                    save.allGenes = FALSE,
                    save.bedTracks = FALSE)

Which numbers exactly are you looking for?

FallenAzazel commented 8 years ago

Thanks for the reply, I will give that a go.

The numbers I am after are the figures from the MA plot giving the total higher in case 1 and higher in case 2.

Really do appreciate your help with this. Cheers

hartleys commented 8 years ago

So what exactly are you trying to do here? Just to check: you are aware that JunctionSeq does not simply detect exons/junctions that are higher/lower in case 1 vs case 2, yes?

What it actually does is detect differential USAGE, in which an exon or junction is higher/lower _relative_ to the expression of the gene as a whole.

Differential usage of exons is intended for use as a proxy for differential usage of full-length transcripts. Thus: two exons being DU doesn't necessarily mean anything more is going on than if one exon is DU. Both exons may belong to the same transcript, which is what is driving the differential usage of both exons.

Furthermore: the structure of the model means there is no universal "baseline" for comparison, and the raw number of exons that are marked as significant may not actually mean anything.

For example: if one exon is being differentially skipped in case 1, then JunctionSeq might:

Mark that exon as differentially used.
Mark all other exons as differentially used.
Mark all exons as differentially used.

Technically, all three are "correct" under the definition of differential usage. Which one JunctionSeq will do will depending on technical factors like the size, coverage, read density, and amount of sequence bias over the various exons and junctions.

I think what you actually want is the number of genes with significant DU. You can get this number fairly easily using a command like:

 length(unique(fData(jscs)$geneID[(fData(jscs)$geneWisePadj < 0.01)]))

fData(jscs) gets the big results matrix with all the p-values in it. Then we just count the number of unique geneID's where the p-value is less than 0.01.

Note that this does not distinguish between up-regulation and down-regulation of exons. But since everything is relative to gene-level expression that wouldn't mean anything anyways. If one transcript is up-regulated relative to gene-level expression, then by definition at least one other transcript must be down-regulated relative to gene-level expression (since "gene-level expression" is just an aggregate of all the transcripts).

Does this make sense? Is this what you are looking for?

FallenAzazel commented 8 years ago

Thank you so much for your reply. Sorry about the delay in putting a response.

To give a better explanation of what we are trying to do.

I have some RNA-seq data for control samples vs affected samples. I have found through DESeq2 that there is a lot of differential expression of transcripts going on. We were also interested to see if there are any novel transcripts in the dataset as well as looking at alternative splicing of exons. We had looked at DEXSeq but was recommended to investigate JunctionSeq which I have been doing for the last few weeks.

So with the data it appeared comparison of the control vs affected JunctionSeq was noting over 3000 DU events. Now this part might not make much sense, it was suggested by the PI to look at a subset of the control dataset (for example 5 control vs 6 control) and try a monte carlo simulation and see if lots of DU events would be reported or very few (his way to see if what was being reported for the control vs affected was genuine.....)

To be honest it would be great to communicate via email if you would be willing ?

Thanks again!!

hartleys commented 8 years ago

So the individual exonic regions are often extremely correlated, particularly in highly-annotated exons where each marked region might only be a few bp wide. This means that often adjacent exonic regions will have almost the exact same counts (since most of the reads that cover one also cover the other).

You want the gene-level results. Again, you can get the # of significant genes using the command:

 length(unique(fData(jscs)$geneID[(fData(jscs)$geneWisePadj < 0.01)]))

You're probably going to want the p-values themselves. You can pull the p-value for each gene with the command:

 sapply(as.character(unique(fData(jscs)$geneID)), 
    function(g){ 
       fData(jscs)$geneWisePadj[as.character(fData(jscs)$geneID) == g][1]
    }, 
    USE.NAMES=TRUE)

(That's just off the top of my head. I'm sure there's an easier way to do that...)

Note that these are adjusted p-values. Unfortunately, JunctionSeq does not calculate unadjusted gene-level p-values, since it does the multiplicity adjustment for the multiple exons and multiple genes simultaneously.

I'm not sure exactly how you plan to set up your simulations, but these will at least give you an approximate idea of what a null dataset would give you. Run the various 5v6 control-control experiments and see what you get.

FallenAzazel commented 8 years ago

Thank you so much for the detailed help, its very much appreciated.

I am just going to loop through the combinations of the 5v5 from the dataset and see how we go, thank you :+1:

hartleys / JunctionSeq

Differential Usage Numbers only #14