lcolladotor / derfinder

Annotation-agnostic differential expression analysis of RNA-seq data via expressed regions-level or single base-level approaches
http://lcolladotor.github.io/derfinder
42 stars 15 forks source link

Speed up p-value calculation #29

Closed lcolladotor closed 9 years ago

lcolladotor commented 9 years ago

It would be useful to speed up the p-value calculation step.

One quick solution would be to round the areas a bit, then find the p-values for the unique areas and then assign them to each candidate DER.

> length(fullRegions$area)
[1] 689543
## Not many unique areas
> length(unique(fullRegions$area))
[1] 689535
## By rounding to 2 digits the areas greatly simplify
> length(unique(round(fullRegions$area, 2)))
[1] 151811
lcolladotor commented 9 years ago

I seemed to have much faster code for this now. I'll make a couple tests before pushing the new code to the repo

jtleek commented 9 years ago

ok great. would be interested to make sure we have some tests to make sure the p-values don't change much.

On Tue, Jun 9, 2015 at 12:21 PM Leonardo Collado-Torres < notifications@github.com> wrote:

I seemed to have much faster code for this now. I'll make a couple tests before pushing the new code to the repo

— Reply to this email directly or view it on GitHub https://github.com/lcolladotor/derfinder/issues/29#issuecomment-110420811 .

lcolladotor commented 9 years ago
  Absolute difference (secs) Absolute difference (mins)
1                    32520.7                   542.0117
  Absolute difference (hours) original / new new / original
1                    9.033528       25055.47   3.991144e-05

Tested with Brainspan data, got exact same results in a few seconds compared to the 9 hours from before. The old version is 25k times slower than the new version with this data. I suspect that new version barely increases in run time as the number of values (both observed and null) increases. There are details about the speed of findInterval() in its help page.

In conclusion, this new function is great and I'll add it soon. Probably worth pushing to release branch, or could be a motivation to have users switch to devel. Although, commits to release branch are only supposed to be bug fixes.

See derMisc repo for details.

jtleek commented 9 years ago

wow that is awesome!

On Thu, Jun 11, 2015 at 4:26 PM Leonardo Collado-Torres < notifications@github.com> wrote:

Absolute difference (secs) Absolute difference (mins)1 32520.7 542.0117 Absolute difference (hours) original / new new / original1 9.033528 25055.47 3.991144e-05

Tested with Brainspan data, got exact same results in a few seconds compared to the 9 hours from before. The old version is 25k times slower than the new version with this data. I suspect that new version barely increases in run time as the number of values (both observed and null) increases. There are details about the speed of findInterval() in its help page.

In conclusion, this new function is great and I'll add it soon. Probably worth pushing to release branch, or could be a motivation to have users switch to devel. Although, commits to release branch are only supposed to be bug fixes.

See derMisc repo for details.

— Reply to this email directly or view it on GitHub https://github.com/lcolladotor/derfinder/issues/29#issuecomment-111268038 .