harrelfe / rms

Regression Modeling Strategies
https://hbiostat.org/R/rms

Suggesting two small performance improvements #23

Open piotrp88 opened 8 years ago

piotrp88 commented 8 years ago

The following suggestions bring a performance improvement that grows with dataset size. As far as I have seen, the improvement easily reaches ~20% on a "small" 64MB dataset. I am currently testing on a 200MB dataset. If these suggestions are implemented, I will test them on a 6.9GB dataset with about 150 million rows, where I expect improvements of more than 20%. By the way, I'm testing this with MRO (Microsoft R Open), which is faster than standard R. I don't know why, but cph() gains more than coxph() from MKL (the fast multithreaded math library from Intel that ships with MRO).

Suggestion n. 1

I would like to suggest an edit to R/cph.s to improve performance when linear.predictors=F. I think this change matches the intention of setting linear.predictors=F: nothing related to the linear predictors should be calculated, so GiniMd() and dxy.cens() shouldn't be invoked.

Lines 270 and 271 now are:

gindex <- GiniMd(f$linear.predictors)
dxy <- dxy.cens(f$linear.predictors, Y, type='hazard')['Dxy']

I suggest the following:

gindex <- NA; if(linear.predictors) gindex <- GiniMd(f$linear.predictors)
dxy <- NA; if(linear.predictors) dxy <- dxy.cens(f$linear.predictors, Y, type='hazard')['Dxy']
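For context, a minimal usage sketch (the formula, variables, and data frame d are invented here, not taken from the thread) of a call where the guarded statistics would be skipped:

library(rms)
# hypothetical fit; with linear.predictors=FALSE the patched lines above
# would leave gindex and dxy as NA instead of scanning the n-row
# linear predictor vector twice
f <- cph(Surv(futime, death) ~ age + sex, data = d, linear.predictors = FALSE)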

Suggestion n. 2

Following commit https://github.com/harrelfe/rms/commit/23a4c06b7fe3e5f36ad9b1c77b17884435a1a0d6, the nonames parameter should also be evaluated at line 297.

Line 297 now is:

names(f$linear.predictors) <- rnam

I suggest the following:

if(!nonames) names(f$linear.predictors) <- rnam
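As a usage sketch (assuming the nonames argument added by that commit; the model and data names are again invented):

# with nonames=TRUE, cph() is asked not to store observation names, so the
# guarded assignment above is skipped and no length-n character vector of
# names is ever built
f <- cph(Surv(futime, death) ~ age + sex, data = d, nonames = TRUE)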
piotrp88 commented 8 years ago

I compared survival versus rms 4.4-2 as the first batch. Then I compared survival against rms 4.4-2 with the above suggestions applied as the second batch. I repeated the same test using a 200MB dataset.

Using the 64MB dataset, the modified rms library performed ~18.15% better than the original one, and with the 200MB dataset it performed even better: ~20.60%.

These are the results of my tests (timings are averaged within each batch; 3 runs per library per batch, 2 batches per dataset, and 2 datasets give 24 timed runs in total):

# ---------- 64MB dataset (3 runs per batch) -----------
# survival VS rms 4.4-2
survival: 44.583333
rms: 39.735333
Speed gain: 10.87%

# survival VS rms 4.4-2 (using suggested edits)
survival: 43.524000
rms: 32.525000
Speed gain: 25.27%

# ---------- 200MB dataset (3 runs per batch) -----------
# survival VS rms 4.4-2
survival: 172.125667
rms: 155.115000
Speed gain: 9.88%

# survival VS rms 4.4-2 (using suggested edits)
survival: 171.674667
rms: 123.157000
Speed gain: 28.26%

As dataset size increases, performance gain increases too.

harrelfe commented 8 years ago

On the first item, I don't quite agree that not storing the lp should imply that certain summary statistics should not be computed.

On the second suggestion, I'd like to know the time overhead of names(large vector) <- NULL.

Frank

piotrp88 commented 8 years ago

For the first item, it would probably be useful to add an argument to cph() like one of the following:

More generally it would be better to let the user have control over which computations have to be done when performance matters, especially for computations which involve vectors or matrixes of n rows.
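For illustration only, a hypothetical sketch of such a switch (the argument name compute.stats is invented here, not proposed in the thread):

# hypothetical: guard every O(n) summary statistic behind a single flag
gindex <- NA; dxy <- NA
if(compute.stats) {
  gindex <- GiniMd(f$linear.predictors)
  dxy    <- dxy.cens(f$linear.predictors, Y, type = 'hazard')['Dxy']
}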

For the second suggestion, the time overhead of names(large vector) <- NULL on my machine is certainly negligible:

> huge.vector <- rnorm(100000000, 1, 2)
> system.time(names(huge.vector) <- NULL)
   user  system elapsed 
  0.170   0.237   0.407 

but I'm running MRO. You should try using a standard R version.
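One caveat worth noting: huge.vector above never had names attached, so the timing may not reflect the cost of dropping real names. A minimal follow-up sketch in base R (the smaller vector is only to keep the example quick) that times removal of names that actually exist:

> v <- rnorm(1e7)
> names(v) <- as.character(seq_along(v))  # attach real names first
> system.time(names(v) <- NULL)           # then time their removal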