piotrp88 opened 8 years ago
I compared survival versus rms 4.4-2 as the first batch. Then I compared survival against rms 4.4-2 after implementing the above suggestions as the second batch. I repeated the same test using a 200MB dataset.
Using the 64MB dataset, the modified rms library performed ~18.15% better than the original one, and using the 200MB dataset it performed even better, ~20.60%. These are the results of my tests (timings are averaged per batch; there are 3 runs per batch and 2 batches per dataset, 24 total runs):
```
# ---------- 64MB dataset (3 runs per batch) -----------
# survival VS rms 4.4-2
survival: 44.583333
rms: 39.735333
Speed gain: 10.87%

# survival VS rms 4.4-2 (using suggested edits)
survival: 43.524000
rms: 32.525000
Speed gain: 25.27%

# ---------- 200MB dataset (3 runs per batch) -----------
# survival VS rms 4.4-2
survival: 172.125667
rms: 155.115000
Speed gain: 9.88%

# survival VS rms 4.4-2 (using suggested edits)
survival: 171.674667
rms: 123.157000
Speed gain: 28.26%
```
As dataset size increases, performance gain increases too.
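For reference, here is a rough sketch of how one batch of timings can be collected. This is not my exact harness; the dataset `d`, the variables, and the model formula are placeholders:

```r
# Rough sketch, not the actual benchmark code: average the elapsed time of
# 3 runs each for survival::coxph() and rms::cph() on the same formula.
library(survival)
library(rms)

avg_elapsed <- function(expr, runs = 3)
  mean(replicate(runs, system.time(eval(expr))["elapsed"]))

# 'd', 'time', 'status', 'x1', 'x2' stand in for the real data and variables
t_survival <- avg_elapsed(quote(coxph(Surv(time, status) ~ x1 + x2, data = d)))
t_rms      <- avg_elapsed(quote(cph(Surv(time, status) ~ x1 + x2, data = d)))
cat(sprintf("survival: %f\nrms: %f\nSpeed gain: %.2f%%\n",
            t_survival, t_rms, 100 * (t_survival - t_rms) / t_survival))
```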
On the first item, I don't quite agree that not storing the lp should imply that certain summary statistics should not be computed.
On the second suggestion, I'd like to know the time overhead of `names(large vector) <- NULL`.
Frank
For the first item, it would probably be useful to add an argument to `cph()`, something like one of the following:

`cph(..., compute.gini=T, compute.dxy=T)`
`cph(..., discr.indexes=c("R2", "Dxy", "g", "gr"))`

so that specifying `cph(..., discr.indexes=NULL)` computes no discrimination indexes at all, while `cph(..., discr.indexes=c("R2"))` computes only the R2 index. More generally, it would be better to let the user control which computations are performed when performance matters, especially computations that involve vectors or matrices of n rows.
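To make the gating idea concrete, here is a purely illustrative sketch. `discr.indexes` is only a proposed argument (it does not exist in `cph()` today), and the helper name and its internals are placeholders rather than actual rms code:

```r
# Illustrative only: gate each discrimination index behind the proposed
# discr.indexes argument so that nothing is computed when it is NULL.
compute_discr <- function(lp, S, discr.indexes = c("Dxy", "g")) {
  out <- list()
  if (is.null(discr.indexes)) return(out)         # skip everything
  if ("g" %in% discr.indexes)
    out$g <- Hmisc::GiniMd(lp)                    # Gini mean difference of the lp
  if ("Dxy" %in% discr.indexes)
    out$Dxy <- Hmisc::rcorr.cens(lp, S)["Dxy"]    # rank correlation with censored S
  out
}
```

The same pattern would extend to R2, gr, and any other index that exists only to summarize the linear predictor.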
For the second suggestion, the time overhead of `names(large vector) <- NULL` on my machine is certainly negligible:
```
> huge.vector <- rnorm(100000000, 1, 2)
> system.time(names(huge.vector) <- NULL)
   user  system elapsed
  0.170   0.237   0.407
```
but I'm running MRO. You should try using a standard R version.
The following suggestions bring a performance improvement that increases with dataset size. As far as I have seen, the improvement easily reaches ~20% on a "small" 64MB dataset. I am currently testing on a 200MB dataset, and if these suggestions are implemented I will test them on a 6.9GB dataset with about 150 million rows, where I expect a performance improvement of more than 20%. By the way, I'm testing this using MRO (Microsoft R Open), which is faster than standard R. I don't know why, but `cph()` gains more than `coxph()` when using MKL (a fast multithreaded math library from Intel shipped with MRO).

Suggestion n. 1
I would like to suggest an edit to R/cph.s to improve performance when `linear.predictors=F`. I think this suggestion accommodates the intention of setting `linear.predictors=F`: everything related to `linear.predictors` shouldn't be calculated, so `GiniMd()` and `dxy.cens()` shouldn't be invoked.

Lines 270 and 271 now are:
I suggest the following:
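In rough terms, the idea is to guard the lp-based computations; the snippet below is only a sketch of that intent, and the variable names are placeholders rather than the exact lines from cph.s:

```r
# Sketch of the intent only, not the actual cph.s code: compute the
# lp-based indexes only when the linear predictor is requested.
if (linear.predictors) {
  gindex <- GiniMd(lp)        # Gini mean difference of the linear predictor
  dxy    <- dxy.cens(lp, Y)   # Somers' Dxy against the (censored) response
}
```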
Suggestion n. 2
Following commit https://github.com/harrelfe/rms/commit/23a4c06b7fe3e5f36ad9b1c77b17884435a1a0d6, the `nonames` parameter should also be evaluated at line 297.

Line 297 now is:
I suggest the following:
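In rough terms (again only a sketch; the object names are placeholders rather than the actual code at line 297), the names assignment would become conditional on the existing `nonames` flag:

```r
# Sketch only: skip assigning names to the large result vector when the
# caller has asked for nonames=TRUE, mirroring the linked commit.
if (!nonames) names(lp) <- rnam
```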