kohleth closed this pull request 7 years ago
Dear kohleth,
To be honest with you: a dream has come true, I was hoping to get some feedback on my code now that the package gains in popularity! Thank you so very much!
I was also thinking about some improvements lately:
`OneR`: Finding which attribute is the best predictor can be sped up massively by just calculating the number of errors (`nerrors`) for each "tabled" attribute, without splitting and `Map`ping at all. So for the main routine, instead of:

```r
perf <- c()
for (iter in 1:(ncol(data) - 1)) {
  groups <- split(data, data[ , iter])
  majority <- lapply(groups, mode)
  real <- lapply(groups, function(x) x[ , ncol(x)])
  perf <- c(perf, sum(unlist(Map("==", real, majority))))
}
```
just:

```r
nerrors <- function(tbl) {
  sum(rowSums(tbl) - apply(tbl, 1, max))
}
```

and

```r
tables <- lapply(data[ , 1:(ncol(data) - 1)], table, data[ , ncol(data)])
errors <- sapply(tables, nerrors)
perf <- nrow(data) - errors
```
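To make the error-counting trick concrete, here is a minimal sketch on a hypothetical toy data set (the vectors `x` and `y` are made up for illustration): for each level of the predictor, every observation that does not fall into that level's majority class counts as one error.

```r
# hypothetical toy data: one predictor with two levels, binary target
x <- c("a", "a", "a", "b", "b")
y <- c(1, 1, 2, 2, 2)
tbl <- table(x, y)   # cross-tabulation of predictor levels vs. classes

nerrors <- function(tbl) {
  # per level: all observations minus those in the majority class
  sum(rowSums(tbl) - apply(tbl, 1, max))
}

nerrors(tbl)  # 1: only the single y == 2 case in group "a" is misclassified
```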
`optbin`, `infogain_midpoint`: a small improvement can be realized by minimizing `after_entropies` instead of maximizing `infogains`. So instead of:

```r
# calculate information gains and choose the highest one
infogains <- entropy(target) - (below_entropies + above_entropies)
midpoints[which.max(infogains)]
```
the following:

```r
# calculate entropies after the split and choose the lowest one
after_entropies <- below_entropies + above_entropies
midpoints[which.min(after_entropies)]
```
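The two formulations pick the same midpoint because `entropy(target)` is the same constant for every candidate split, so subtracting it cannot change which index wins. A quick sketch with made-up entropy values:

```r
# made-up entropy values for three candidate midpoints
below_entropies <- c(0.9, 0.4, 0.7)
above_entropies <- c(0.5, 0.3, 0.8)
target_entropy  <- 1.0  # constant across midpoints, irrelevant to the argmax

infogains       <- target_entropy - (below_entropies + above_entropies)
after_entropies <- below_entropies + above_entropies

which.max(infogains) == which.min(after_entropies)  # TRUE for any constant offset
```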
What do you think? Perhaps you want to try this out too with your large dataset? I would be very interested in the performance gain.
Do you see any other improvements?
I am not so well versed when it comes to using github… do I have to do something concerning your pull-request?
Thank you very much again!
best holger
Great! Normally, if my code is useful, you can choose to merge it. But in this case, since you have even better code, you can just ignore this thread (I will close it). It would be great, though, if you could push your latest code to the repository and then to CRAN.
The only other change (not an improvement, just a change) that I would hope for is for `OneR` to accept arguments `x` and `y` instead of a formula. Of course, you can always add a wrapper on top of it, so something like `OneR.default(x, y, ...)` and `OneR(formula, ...)`. The reason for this is that it becomes more compatible with other `caret` methods, making it easier to write a custom `caret` method with your package.
Dear kohleth,
Thank you again!
Have you tried my new code with your data set? How fast does it run?
Your second point: I think I have an idea what you mean, but could you give me some more details or a reference how to do it exactly?
Thank you for the very productive conversation.
best h
oh yes, my last test executing the entire function gives a median of 2.7s for my code and 1.8s for your new code.
With the second point, I don't really know a good reference. Perhaps the section titled 'Generic function and method dispatch' in here?
I think ultimately there will be 3 functions: the generic `OneR = function(x, ...) UseMethod("OneR")`, one for formulas `OneR.formula = function(form, ...) {the first 13 lines of the current OneR, plus a call to OneR.default}`, and finally `OneR.default = function(x, y, ...) {the rest of the current OneR}`. This way, users can call `OneR` with either a `formula` argument or a `data.frame` argument.
Perhaps the transpose function is a good example: `base:::t`, `base:::t.data.frame`, `base:::t.default`.
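A minimal sketch of that three-function layout (the bodies below are placeholders for illustration, not the actual OneR internals):

```r
# generic: dispatch on the class of the first argument
OneR <- function(x, ...) UseMethod("OneR")

# formula method: build the model frame (response in column 1),
# then hand off to the default method
OneR.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  OneR.default(mf[ , -1, drop = FALSE], mf[ , 1], ...)
}

# default method: takes x and y directly
OneR.default <- function(x, y, ...) {
  # placeholder: the actual algorithm would go here
  structure(list(x = x, y = y), class = "OneR")
}

fit1 <- OneR(Species ~ ., data = iris)        # formula interface
fit2 <- OneR(iris[ , 1:4], iris$Species)      # x/y interface via the default method
```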
Thank you, sounds good.
Just to be sure: You hopefully got the same result each time, so that my new code is not only fast, but also correct?
(of course I will conduct a lot of tests before I deploy it to CRAN.)
yes, I checked the result. From the OneR function's perspective, the results are the same. But the code snippets actually produced slightly different results -- the same numerical values, but the original snippet returns a numeric `perf`, whereas the new one returns an integer `perf`.
Very interesting observation. I will check whether this could pose some problems later on in the code (but it seems uncritical at the moment).
Just pushed the new optimized version to github... would be great if you could again try this version with your big dataset - Thank you!
Hi vonjd,
just tried the github code. It certainly worked fine and well.
Great!
Now I have added the S3 method interface to the OneR and optbin functions for formulas and data frames as suggested. If you could have a look at them that would be very helpful :-)
After additional tests I will push the new version 2.2 to CRAN
Hi vonjd, I tried the new code and both the formula version and the data.frame version work fine.
Originally, when I proposed the data.frame version, I was really thinking about one that takes both an `x` and a `y` argument, rather than just an `x` with the assumption that the last column is the response. But this is purely stylistic and not an issue of right and wrong. I briefly checked other methods: those that take `x` and `y` include `glmnet`, `caret::train`, `randomForest`, `nnet`, and `lda`; those that take a single `x` include `(g)lm` and `rpart`.
So there is support for both types.
Thanks again.
@kohleth : Thank you very much for your feedback.
I pushed the new version 2.2 to CRAN and only got positive feedback so far, so everything seems to be alright.
Thank you again
best h
Hi there, thanks for the package! I recently applied it to my large-ish dataset (2k rows, 1k cols) and it was quite slow. I tracked the slowness down to 2 issues.

1. `split(data, data[ , iter])` can be quite slow when `data` is large, so I replaced it with `split(data[ , ncol(data)], data[ , iter])`, as we only need the response column for the split.
2. `sum(unlist(Map("==", real, majority)))` is quite slow again, so I replaced it with code that first forms the entire vector of predicted values (not in groups) and then compares it directly against `data[ , ncol(data)]`.

This new code took 0.7s on my dataset, while the old code took 66s (almost a 100x speed-up).
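If it helps to see the two replacements side by side, here is a small sketch on a made-up data frame (`data`, `iter`, and the majority computation are simplified stand-ins for the real function's internals, not the actual patch):

```r
# made-up data frame: two predictors, response in the last column
data <- data.frame(x1 = c("a", "a", "b", "b"),
                   x2 = c("u", "v", "u", "v"),
                   y  = c(1, 1, 2, 2))
iter <- 1

# (1) split only the response column, not the whole data frame
groups <- split(data[ , ncol(data)], data[ , iter])

# (2) majority class per group, expanded back into one full predicted vector,
#     so the comparison is a single vectorized ==, with no Map over groups
majority <- sapply(groups, function(g) names(which.max(table(g))))
pred <- majority[as.character(data[ , iter])]

sum(pred == data[ , ncol(data)])  # 4: every row matches its group majority
```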