vonjd / OneR

This R package implements the One Rule (OneR) Machine Learning classification algorithm with enhancements for sophisticated handling of numeric data and missing values together with extensive diagnostic functions.

optimized code to work with large dataset. #3

Closed kohleth closed 7 years ago

kohleth commented 7 years ago

Hi there, thanks for the package! I recently applied it to my large-ish dataset (2k rows, 1k cols) and it was quite slow. I tracked the slowness down to 2 issues.

  1. split(data, data[,iter]) can be quite slow when data is large, so I replaced it with split(data[,ncol(data)],data[,iter]), as we only need the response column for the split.
  2. sum(unlist(Map("==", real, majority))) is also quite slow, so I replaced it with code that first forms the entire vector of predicted values (not in groups) and then compares it directly against data[,ncol(data)].

This new code took 0.7s on my dataset, while the old code took 66s (almost a 100x speed up).
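
For reference, here is a minimal sketch of what the two changes boil down to (illustrative only, not the exact code from the pull request; modal_value is a stand-in for the package's internal mode helper, and the response is assumed to be the last column of data):

modal_value <- function(x) names(which.max(table(x)))

target <- as.character(data[ , ncol(data)])
perf <- integer(ncol(data) - 1)
for (iter in seq_len(ncol(data) - 1)) {
  # 1. split only the response column instead of the whole data frame
  groups   <- split(target, data[ , iter])
  majority <- vapply(groups, modal_value, character(1))
  # 2. build the full vector of predictions once and compare it directly
  #    against the response (assumes factor/character attributes, which
  #    OneR expects anyway after binning)
  predicted  <- majority[as.character(data[ , iter])]
  perf[iter] <- sum(predicted == target)
}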

vonjd commented 7 years ago

Dear kohleth,

To be honest with you: a dream has come true. I was hoping to get some feedback on my code now that the package is gaining in popularity. Thank you so very much!

I was also thinking about some improvements lately:

  1. OneR: Finding which attribute is the best predictor can be sped up massively by just calculating the number of errors (nerrors) for each “tabled” attribute without splitting and Mapping at all (a small check on a toy data set follows after the snippets below), so for the main routine, instead of:
perf <- c()
for (iter in 1:(ncol(data) - 1)) {
  groups <- split(data, data[ , iter])
  majority <- lapply(groups, mode)
  real <- lapply(groups, function(x) x[ , ncol(x)])
  perf <- c(perf, sum(unlist(Map("==", real, majority))))
}

just:

nerrors <- function(tbl) {
  sum(rowSums(tbl) - apply(tbl, 1, max))
}

and

tables <- lapply(data[ , 1:(ncol(data)-1)], table, data[ , ncol(data)])
errors <- sapply(tables, nerrors)
perf <- nrow(data) - errors
  2. optbin, infogain_midpoint: a small improvement can be realized by minimizing after_entropies instead of maximizing infogains (entropy(target) is the same for every candidate midpoint, so both criteria select the same one), so instead of:
# calculate information gains and chose highest one
infogains <- entropy(target) - (below_entropies + above_entropies)
midpoints[which.max(infogains)]

the following:

# calculate entropies after split and choose lowest
after_entropies <- below_entropies + above_entropies
midpoints[which.min(after_entropies)]
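
As a quick plausibility check of the first point (a hypothetical toy example, not from the package itself): the row maxima of each contingency table are exactly the cases the majority rule gets right, so nrow(data) - errors reproduces the perf values of the old loop.

# toy data: one binned attribute, target in the last column
data <- data.frame(Sepal.Length = cut(iris$Sepal.Length, 3),
                   Species      = iris$Species)

tables <- lapply(data[ , 1:(ncol(data) - 1), drop = FALSE], table, data[ , ncol(data)])
errors <- sapply(tables, nerrors)
perf   <- nrow(data) - errors
perf   # correctly classified cases per attribute under the majority rule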

What do you think? Perhaps you want to try this out too with your large dataset? I would be very interested in the performance gain.

Do you see any other improvements?

I am not so well versed when it comes to using GitHub… do I have to do something concerning your pull request?

Thank you very much again!

best holger

kohleth commented 7 years ago

Great! Normally, if my code is useful, you can choose to merge it. But in this case, since you have even better code, you can just ignore this pull request (I will close it). It would be great, though, if you could push your latest code to the repository and then to CRAN.

The only other change (not an improvement) that I would hope for is for OneR to accept arguments x and y instead of a formula. Of course, you can always add a wrapper on top of it, so something like OneR.default(x, y, ...) and OneR(formula, ...). The reason for this is that it becomes more compatible with other caret methods, so it becomes easier to make a custom caret method with your package.

vonjd commented 7 years ago

Dear kohleth,

Thank you again!

Have you tried my new code with your data set? How fast does it run?

Your second point: I think I have an idea what you mean, but could you give me some more details or a reference on how to do it exactly?

Thank you for the very productive conversation.

best h

kohleth commented 7 years ago

Oh yes, my last test executing the entire function gives a median of 2.7s for my code and 1.8s for your new code.

With the second point, I don't really know a good reference. Perhaps the section titled 'Generic function and method dispatch' in here? I think ultimately there will be 3 functions: the generic OneR = function(x, ...) UseMethod("OneR"), one for formulas OneR.formula = function(form, ...) {the first 13 lines of the current OneR, plus a call to OneR.default}, and finally OneR.default = function(x, y, ...) {the rest of the current OneR}. This way, users can call OneR with either a formula argument or a data.frame argument. Perhaps the transpose function is a good example: base:::t, base:::t.data.frame, base:::t.default.
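
To illustrate, one possible layout of the three functions (placeholder bodies only, not the actual OneR internals; here the default method takes a data frame with the response in the last column, an x/y signature would work analogously):

OneR <- function(x, ...) UseMethod("OneR")

OneR.formula <- function(formula, data, ...) {
  # model.frame() puts the response first, so move it to the last column,
  # which is what the default method assumes
  mf <- model.frame(formula, data)
  OneR.default(mf[ , c(2:ncol(mf), 1)], ...)
}

OneR.default <- function(x, ...) {
  # x: a data frame whose last column is taken to be the response;
  # the actual OneR algorithm would go here
  structure(list(data = x), class = "OneR")
}

# both call forms then dispatch to the same workhorse:
# OneR(Species ~ ., data = iris)
# OneR(iris)   # plain data frame, response in the last column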

vonjd commented 7 years ago

Thank you, sounds good.

Just to be sure: You hopefully got the same result each time, so that my new code is not only fast, but also correct?

(of course I will conduct a lot of tests before I deploy it to CRAN.)

kohleth commented 7 years ago

Yes, I checked the results. From the OneR function's perspective, the results are the same. But the code snippets actually produced slightly different results -- same numerical values, but the original snippet returns a numeric perf, whereas the new one returns an integer perf.


vonjd commented 7 years ago

Very interesting observation. I will check whether this could pose problems later on in the code (but it does not seem critical at the moment).

vonjd commented 7 years ago

Just pushed the new optimized version to GitHub... it would be great if you could again try this version with your big dataset - thank you!

kohleth commented 7 years ago

Hi vonjd,

just tried the GitHub code. It certainly worked fine.


vonjd commented 7 years ago

Great!

Now I have added the S3 method interface to the OneR and optbin functions for formulas and data frames as suggested. If you could have a look at them that would be very helpful :-)

After additional tests I will push the new version 2.2 to CRAN.

kohleth commented 7 years ago

Hi vonjd, I tried the new code and both the formula version and the data.frame version work fine.

Originally, when I proposed the data.frame version, I was really thinking about one that takes both an X and a y argument, rather than just an X and assuming the last column is the response. But this is purely stylistic and not an issue of right and wrong. I briefly checked other methods: those that take X and y include glmnet, caret::train, randomForest, nnet, and lda; those that take a single X include (g)lm and rpart.

So there is support for both types.
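
A hypothetical X/y-style entry point (not part of the package, just to illustrate the difference) could simply rebuild the data frame the existing method expects, with the response bound on as the last column:

OneR_xy <- function(x, y, ...) {
  stopifnot(nrow(x) == length(y))
  OneR(data.frame(x, .response = y), ...)   # response ends up in the last column
}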

Thanks again.


vonjd commented 7 years ago

@kohleth : Thank you very much for your feedback.

I pushed the new version 2.2 to CRAN and have only gotten positive feedback so far, so everything seems to be alright.

Thank you again

best h