I was trying to run a logistic regression with grouping columns using madlib.glm, and it was much slower than running it directly on the database. I wondered why, and after looking at the code and the activity on the database, it appears that populating the field "nobs" in the regression result triggers a separate count for each group, which can take a long time. Wouldn't it be better to simply use the fields "num_rows_processed" and "num_rows_skipped" from the MADlib output table?
I agree. The PivotalR code was written before MADlib returned those two numbers. We need to update the code to use them instead of computing the count on our own.