I was trying to run a logistic regression with grouping columns using madlib.glm, and it was much slower than running it directly on the database. I wondered why, and after looking at the code and the activity on the database, it appears that populating the field "nobs" in the regression result triggers a separate count for each group, which can take a long time. Wouldn't it be better to simply use the fields "num_rows_processed" and "num_rows_skipped" from the MADlib output table?
I agree. The PivotalR code was written before MADlib returned those two numbers. We need to update the code to use them instead of computing the count on our own.