greenplum-db / PivotalR-archive

A convenient R tool for manipulating tables in PostgreSQL-type databases and a wrapper for Apache MADlib.
https://pivotalsoftware.github.io/gp-r/

Results from by processing limited to 100 rows #43

Open ashokrags opened 8 years ago

ashokrags commented 8 years ago

Hi, when I run a by processing in HAWQ using a custom function, I only get results for 100 rows. Any ideas as to why that occurs?

If I do something like this:

samples_tissue_mu_logcpm <- by(gtex_df[, "logCpm"], c(gtex_df$gene, gtex_df$Tissue_type), mean)

then I get the correct number of rows (112636). However, if I submit my own function or a non-standard function as below, I only get 100 results retrieved:

samples_tissue_logcpm_q_lo <- by(gtex_df[, "logCpm"], c(gtex_df$gene, gtex_df$Tissue_type), FUN=function(x) { y <- lookat(x, nrows=NULL); return(quantile(y, prob=0.25)) })

iyerr3 commented 8 years ago

lookat by default returns 100 rows; set nrows=-1 to get all rows.
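
For illustration, here is the call from the original post with that change applied (a rough sketch only, using the column names given above and not verified against the reporter's schema):

samples_tissue_logcpm_q_lo <- by(
  gtex_df[, "logCpm"],
  c(gtex_df$gene, gtex_df$Tissue_type),
  FUN = function(x) {
    y <- lookat(x, nrows = -1)       # -1 fetches every row of the group instead of the default 100
    return(quantile(y, prob = 0.25))
  })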

ashokrags commented 8 years ago

Thanks. So nrows=NULL will not work? It works under other circumstances. I will test with nrows=-1 and let you know if it works.

iyerr3 commented 8 years ago

I missed the NULL input in your call. That is supposed to work - I'll have to debug this further.

orhankislal commented 7 years ago

It seems to work as intended. I have tried the following commands to replicate the issue.

> lof2
[[1]]
Table : "abalone"
Database : madlib-gpdb43
Host : 127.0.0.1
Connection : 1

> lapply(lof2, FUN=function(x) { y <- lookat(x); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
102

> lapply(lof2, FUN=function(x) { y <- lookat(x, nrows=100); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
100

> lapply(lof2, FUN=function(x) { y <- lookat(x, nrows=NULL); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
1045

> lapply(lof2, FUN=function(x) { y <- lookat(x, nrows=-1); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
1045

> quantile(abalone$id, prob=0.25)
25%
1045

I also tried tapply (the function that by wraps):

> m <- matrix(c(db.abalone,db.abalone2,1,2), nrow=2)
> fac <- factor(rep(1:2, length = 2), levels = 1:2)
> tapply(m[,1], fac, FUN=function(x) { y <- lookat(x[[1]], nrows=100); return(quantile(y$id, probs=0.25 )) })
  1   2
102 101
> tapply(m[,1], fac, FUN=function(x) { y <- lookat(x[[1]], nrows=NULL); return(quantile(y$id, probs=0.25 )) })
   1    2
1045 1045
> tapply(m[,1], fac, FUN=function(x) { y <- lookat(x[[1]], nrows=-1); return(quantile(y$id, probs=0.25 )) })
   1    2
1045 1045

@ashokrags Do you have a by example that I can use to reproduce this error?

ashokrags commented 7 years ago

@orhankislal I have given a by example above. How many rows does the function return when you apply it over the entire data frame? For example, if you made fac have 1000 levels and then did a by processing so that you get a summary for each level, does it return 1000 rows? The built-in mean function does.

orhankislal commented 7 years ago

@ashokrags I saw the example you gave, but it is not clear what the gtex_df structure looks like. I recently started looking into PivotalR, but my understanding is that lookat requires a db.table-type object. That is why the matrix I created in my second example has multiple db.table objects (so that lookat can look at each one).
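
As a rough sketch of the setup this assumes (the connection details below are placeholders, not the reporter's environment), lookat operates on db.data.frame ("db.table") objects built over existing database tables:

library(PivotalR)
cid <- db.connect(host = "127.0.0.1", dbname = "madlib-gpdb43", port = 5432)   # placeholder connection details
db.abalone  <- db.data.frame("abalone", conn.id = cid)    # wraps the existing "abalone" table
db.abalone2 <- db.data.frame("abalone2", conn.id = cid)   # hypothetical second table, as in the matrix example above
lookat(db.abalone, nrows = 10)   # fetch 10 rows into a local data.frame
lookat(db.abalone, nrows = -1)   # fetch all rows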

fmcquillan99 commented 7 years ago

Seems to be working OK from our perspective, so @ashokrags please let us know if you are still having issues.

ashokrags commented 7 years ago

@orhankislal @fmcquillan99 The gtex_df is a table with 555 million rows and several columns, I think 6 or 7. All I want is an aggregate summary statistic by groups of a particular column (say one that has 300 different categories). So, like I mentioned before, if I use the built-in mean function it works, but when I use a custom function it only returns 100 rows. If it appears not to be an issue on your side, then I think it could be some time-out issue in what gets returned within the implementation on our side. Thanks for checking into this. I will reopen this issue if I can get any more information showing that it is an issue on the R side. Thanks a lot again for taking the time to look into this.
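
For anyone hitting the same symptom, a rough sanity-check sketch (the table and column names here are hypothetical stand-ins for the gtex_df setup described above, and by is assumed to return an ordinary R list for a custom FUN):

# Count the distinct categories directly in the database via plain SQL (hypothetical table/column names).
n_groups <- db.q("SELECT count(DISTINCT tissue_type) FROM gtex")

# Grouped 25th percentile with nrows = -1, as suggested earlier in the thread.
res <- by(gtex_df[, "logCpm"], c(gtex_df$Tissue_type),
          FUN = function(x) quantile(lookat(x, nrows = -1), prob = 0.25))

# If this is ~100 instead of n_groups, the truncation happens outside the custom function itself.
length(res)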