CerebralMastication opened this issue 5 years ago
Hi JD!
This is something that we could eventually support. It would be easier to do in v2, which has a simpler codebase. I'm currently on pat leave, and I had hoped to spend some time on skimr, but it's going slowly. The last time I talked to @elinw, we were targeting a v2 release at the beginning of next year.
After that, we can start fleshing out features like this.
Thanks!
Congratulations on the new Mini Mighty Quinn!
What I'm hearing you say is that if I want to hack around the edges on this, I should be using the v2 branch :)
I'm also going to look at the code base for dbplot, as some of the functions @edgararuiz has put in there may be helpful. In particular these:

- `db_compute_bins()` returns a data frame with the bins and count per bin
- `db_compute_count()` returns a data frame with the count per discrete value
- `db_compute_raster()` returns a data frame with the results per x/y intersection
- `db_compute_boxplot()` returns a data frame with boxplot calculations
There may be some opportunities to expand on those functions by passing multiple calculations in one query instead of one per field in a given view or table. Let me know if you'd like me to lend a hand on this one.
> There may be some opportunities to expand on those functions by passing multiple calculations in one query instead of one per field in a given view or table. Let me know if you'd like me to lend a hand on this one.
That's a grand idea Edgar. I think refactoring those to do multiple fields would be a helpful step. Any help is always appreciated. Plus I'll learn a few things along the way.
I just merged updates into the master and develop branches so that skimr will work with dplyr 0.8.0. I'm going to see how it goes with v2, but if you are working on this idea, I'd suggest you pull down the rc branch for dplyr and work from there.
Now the updates are in the v2 branch. I am hoping to release a final v1 pretty soon, and definitely before the new dplyr ETA of February 1. That's also around the same time that I hope to release v2, which is looking pretty good.
I've been looking at this a tiny bit. Right now this works:

```r
# not in the database
flights_db %>% select(flights_db$ops$vars) %>% skim()

# in the database
flights_db %>% db_compute_count(dep_delay)
```
Because `skim()` in v2 is totally focused on data frames, I'm wondering if we would want to create a `skim_db()` that would work like `skim()` but, instead of checking whether the data is a data frame, would check whether it is a database, or more specifically a dbplyr connection to a database, probably by checking for `"tbl_sql"` in the classes.

Then you'd want skimmers that are made for databases, limited to what can be done there.
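To make the dispatch idea concrete, here's a minimal sketch. `skim_db()` is hypothetical and not part of skimr; the only real piece is the class check, since dbplyr lazy tables carry `"tbl_sql"` in their class vector:

```r
# Hypothetical sketch: dispatch on whether the input is a dbplyr lazy
# table ("tbl_sql") or an ordinary in-memory object. The string returns
# stand in for the real skimming work.
skim_db <- function(data, ...) {
  if (inherits(data, "tbl_sql")) {
    # database path: run only skimmers that dbplyr can translate to SQL
    "database path"
  } else {
    # in-memory path: defer to the regular skimr::skim()
    "in-memory path"
  }
}
```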
I think we would likely have to do some preprocessing, or require users to do it. For example, we'd need the variable names and the types. You can get those with a prior round trip to the database, keeping it as generic as possible (because supporting multiple databases makes queries that are too specific a problem). For example:
```r
db_head <- flights_db %>% head(0) %>% as.data.frame()
db_vars <- colnames(db_head)
db_classes <- lapply(db_head, class)
```
Then we use that with our skimmers to do the work of calculating the statistics. Then back to R to update the resulting data frame, and on to the next. Then print. That will replicate what happens in `skim()`, and then we could try to make the performance in both versions better.
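The in-database half of that round trip can be sketched with dplyr verbs that dbplyr knows how to translate. The packages (RSQLite for a throwaway in-memory database) and the column names are assumptions for illustration, not part of skimr:

```r
# Sketch: compute several summary statistics for multiple columns in a
# single query, using only verbs dbplyr can translate to SQL.
# Assumes dplyr, dbplyr, and RSQLite are installed.
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "flights",
                  data.frame(dep_delay = c(1, 5, NA),
                             arr_delay = c(2, 4, 6)))
flights_db <- tbl(con, "flights")

stats <- flights_db %>%
  summarise(
    # is.na() becomes IS NULL; ifelse() becomes CASE WHEN in SQL
    dep_delay_n_missing = sum(ifelse(is.na(dep_delay), 1L, 0L), na.rm = TRUE),
    dep_delay_mean      = mean(dep_delay, na.rm = TRUE),  # AVG() skips NULLs
    arr_delay_mean      = mean(arr_delay, na.rm = TRUE)
  ) %>%
  collect()  # one round trip brings back a one-row data frame

DBI::dbDisconnect(con)
```

Packing all the per-column statistics into one `summarise()` is what makes this a single query rather than one query per field.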
I'd like to upvote this suggestion/issue, if such a thing is possible. This is a fabulous package. If it were more database compatible, the improvement in speed would make it that much better. For the particular database I'm using, the following snippet runs reasonably well over an odbc connection using dplyr, dbplyr, etc.

```r
my_db %>%
  select(<a few columns>) %>%
  distinct_all() %>%
  collect() %>%
  select_if(~ !bit64::is.integer64(.)) %>%
  skim()
```

Thanks!
V2 complicates this a bit. To get a collection of skimmer functions for a type of data in your data frame, we rely on S3. The function `get_skimmers()` is a generic, and each type of data gets a different method.

See here: https://github.com/ropensci/skimr/blob/master/R/get_skimmers.R

This was a particular design decision to make skimr much more extensible. Instead of us having to create some sort of dispatch system for every possible class of data (like in v1), we can let other developers define their own methods.
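As an illustration of that extension point, a package (or a future database backend) could register its own method on the generic. The `banded` class and the statistic below are made up for the example; `sfl()` is skimr v2's helper for bundling skimming functions:

```r
# Sketch of extending skimr v2 through the get_skimmers() S3 generic.
# The "banded" class is hypothetical. Assumes skimr (>= 2.0) is installed.
library(skimr)

get_skimmers.banded <- function(column) {
  sfl(
    skim_type = "banded",
    n_bands   = ~ length(unique(.))  # one custom summary statistic
  )
}
```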
If we're going to a DB, how do we get the types of columns that we are summarizing?
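One possible answer, sketched with DBI directly: a zero-row query returns column metadata without moving any data, which complements the `head(0)` trick above. The table and column names here are illustrative:

```r
# Fetch column names and driver-level types with a zero-row query.
# Assumes DBI and RSQLite are installed; "flights" is an example table.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(dep_delay = 1.5, origin = "EWR"))

res  <- dbSendQuery(con, "SELECT * FROM flights LIMIT 0")
info <- dbColumnInfo(res)  # one row per column: name plus the driver's type
dbClearResult(res)
dbDisconnect(con)
```

The reported types are whatever the driver exposes, so mapping them onto skimmer types would still need some per-backend care.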
Hey kind skimr folks. For the last year or so I've been pondering how nice it would be to run `skim` against my big ol' Redshift database. I suspect I'm not the only one who's thought about this. The general idea would be to have as much code as possible execute on the database using the magic of `dbplyr`.

Today I decided to spend some time trying to understand the `skimr` code base and think about what might be necessary in order to refactor the code into functions that can play with the limited subset of functions that `dbplyr` can execute in SQL.

I've done a quick pass through the code, and before I started really digging into this I wanted to see if any of you wise folks had given this thought or maybe seen something else implemented elsewhere. It seems like a database-friendly `skimr` would provide a lot of value, but it's not a trivial exercise to refactor `skimr`. Any input you all have would be much appreciated.