ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr
1.12k stars 79 forks source link

skimr against a database #358

Open CerebralMastication opened 5 years ago

CerebralMastication commented 5 years ago

Hey kind skimr folks. For the last year or so I've been pondering how nice it would be to run skim against my big ol' Redshift database. I suspect I'm not the only one who's thought about this. The general idea would be to have as much code as possible execute on the database using the magic of dbplyr.

Today I decided to spend some time trying to understand the skimr code base and think about what might be necessary in order to refactor the code into functions that can play with the limited subset of functions that dbplyr can execute in SQL.

I've done a quick pass through the code and before I started really digging into this I wanted to see if any of you wise folks had given this thought or maybe seen something else implemented elsewhere. It seems like a database friendly skimr would provide a lot of value, it's not a trivial exercise to refactor skimr.

Any input you all have would be much appreciated.

michaelquinn32 commented 5 years ago

Hi JD!

This is something that we could eventually support. It would be easier to do v2, which has a simpler codebase. I'm currently on pat leave, and I had hoped to spend some time on skimr, but it's going slowly. The last time I talked to @elinw, we were targeting a v2 release at the beginning of next year.

After that, we can start fleshing out features like this.

Thanks!

CerebralMastication commented 5 years ago

Congratulations on the new new Mini Mighty Quinn!

What I'm hearing you say is that if I want to hack around the edges on this I should be using the v2 branch :)

I'm going to also look at the code base for dbplot as some of the functions @edgararuiz has put in there may be helpful. In particular these:

db_compute_bins() - Returns a data frame with the bins and count per bin db_compute_count() - Returns a data frame with the count per discrete value db_compute_raster() - Returns a data frame with the results per x/y intersection db_compute_boxplot() - Returns a data frame with boxplot calculations

edgararuiz-zz commented 5 years ago

There may be some opportunities to expand on those functions by passing multiple calculations in one query instead of one per field in a given view or table. Let me know if you'd like me to lend a hand on this one.

CerebralMastication commented 5 years ago

There may be some opportunities to expand on those functions by passing multiple calculations in one query instead of one per field in a given view or table. Let me know if you'd like me to lend a hand on this one.

That's a grand idea Edgar. I think refactoring those to do multiple fields would be a helpful step. Any help is always appreciated. Plus I'll learn a few things along the way.

elinw commented 5 years ago

I just merged updates into the master and develop branches so that skimr will work with dplyr 0.8.0. I'm going to see how it goes with v2 but I'd suggest if you are working on this idea that you pull down the rc branch for dplyr and work from there.

elinw commented 5 years ago

Now the updates are in the v2 branch. So I am hoping to release a final v1 pretty soon and definitely before the new dplyr eta of February 1. That's also around the same time that I hope to release v2 which is looking pretty good.

elinw commented 5 years ago

I've been looking at this a tiny bit. Right now this works

# not in the data base
flights_db %>% select(flights_db$ops$vars) %>% skim()

# in the data base
flights_db %>%  db_compute_count(dep_delay) 

Because skim() in v2 is totally focused on data frames I'm wondering if we would want to create skim_db() that would work like skim() but instead of checking to see if data is a data frame, it would check to see if it is a data base or more specifically a dbplyr connection to a data base. Probably by checking for "tbl_sql" in the classes.

Then you'd want to have skimmers that are for databases, so limited to what can be done there.

I think we would likely have to do or require users to do some preprocessing. For example we'd need the variable names and the types. You can get those with a prior round trip to the database, and trying to make it as generic as possible (because supporting multiple databases would make doing queries that are too specific a problem).

db_head <- flights_db %>% head(0) %>% as.data.frame()
db_vars <- colnames(db_head)
db_classes <- lapply(db_head, class)

Then we use that with our skimmers to go do the work of calculating the statistics.

Then back to R to update the resulting data frame and on to the next. Then print. And that will replicate what happens in skim() but then we could try to make the performance in both versions better.

tedmoorman commented 4 years ago

I'd like to upvote this suggestion/issue, if such a thing is possible. This is a fabulous package. If it were more database compatible, the improvement in speed would make it that much better. For the particular database I'm using, the following snippet of code runs reasonably well, with an odbc database connection using dplyr, dplyr, etc.

my_db %>% select(<a few columns>) %>% distinct_all() %>% collect() %>% select_if(~!bit64::is.integer64(.)) %>% skim()

michaelquinn32 commented 4 years ago

Thanks!

V2 complicates this a bit. To get a collection of skimmer functions for a type of data in your data frame, we rely on S3. The function get_skimmers() is a generic, and each type of data returns a different method.

See here: https://github.com/ropensci/skimr/blob/966b8655ba521cfad5dd0a725685f1e27f21a96c/R/skim_with.R#L197-L200

And see here: https://github.com/ropensci/skimr/blob/master/R/get_skimmers.R

This was a particular design decision to make skimmer much more extensible. Instead of us having to create some sort of dispatch system for every possible class of data (like in v1), we can let other developers define their own methods.

If we're going to a DB, how do we get the types of columns that we are summarizing?