tidymodels / tidypredict

Run predictions inside the database
https://tidypredict.tidymodels.org
Other
258 stars 31 forks source link

[Feature Request] case statements needed for grouped regression #26

Open alexhallam opened 6 years ago

alexhallam commented 6 years ago

I use grouped regressions often (aka Many Models). Though your package makes it mostly painless to create grouped model sql code, maybe it would be nice to have a convenience function to do this.

Hopefully the code below illustrates what I am thinking.

# load libs
library(tidyverse)
library(tidypredict)
library(gapminder)

# make model function
country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

# nest data for grouped regression
by_country <- gapminder %>% 
  group_by(country, continent) %>% 
  nest() %>% 
  mutate(model = map(data, country_model)) %>% 
  # use tidypredict to make sql strings by model
  mutate(sql_txt = map(model,~tidypredict_sql(., dbplyr::simulate_mssql()))) %>% 
  # add some case statements -- more work needed here
  mutate(case_string = paste0("WHEN COUNTRY = ", "'", country, "'", " THEN ")) %>% 
  # paste case statements to model sql
  mutate(full_string = paste0(case_string, sql_txt))

# paste all of the models together with case statements and return one big glob
# of text for sql
by_country %>% 
  unnest(full_string) %>% 
  summarise(full_sql_text = toString(full_string)) %>% 
  as.list()

The first few entries are shown below. I am sure there are many ways to implement this in a cleaner fashion than what I did.

WHEN COUNTRY = 'Afghanistan' THEN -507.534271561773 + (`year`) * (0.275328671328672), 
WHEN COUNTRY = 'Albania' THEN -594.072510955711 + (`year`) * (0.334683216783217),
WHEN COUNTRY = 'Algeria' THEN -1067.85903962704 + (`year`) * (0.569279720279723)
...

Would it be possible to add some convenience function to do some lifting for building case statements needed for grouped regression?

edgararuiz-zz commented 6 years ago

Hi Alex, thanks for the suggestion, let me take a look and see what we could do via the package.

edgararuiz-zz commented 6 years ago

Actually, I have another package called modeldb that is able to fit models inside databases, and it also respects grouped data:

library(tidyverse)
library(gapminder)
library(modeldb)
gapminder %>%
  group_by(country) %>%
  select(lifeExp, year) %>%
  linear_regression_db(lifeExp)
#> Adding missing grouping variables: `country`
#> # A tibble: 142 x 3
#>    country     `(Intercept)`  year
#>    <fct>               <dbl> <dbl>
#>  1 Afghanistan         -508. 0.275
#>  2 Albania             -594. 0.335
#>  3 Algeria            -1068. 0.569
#>  4 Angola              -377. 0.209
#>  5 Argentina           -390. 0.232
#>  6 Australia           -376. 0.228
#>  7 Austria             -406. 0.242
#>  8 Bahrain             -860. 0.468
#>  9 Bangladesh          -936. 0.498
#> 10 Belgium             -340. 0.209
#> # ... with 132 more rows

This package works with tidypredict but at this time, only with un-grouped data. If modeldb fits the models that you ultimately need, maybe we should focus on adding grouped model capability in tidypredict but based on the output from modeldb.

Thanks!

alexhallam commented 6 years ago

Does this translate to sql text?

edgararuiz-zz commented 6 years ago

Not yet, that's the portion I need to code, but wanted first to find out if it would be useful for anyone.