
Running Python Models in R

library(reticulate)

Prerequisites

For these methods to work, you will need to point to a Python executable in a Conda environment or Virtualenv that contains all the Python packages you need. You can do this with a .Rprofile file in your project directory. See the .Rprofile file in this project for how I have done this.
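For example, a minimal .Rprofile might look like the following. The path is a placeholder; substitute the Python executable from your own Conda environment or Virtualenv:

# .Rprofile: point reticulate at the desired Python before it initializes
# (the path below is hypothetical -- replace with your own environment's python)
Sys.setenv(RETICULATE_PYTHON = "~/miniconda3/envs/my_env/bin/python")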

Write Python functions to run on a data set in R

In the file python_functions.py I have written the required functions in Python to fit an XGBoost model on an arbitrary data set. We expect all the parameters for these functions to be passed in a single dict called parameters. I am now going to source these functions into R so they become R functions that expect equivalent data structures.

source_python("python_functions.py")
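Once sourced, the Python functions defined in python_functions.py are available as ordinary R functions in the global environment. A quick sanity check, using the function names that appear later in this walkthrough:

# these should all return TRUE after source_python() has run
exists("split_data")
exists("scale_data")
exists("train_xgb_crossvalidated")
exists("generate_classification_report")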

Example: Using XGBoost in R

We now use these Python functions on a prepared wine data set in R to learn to predict whether a wine is high quality.

First we download data sets for white wines and red wines.

white_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                        sep = ";")
red_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", 
                      sep = ";")

We will create ‘white versus red’ as a new feature, and we will define ‘High Quality’ to be a quality score of seven or more.

library(dplyr)

white_wines$red <- 0
red_wines$red <- 1

wine_data <- white_wines %>% 
  bind_rows(red_wines) %>% 
  mutate(high_quality = ifelse(quality >= 7, 1, 0)) %>% 
  select(-quality)

knitr::kable(head(wine_data))

| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | red | high_quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 0 | 0 |
| 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 0 | 0 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 | 0 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 0 | 0 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 0 | 0 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 | 0 |

Now we set our list of parameters (a named list in R converts to a dict in Python):

params <- list(
  input_cols = colnames(wine_data)[colnames(wine_data) != 'high_quality'],
  target_col = 'high_quality',
  test_size = 0.3,
  random_state = 123,
  subsample = (3:9)/10, 
  xgb_max_depth = 3:9,
  colsample_bytree = (3:9)/10,
  xgb_min_child_weight = 1:4,
  k = 3,
  k_shuffle = TRUE,
  n_iter = 10,
  scoring = 'f1',
  error_score = 0,
  verbose = 1,
  n_jobs = -1
)
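If you want to verify that conversion, reticulate can translate the R list explicitly. This step is not required, since the sourced functions convert their arguments automatically:

py_params <- reticulate::r_to_py(params)
class(py_params)   # should report a Python dict class, e.g. "python.builtin.dict"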

Now we are ready to run our XGBoost model with 3-fold cross validation. First we split the data:

split <- split_data(df = wine_data,  parameters = params)
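The element names of the returned list depend on python_functions.py; based on how they are used below, the split should contain training and test sets for both the features and the target:

names(split)   # expected: "X_train", "X_test", "y_train", "y_test" (inferred from later usage)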

This produces a list, which we can feed into our scaling function:

scaled <- scale_data(split$X_train, split$X_test)

Now we can run the XGBoost algorithm with the defined parameters on our training set:

trained <- train_xgb_crossvalidated(
  scaled$X_train_scaled,
  split$y_train,
  parameters = params
)

Finally we can generate a classification report on our test set:

report <- generate_classification_report(trained, scaled$X_test_scaled, split$y_test)

knitr::kable(report)

|              | precision | recall    | f1-score  |
|--------------|-----------|-----------|-----------|
| 0.0          | 0.8859915 | 0.9377407 | 0.9111319 |
| 1.0          | 0.6777409 | 0.5204082 | 0.5887446 |
| accuracy     | 0.8538462 | 0.8538462 | 0.8538462 |
| macro avg    | 0.7818662 | 0.7290744 | 0.7499382 |
| weighted avg | 0.8441278 | 0.8538462 | 0.8463238 |
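
Because trained is still a live Python object, you can also query it directly from R. The sketch below assumes the training function returns a fitted scikit-learn style cross-validated estimator (for example a RandomizedSearchCV object); the attribute and method names are assumptions, not something confirmed by this walkthrough:

# inspect the best hyperparameters found during the randomized search
# (assumes a scikit-learn style estimator exposing best_params_)
trained$best_params_

# generate predictions on the scaled test set
# (assumes a standard predict() method)
preds <- trained$predict(scaled$X_test_scaled)
head(preds)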