
Running Python Models in R

library(reticulate)

Prerequisites

For these methods to work, you will need to point to a Python executable in a Conda environment or Virtualenv that contains all the Python packages you need. You can do this with a .Rprofile file in your project directory. See the .Rprofile file in this project for how I have done this.
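For example, a minimal .Rprofile might look like the following. The path is a placeholder; substitute the Python executable from your own Conda environment or Virtualenv:

# .Rprofile: point reticulate at the desired Python before it initializes
# (the path below is hypothetical -- replace with your own environment's python)
Sys.setenv(RETICULATE_PYTHON = "~/miniconda3/envs/my_env/bin/python")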

Write Python functions to run on a data set in R

In the file python_functions.py I have written the required functions in Python to fit an XGBoost model on an arbitrary data set. We expect all the parameters for these functions to be passed in a single dict called parameters. I am now going to source these functions into R so they become R functions that expect equivalent data structures.

source_python("python_functions.py")
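Once sourced, the Python functions defined in python_functions.py are available as ordinary R functions in the global environment. A quick sanity check, using the function names that appear later in this walkthrough:

# these should all return TRUE after source_python() has run
exists("split_data")
exists("scale_data")
exists("train_xgb_crossvalidated")
exists("generate_classification_report")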

Example: Using XGBoost in R

We now use these Python functions on a prepared wine data set in R to learn to predict whether a wine is high quality.

First we download data sets for white wines and red wines.

white_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
                        sep = ";")
red_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", 
                      sep = ";")

We will create ‘white versus red’ as a new feature, and we will define ‘High Quality’ to be a quality score of seven or more.

library(dplyr)

white_wines$red <- 0
red_wines$red <- 1

wine_data <- white_wines %>% 
  bind_rows(red_wines) %>% 
  mutate(high_quality = ifelse(quality >= 7, 1, 0)) %>% 
  select(-quality)

knitr::kable(head(wine_data))

| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | red | high_quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 0 | 0 |
| 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 0 | 0 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 | 0 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 0 | 0 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 0 | 0 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 | 0 |

Now we set our list of parameters (a named list in R converts to a dict in Python):

params <- list(
  input_cols = colnames(wine_data)[colnames(wine_data) != 'high_quality'],
  target_col = 'high_quality',
  test_size = 0.3,
  random_state = 123,
  subsample = (3:9)/10, 
  xgb_max_depth = 3:9,
  colsample_bytree = (3:9)/10,
  xgb_min_child_weight = 1:4,
  k = 3,
  k_shuffle = TRUE,
  n_iter = 10,
  scoring = 'f1',
  error_score = 0,
  verbose = 1,
  n_jobs = -1
)
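If you want to verify that conversion, reticulate can translate the R list explicitly. This step is not required, since the sourced functions convert their arguments automatically:

py_params <- reticulate::r_to_py(params)
class(py_params)   # should report a Python dict class, e.g. "python.builtin.dict"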

Now we are ready to run our XGBoost model with 3-fold cross validation. First we split the data:

split <- split_data(df = wine_data,  parameters = params)
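The element names of the returned list depend on python_functions.py; based on how they are used below, the split should contain training and test sets for both the features and the target:

names(split)   # expected: "X_train", "X_test", "y_train", "y_test" (inferred from later usage)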

This produces a list, which we can feed into our scaling function:

scaled <- scale_data(split$X_train, split$X_test)

Now we can run the XGBoost algorithm with the defined parameters on our training set:

trained <- train_xgb_crossvalidated(
  scaled$X_train_scaled,
  split$y_train,
  parameters = params
)

Finally we can generate a classification report on our test set:

report <- generate_classification_report(trained, scaled$X_test_scaled, split$y_test)

knitr::kable(report)

|              | precision | recall    | f1-score  |
|--------------|-----------|-----------|-----------|
| 0.0          | 0.8859915 | 0.9377407 | 0.9111319 |
| 1.0          | 0.6777409 | 0.5204082 | 0.5887446 |
| accuracy     | 0.8538462 | 0.8538462 | 0.8538462 |
| macro avg    | 0.7818662 | 0.7290744 | 0.7499382 |
| weighted avg | 0.8441278 | 0.8538462 | 0.8463238 |
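
Because trained is still a live Python object, you can also query it directly from R. The sketch below assumes the training function returns a fitted scikit-learn style cross-validated estimator (for example a RandomizedSearchCV object); the attribute and method names are assumptions, not something confirmed by this walkthrough:

# inspect the best hyperparameters found during the randomized search
# (assumes a scikit-learn style estimator exposing best_params_)
trained$best_params_

# generate predictions on the scaled test set
# (assumes a standard predict() method)
preds <- trained$predict(scaled$X_test_scaled)
head(preds)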