keithmcnulty / r_and_py_models

How to run Python ML models in R

Error Invalid Argument #1

Open aegerton opened 3 years ago

aegerton commented 3 years ago

Thank you for another excellent example of combining R & Python functionality @keithmcnulty. In trying to reproduce the example, I have chosen to set up a virtualenv using the following code:

library(reticulate)
virtualenv_create("NSS1", python = "/usr/bin/python3")
virtualenv_install('NSS1', c('pandas', 'numpy', 'xlcompose', 'matplotlib', 'sklearn', 'xgboost'))
use_virtualenv('NSS1', required = T)

This, together with sourcing python_functions.py, seems to load all the Python functions into the environment, and the first two (split_data and scale_data) split and scale the datasets satisfactorily. However, I am getting the following error when running the train_xgb_crossvalidated function:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  OSError: [Errno 22] Invalid argument

Any idea what the problem might be, please?

keithmcnulty commented 3 years ago

It's hard to tell from what you've posted here; I would need to see all your intermediate steps. Can you post the precise code you are running and the outputs you are receiving?

keithmcnulty commented 3 years ago

One thought I had is that you don’t seem to have installed scipy. This is needed for some of the distribution parameters in train_xgb_crossvalidated.
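
If it helps, here is a minimal sketch (assuming the same "NSS1" virtualenv name as in your snippet) of adding scipy to the environment and confirming that reticulate can actually see it:

library(reticulate)
virtualenv_install("NSS1", "scipy")
use_virtualenv("NSS1", required = TRUE)
py_module_available("scipy")  # should return TRUE before sourcing the Python functions

If that last call returns FALSE, the script is being sourced against a different interpreter than the one the packages were installed into.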

aegerton commented 3 years ago

Thanks @keithmcnulty for the prompt and helpful response. It is greatly appreciated. I have now installed scipy, but this doesn't resolve the error. Here's the full R code:

library(reticulate)
library(dplyr)
virtualenv_create("NSS1", python = "/usr/bin/python3")
virtualenv_install('NSS1', c('pandas', 'numpy', 'xlcompose', 'matplotlib', 'sklearn', 'xgboost', 'scipy'))
use_virtualenv('NSS1', required = T)
source_python("python_functions.py")
white_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep = ";")
red_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep = ";")
white_wines$red <- 0
red_wines$red <- 1
wine_data <- white_wines %>% 
  bind_rows(red_wines) %>% 
  mutate(high_quality = ifelse(quality >= 7, 1, 0)) %>% 
  select(-quality)
params <- list(
  input_cols = colnames(wine_data)[colnames(wine_data) != 'high_quality'],
  target_col = 'high_quality',
  test_size = 0.3,
  random_state = 123,
  subsample = (3:9)/10, 
  xgb_max_depth = 3:9,
  colsample_bytree = (3:9)/10,
  xgb_min_child_weight = 1:4,
  k = 3,
  k_shuffle = TRUE,
  n_iter = 10,
  scoring = 'f1',
  error_score = 0,
  verbose = 1,
  n_jobs = -1
)
split <- split_data(df = wine_data,  parameters = params)
scaled <- scale_data(split$X_train, split$X_test)
trained <- train_xgb_crossvalidated(
  scaled$X_train_scaled,
  split$y_train,
  parameters = params
)
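
(For completeness, the final step from the example, which I have not reached because of the error, would presumably be:)

report <- generate_classification_report(
  trained,
  scaled$X_test_scaled,
  split$y_test
)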

The Python code is a straight copy of your script:

import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# split data into train and test sets
def split_data(df: pd.DataFrame, parameters: dict) -> dict:
    """
    Split and select data for modeling
    :param df: Pandas Dataframe
    :param parameters: split parameters
    :return: dict of train/test Pandas DataFrames
    """
    X = df[parameters["input_cols"]]
    y = df[parameters["target_col"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=parameters["test_size"],
                                                        random_state=int(parameters["random_state"]))
    return dict(X_train = X_train, X_test = X_test, y_train = y_train, y_test = y_test)

# scale data
def scale_data(X_train: pd.DataFrame, X_test: pd.DataFrame) -> dict:
    """
    Scale data for modelling
    :param X_train: Pandas DataFrame
    :param X_test: Pandas DataFrame
    :return: List of Pandas DataFrames
    """
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
    X_test_scaled = pd.DataFrame(scaler.fit_transform(X_test), columns=X_test.columns)
    return dict(X_train_scaled = X_train_scaled, X_test_scaled = X_test_scaled)

# XGB train CV
def train_xgb_crossvalidated(
        X_train: pd.DataFrame, y_train:pd.DataFrame, parameters: dict
) -> XGBClassifier:
    """
    Train crossvalidated XGB Classifier
    :param X_train: Pandas DataFrame
    :param y_train: Pandas DataFrame
    :param parameters: Parameters for cross validation
    :return: Fitted cross-validated XGBClassifier (wrapped in RandomizedSearchCV)
    """
    param_dist = {'n_estimators': stats.randint(1, 100),
                  'learning_rate': stats.uniform(0.01, 0.6),
                  'subsample': parameters['subsample'],
                  'max_depth': parameters['xgb_max_depth'],
                  'colsample_bytree': parameters['colsample_bytree'],
                  'min_child_weight': parameters['xgb_min_child_weight']
                  }
    kfold = KFold(n_splits=int(parameters['k']),
                  shuffle=parameters['k_shuffle'],
                  random_state=int(parameters['random_state']))
    xgbmodel = XGBClassifier(use_label_encoder=False)
    xgb_clf = RandomizedSearchCV(xgbmodel, param_distributions=param_dist,
                                 n_iter=int(parameters['n_iter']), scoring=parameters['scoring'],
                                 error_score=parameters['error_score'], verbose=int(parameters['verbose']),
                                 n_jobs=int(parameters['n_jobs']), cv=kfold, 
                                 random_state=int(parameters['random_state']))
    xgb_clf.fit(X_train, y_train.values.ravel())
    return xgb_clf

# generate classification report
def generate_classification_report(model, X_test: pd.DataFrame, y_test: pd.DataFrame) -> pd.DataFrame:
    """
    Generate classification report for model
    :param model: model object
    :param X_test: Pandas DataFrame
    :param y_test: Pandas Dataframe of test target values
    :return: Classification Report
    """
    y_pred = model.predict(X_test)
    return pd.DataFrame(classification_report(y_test, y_pred, output_dict = True)).transpose().drop('support', axis = 1)

The split and scaled lists are created in R just fine, but when the train_xgb_crossvalidated function is run, the following error message and traceback result:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  OSError: [Errno 22] Invalid argument

Detailed traceback: 
  File "<string>", line 64, in train_xgb_crossvalidated
  File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\model_selection\_search.py", line 841, in fit
    self._run_search(evaluate_candidates)
  File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\model_selection\_search.py", line 1619, in _run_search
    evaluate_candidates(ParameterSampler(
  File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\model_selection\_search.py", line 795, in evaluate_candidates
    out = parallel(delayed(_fit_and_score)(clone(base_estimator),
  File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "D:\OneDrive\Documents\.virtualenvs

Here is the sessionInfo() output:

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_Singapore.1252 
[2] LC_CTYPE=English_Singapore.1252   
[3] LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=English_Singapore.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] dplyr_1.0.5     reticulate_1.18 devtools_2.3.2  usethis_2.0.1  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        compiler_4.0.5    pillar_1.5.1     
 [4] prettyunits_1.1.1 remotes_2.2.0     tools_4.0.5      
 [7] testthat_3.0.2    pkgbuild_1.2.0    pkgload_1.2.0    
[10] jsonlite_1.7.2    memoise_2.0.0     lifecycle_1.0.0  
[13] tibble_3.1.0      lattice_0.20-41   pkgconfig_2.0.3  
[16] rlang_0.4.10      Matrix_1.3-2      DBI_1.1.1        
[19] cli_2.3.1         xfun_0.21         fastmap_1.1.0    
[22] withr_2.4.1       knitr_1.31        desc_1.2.0       
[25] generics_0.1.0    fs_1.5.0          vctrs_0.3.6      
[28] tidyselect_1.1.0  rprojroot_2.0.2   grid_4.0.5       
[31] glue_1.4.2        R6_2.5.0          processx_3.4.5   
[34] fansi_0.4.2       sessioninfo_1.1.1 callr_3.5.1      
[37] purrr_0.3.4       magrittr_2.0.1    ps_1.5.0         
[40] ellipsis_0.3.1    assertthat_0.2.1  utf8_1.2.1       
[43] cachem_1.0.4      crayon_1.4.1     
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Apologies for dumping all of this on you. My Python knowledge is extremely basic, but I am keen to explore this ability to combine both languages.

What stupid mistake am I making, please?

keithmcnulty commented 3 years ago

I can't be sure, but that specific error is associated with problems with file paths. I see the traceback seems to be referencing OneDrive paths, but your earlier code has a Linux/macOS path to Python (/usr/bin/python3). Generally, with Windows paths, backslashes need to be escaped when used in Python. What OS are you using, and why is your virtualenv in a OneDrive path?
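
If you are on Windows, here is a minimal sketch of what I would try first (the paths are illustrative only, not taken from your setup): point reticulate at the Windows Python executable using forward slashes (or doubled backslashes), recreate the virtualenv, and confirm what reticulate actually resolved:

library(reticulate)

# Illustrative path only - substitute the location of your Windows Python
virtualenv_create("NSS1", python = "C:/Python39/python.exe")
use_virtualenv("NSS1", required = TRUE)
py_config()  # shows the interpreter and virtualenv reticulate is bound to

Forward slashes work in R on Windows and avoid the backslash-escaping problem when the path is handed over to Python.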