Open aegerton opened 3 years ago
It's hard to tell from what you've posted here - would need to see all your intermediate steps - can you post the precise code you are running and the outputs you are receiving?
One thought I had is that you don't seem to have installed scipy. This is needed for some of the distribution parameters in train_xgb_crossvalidated.
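A quick way to confirm which packages the active interpreter can actually see is to test the imports directly. The sketch below is only an illustration: the helper name missing_modules and the REQUIRED list are assumptions based on the imports the example script appears to need, not part of the original code.

```python
import importlib.util
import sys

# Import names (not pip names) that the example script appears to need.
# This list is an assumption for illustration; adjust it to your own script.
REQUIRED = ["pandas", "scipy", "sklearn", "xgboost"]


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Shows which Python is really in use, which matters when several are installed.
    print("Interpreter:", sys.executable)
    print("Missing:", missing_modules(REQUIRED) or "none")
```

From R, reticulate::py_module_available("scipy") performs the same check for a single module.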
Thanks @keithmcnulty for the prompt and helpful response. It is greatly appreciated. I have now installed scipy, but this doesn't resolve the error. Here's the full R code;-
library(reticulate)
library(dplyr)
virtualenv_create("NSS1", python = "/usr/bin/python3")
virtualenv_install('NSS1', c('pandas', 'numpy', 'xlcompose', 'matplotlib', 'sklearn', 'xgboost', 'scipy'))
use_virtualenv('NSS1', required = T)
source_python("python_functions.py")
white_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep = ";")
red_wines <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep = ";")
white_wines$red <- 0
red_wines$red <- 1
wine_data <- white_wines %>%
  bind_rows(red_wines) %>%
  mutate(high_quality = ifelse(quality >= 7, 1, 0)) %>%
  select(-quality)
params <- list(
  input_cols = colnames(wine_data)[colnames(wine_data) != 'high_quality'],
  target_col = 'high_quality',
  test_size = 0.3,
  random_state = 123,
  subsample = (3:9)/10,
  xgb_max_depth = 3:9,
  colsample_bytree = (3:9)/10,
  xgb_min_child_weight = 1:4,
  k = 3,
  k_shuffle = TRUE,
  n_iter = 10,
  scoring = 'f1',
  error_score = 0,
  verbose = 1,
  n_jobs = -1
)
split <- split_data(df = wine_data, parameters = params)
scaled <- scale_data(split$X_train, split$X_test)
trained <- train_xgb_crossvalidated(
  scaled$X_train_scaled,
  split$y_train,
  parameters = params
)
The Python code is a straight copy of your script;-
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier
# split data into train test
def split_data(df: pd.DataFrame, parameters: dict) -> dict:
    """
    Split and select data for modeling
    :param df: Pandas Dataframe
    :param parameters: split parameters
    :return: dict of train and test splits (X_train, X_test, y_train, y_test)
    """
    X = df[parameters["input_cols"]]
    y = df[parameters["target_col"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=parameters["test_size"],
                                                        random_state=int(parameters["random_state"]))
    return dict(X_train = X_train, X_test = X_test, y_train = y_train, y_test = y_test)
# scale data
def scale_data(X_train: pd.DataFrame, X_test: pd.DataFrame) -> dict:
    """
    Scale data for modelling
    :param X_train: Pandas DataFrame
    :param X_test: Pandas DataFrame
    :return: dict of scaled Pandas DataFrames (X_train_scaled, X_test_scaled)
    """
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
    X_test_scaled = pd.DataFrame(scaler.fit_transform(X_test), columns=X_test.columns)
    return dict(X_train_scaled = X_train_scaled, X_test_scaled = X_test_scaled)
# XGB train CV
def train_xgb_crossvalidated(
    X_train: pd.DataFrame, y_train:pd.DataFrame, parameters: dict
) -> XGBClassifier:
    """
    Train crossvalidated XGB Classifier
    :param X_train: Pandas DataFrame
    :param y_train: Pandas DataFrame
    :param parameters: Parameters for cross validation
    :return: Fitted RandomizedSearchCV wrapping an XGBClassifier
    """
    param_dist = {'n_estimators': stats.randint(1, 100),
                  'learning_rate': stats.uniform(0.01, 0.6),
                  'subsample': parameters['subsample'],
                  'max_depth': parameters['xgb_max_depth'],
                  'colsample_bytree': parameters['colsample_bytree'],
                  'min_child_weight': parameters['xgb_min_child_weight']
                  }
    kfold = KFold(n_splits=int(parameters['k']),
                  shuffle=parameters['k_shuffle'],
                  random_state=int(parameters['random_state']))
    xgbmodel = XGBClassifier(use_label_encoder=False)
    xgb_clf = RandomizedSearchCV(xgbmodel, param_distributions=param_dist,
                                 n_iter=int(parameters['n_iter']), scoring=parameters['scoring'],
                                 error_score=parameters['error_score'], verbose=int(parameters['verbose']),
                                 n_jobs=int(parameters['n_jobs']), cv=kfold,
                                 random_state=int(parameters['random_state']))
    xgb_clf.fit(X_train, y_train.values.ravel())
    return xgb_clf
# generate classification report
def generate_classification_report(model, X_test: pd.DataFrame, y_test: pd.DataFrame) -> pd.DataFrame:
    """
    Generate classification report for model
    :param model: model object
    :param X_test: Pandas DataFrame
    :param y_test: Pandas Dataframe of test target values
    :return: Classification Report
    """
    y_pred = model.predict(X_test)
    return pd.DataFrame(classification_report(y_test, y_pred, output_dict = True)).transpose().drop('support', axis = 1)
The split and scaled lists are created in R just fine, but when the train_xgb_crossvalidated function is run, the following error message with traceback results;-
Error in py_call_impl(callable, dots$args, dots$keywords) :
OSError: [Errno 22] Invalid argument
Detailed traceback:
File "<string>", line 64, in train_xgb_crossvalidated
File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\model_selection\_search.py", line 841, in fit
self._run_search(evaluate_candidates)
File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\model_selection\_search.py", line 1619, in _run_search
evaluate_candidates(ParameterSampler(
File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\sklearn\model_selection\_search.py", line 795, in evaluate_candidates
out = parallel(delayed(_fit_and_score)(clone(base_estimator),
File "D:\OneDrive\Documents\.virtualenvs\NSS1\lib\site-packages\joblib\parallel.py", line 1041, in __call__
if self.dispatch_one_batch(iterator):
File "D:\OneDrive\Documents\.virtualenvs
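The traceback above passes through joblib's parallel.py before it is cut off, and n_jobs = -1 asks scikit-learn to run the fits in parallel. One way to narrow things down, offered as a guess rather than a confirmed diagnosis, is to rerun the search serially and see whether the error persists. A minimal sketch, assuming a hypothetical helper with_serial_jobs that is not part of the original script:

```python
def with_serial_jobs(parameters: dict) -> dict:
    """Return a copy of `parameters` with n_jobs set to 1 (hypothetical helper)."""
    serial = dict(parameters)  # shallow copy so the caller's dict is left untouched
    serial["n_jobs"] = 1       # 1 = run the fits one at a time, without worker processes
    return serial


# Trimmed-down stand-in for the real parameter list, for illustration only.
params = {"n_iter": 10, "n_jobs": -1}
serial_params = with_serial_jobs(params)
print(serial_params["n_jobs"], params["n_jobs"])
```

From R, the equivalent is setting n_jobs = 1 in the params list before calling train_xgb_crossvalidated. If the error goes away, the parallel backend is the likely trigger; if it does not, the cause lies elsewhere.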
Here is the sessionInfo;-
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_Singapore.1252
[2] LC_CTYPE=English_Singapore.1252
[3] LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_Singapore.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] dplyr_1.0.5 reticulate_1.18 devtools_2.3.2 usethis_2.0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 compiler_4.0.5 pillar_1.5.1
[4] prettyunits_1.1.1 remotes_2.2.0 tools_4.0.5
[7] testthat_3.0.2 pkgbuild_1.2.0 pkgload_1.2.0
[10] jsonlite_1.7.2 memoise_2.0.0 lifecycle_1.0.0
[13] tibble_3.1.0 lattice_0.20-41 pkgconfig_2.0.3
[16] rlang_0.4.10 Matrix_1.3-2 DBI_1.1.1
[19] cli_2.3.1 xfun_0.21 fastmap_1.1.0
[22] withr_2.4.1 knitr_1.31 desc_1.2.0
[25] generics_0.1.0 fs_1.5.0 vctrs_0.3.6
[28] tidyselect_1.1.0 rprojroot_2.0.2 grid_4.0.5
[31] glue_1.4.2 R6_2.5.0 processx_3.4.5
[34] fansi_0.4.2 sessioninfo_1.1.1 callr_3.5.1
[37] purrr_0.3.4 magrittr_2.0.1 ps_1.5.0
[40] ellipsis_0.3.1 assertthat_0.2.1 utf8_1.2.1
[43] cachem_1.0.4 crayon_1.4.1
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Apologies for dumping all of this on you. My Python knowledge is extremely basic, but I am keen to explore this ability to combine both languages.
What stupid mistake am I making, please?
I can't be sure, but that specific error is associated with problems with file paths. I see the traceback seems to be referencing OneDrive paths, but your early code seems to have a Linux/macOS path to Python (/usr/bin/python3). Generally, backslashes in Windows paths need to be escaped when used in Python. What OS are you using, and why is your virtualenv in a OneDrive path?
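To illustrate the point about backslashes, here is a small sketch of how Python reads the same Windows-style path written in different ways. The path used is hypothetical and chosen only because it contains \t and \n; it is not the path from the traceback.

```python
from pathlib import PureWindowsPath

# Hypothetical path, for illustration only.
plain = "C:\temp\new_env"      # \t and \n are read as a tab and a newline
escaped = "C:\\temp\\new_env"  # each backslash escaped explicitly
raw = r"C:\temp\new_env"       # raw string: backslashes are kept literally
forward = "C:/temp/new_env"    # forward slashes are also accepted for Windows paths

print(repr(plain))                                       # shows the tab and newline
print(escaped == raw)                                    # same string either way
print(PureWindowsPath(forward) == PureWindowsPath(raw))  # same path once parsed
```

The escaped, raw and forward-slash forms all describe the same location; only the unescaped form silently changes the string.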
Thank you for another excellent example of combining R & Python functionality @keithmcnulty. In trying to copy the example, I have chosen to invoke a Virtualenv using the following code;-
This seems to load all the Python functions into the Environment, the first two of which split and scale the datasets satisfactorily. However, I am getting the following error when running the train_xgb_crossvalidated function;-
Any idea what the problem might be, please?