mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

NameError: name '_' is not defined #275

Closed: danilofreire closed this issue 3 years ago

danilofreire commented 3 years ago

Hi! :)

Thanks for the great software! The AutoML() function throws the following error when I try to use it on macOS Catalina (10.15.7): NameError: name '_' is not defined. I tried running from django.utils.translation import gettext as _, which I read somewhere might help, but to no avail. I'm using Python 3.8.5, if that helps.

Any information is greatly appreciated! Many thanks!
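
For context on the message itself: _ is an ordinary Python name. The interactive interpreter binds it to the last displayed result, and gettext-style code conventionally binds it to a translation function, but a session that does neither leaves it undefined. The sketch below only illustrates that lookup behaviour; it does not identify where mljar-supervised or its dependencies reference _. The helper name lookup_underscore is hypothetical.

```python
def lookup_underscore(namespace):
    """Evaluate the bare name `_` in `namespace` and describe the outcome."""
    # Empty builtins so the result does not depend on whether the host
    # interpreter has already bound `_` (as an interactive session does).
    scope = {"__builtins__": {}, **namespace}
    try:
        return eval("_", scope)
    except NameError as exc:
        return f"NameError: {exc}"


# No binding anywhere: the lookup fails with the message from the report.
print(lookup_underscore({}))

# Binding `_` in the namespace where the code runs makes the lookup succeed.
print(lookup_underscore({"_": "bound"}))
```

Python resolves a name in the globals of the module that uses it and then in builtins, so importing _ into your own session does not define it inside another package's modules. That is consistent with the django import making no difference here.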

pplonski commented 3 years ago

Hi @danilofreire, did you see any errors during installation? Are you able to install the dtreeviz package on your machine without problems? Are you installing with pip?

pip install dtreeviz

There is sometimes a problem with the graphviz package when installing mljar-supervised, because dtreeviz depends on it.

danilofreire commented 3 years ago

Hi @pplonski, many thanks for your quick reply. I installed dtreeviz==1.0 and graphviz==0.9 with pip and unfortunately the error persists when I run the AutoML() command. I've just noticed that the error appears after the following lines:

5_Default_NeuralNetwork logloss 0.092119 trained in 1.75 seconds
NameError: name '_' is not defined

Maybe I should exclude the Neural Networks from the model? Thanks again!
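
One way to test that idea is the algorithms argument of AutoML, which lists the model families to train. A minimal sketch, assuming the algorithm names below match the ones your installed version accepts (check its documentation); the helper names are hypothetical. The error may well be unrelated to the neural network step, since it is only printed after that model has finished training.

```python
# Candidate model families; treat this list as an assumption and adjust it
# to the names your installed mljar-supervised version documents.
CANDIDATES = ["Baseline", "Linear", "Decision Tree", "Random Forest",
              "Xgboost", "Neural Network"]


def algorithms_without(excluded, candidates=CANDIDATES):
    """Return the candidate algorithms with the excluded names removed."""
    return [name for name in candidates if name not in excluded]


def build_automl(excluded=("Neural Network",)):
    """Create an AutoML object that skips the excluded model families."""
    # Imported lazily so the helper above works without the package installed.
    from supervised.automl import AutoML
    return AutoML(total_time_limit=900, random_state=48924,
                  algorithms=algorithms_without(excluded))
```

If the NameError still appears with the neural network excluded, the cause lies elsewhere in the run.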

pplonski commented 3 years ago

OK, so you are able to run AutoML - good. Could you send the code snippet that you are using (ideally with the dataset, or a link to it), so I can try to reproduce the bug?

You don't see this error for other models?

danilofreire commented 3 years ago

Yes, AutoML() does run here. I wrote some code in R and I call the AutoML() function via the reticulate R package. Here is the full code:

### R

# Install and load required packages
if (!require("reticulate")) {
    install.packages("reticulate")
}
if (!require("tidyverse")) {
    install.packages("tidyverse")
}

## Data wrangling

# Load data and select variables
load("fl.three.RData")

fl_data <- fl.three %>%
  select(onset, warl, gdpenl, lpopl1,
         lmtnest, ncontig, Oil, nwstate, instab,
         polity2l, ethfrac, relfrac) %>%
  mutate(onset = if_else(onset >= 1, 1, onset),
         onset = as.factor(onset),
         oil = Oil) %>%
  select(-Oil)

# Independent variables, dependent variable
fl_x <- fl_data %>% select(-onset)
fl_y <- fl_data %>% select(onset)

### Python

repl_python()
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, roc_auc_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    r.fl_x, r.fl_y, train_size=0.75, test_size=0.25,
    stratify=r.fl_y, random_state=48924)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

## mljar-supervised
from supervised.automl import AutoML
automl = AutoML(total_time_limit=900, random_state=48924)
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)

The complete dataset is available here. I was able to run automl models using h2o and tpot via reticulate with no errors. Thanks a lot for your help!

pplonski commented 3 years ago

Could you check whether the error still occurs when you save the training data to a CSV file and load it in a pure Python script? You can send me the CSV file so I can check this. This might be a bug in the package.

If you are looking for the best ML model (Kaggle-competition style), please set mode="Compete".
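
The check suggested above could look like the following pure-Python script. It is a sketch under assumptions: the file name fl_data.csv is hypothetical and stands for the data frame exported from R, the target column is onset as in the R code, and the helper names are illustrative. The mode argument is passed through only when given, so the same function covers both the original settings and mode="Compete".

```python
def load_xy(csv_path, target="onset"):
    """Read the exported CSV and split it into features and target."""
    import pandas as pd  # imported lazily; required to run this step
    df = pd.read_csv(csv_path)
    return df.drop(columns=[target]), df[target]


def automl_settings(mode=None, total_time_limit=900, random_state=48924):
    """Collect AutoML keyword arguments, adding `mode` only when given."""
    settings = {"total_time_limit": total_time_limit,
                "random_state": random_state}
    if mode is not None:
        settings["mode"] = mode
    return settings


def run(csv_path="fl_data.csv", mode=None):
    """Fit AutoML on a 75/25 stratified split and return test predictions."""
    from sklearn.model_selection import train_test_split
    from supervised.automl import AutoML

    X, y = load_xy(csv_path)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, stratify=y, random_state=48924)
    automl = AutoML(**automl_settings(mode))
    automl.fit(X_train, y_train)
    return automl.predict(X_test)
```

Calling run() reuses the time limit and seed from the reticulate session, and run(mode="Compete") switches to the competition-style mode. Running it outside reticulate separates a problem in the package from a problem in how the R session passes objects to Python.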

danilofreire commented 3 years ago

Hi, @pplonski! AutoML() runs without errors when I load it in pure Python, as you suggested. I think it's an issue with how reticulate handles Python variables. Many thanks!