sberbank-ai-lab / LightAutoML

LAMA - automatic model creation framework
Apache License 2.0

Task never completes (multiclass) #50

Closed darenr closed 3 years ago

darenr commented 3 years ago

The script below ran for many hours (MacBook Pro, current Intel model, no GPU, 16 GB RAM) before I killed it. Some runs raise an error but keep going: "An attempt has been made to start a new process before the current process has finished its bootstrapping phase."

I set a timeout of an hour, which seems to be ignored. I've tried different algorithms but can't get this to produce a model; every time I give up after running it all night.

import pandas as pd
from lightautoml.automl.presets.text_presets import TabularNLPAutoML
from lightautoml.tasks import Task
from sklearn.model_selection import train_test_split

df = pd.read_json("https://github.com/nomadotto/News_Classifier/blob/master/News_Category_Dataset_v2.json?raw=true", lines=True)

print(df.head())

automl = TabularNLPAutoML(
    task=Task("multiclass"),
    timeout=3600,
    verbose=2,
    general_params={"use_algos": ["lgb", "cb"]},
    reader_params={"cv": 5, "random_state": 42},
    text_params={"lang": "en"},
    gbm_pipeline_params={"text_features": "tfidf"},
    tfidf_params={
        "svd": True,
        "tfidf_params": {
            "ngram_range": (1, 2),
            "sublinear_tf": True,
            "max_features": 1500,
        },
    },
)

print("splitting...")
df_train, df_test = train_test_split(
    df,
    test_size=0.2,
    shuffle=True,
    random_state=42,
)

print("fitting...")
oof_pred = automl.fit_predict(
    df_train,
    roles={
        "target": "category",
        "text": ["headline", "short_description"],
        "drop": ["authors", "link", "date"]
    }
)

print(oof_pred)
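The "bootstrapping phase" message quoted above is the standard Python multiprocessing error on macOS: worker processes are started with the "spawn" method, which re-imports the main module, so any top-level code that launches workers must sit behind a __main__ guard. A minimal, library-independent sketch of the pattern (the square function is purely illustrative; in the script above, the automl construction and fit_predict call would go inside main):

```python
import multiprocessing as mp

def square(x):
    return x * x

def main():
    # Any work that spawns processes belongs under the guard below.
    # Without it, each spawned worker re-executes this module's top-level
    # code and raises the "bootstrapping phase" error.
    with mp.Pool(2) as pool:
        results = pool.map(square, [0, 1, 2])
    print(results)

if __name__ == "__main__":
    main()
```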

DESimakov commented 3 years ago

Sorry for the late response.

The reason for the slow training is LightGBM: in multiclass mode it builds one tree per class per boosting round, so the total number of trees increases by a factor of 41 (the number of classes).
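To make the 41x factor concrete, a back-of-the-envelope tree count (the number of boosting rounds is illustrative, not taken from the run above):

```python
# Multiclass LightGBM grows one tree per class per boosting round,
# so the tree count scales linearly with the number of classes.
n_classes = 41      # categories in News_Category_Dataset_v2
n_rounds = 200      # illustrative number of boosting rounds
trees_binary = n_rounds * 1          # binary task: one tree per round
trees_multiclass = n_rounds * n_classes
print(trees_binary, trees_multiclass)  # 200 8200
```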

Possible ways to speed up training:
1) Decrease the number of features: for example, 'n_components': 10 (down from 100) for the SVD step.
2) Disable LightGBM (CatBoost is slightly faster). The linear model is the fastest of all.
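The suggested settings could be expressed as preset parameters roughly like this. This is a sketch in plain dicts mirroring TabularNLPAutoML keyword arguments; the "n_components" key placement and the "linear_l2" algorithm name are assumptions based on common LightAutoML usage and should be checked against the installed version:

```python
# Speed-oriented settings following the advice above (names not verified
# against a specific LightAutoML release).
general_params = {"use_algos": ["linear_l2"]}  # linear model only: fastest
tfidf_params = {
    "svd": True,
    "n_components": 10,  # reduced from the default of 100
    "tfidf_params": {
        "ngram_range": (1, 2),
        "sublinear_tf": True,
        "max_features": 1500,
    },
}
print(general_params["use_algos"], tfidf_params["n_components"])
```

These dicts would then be passed as the general_params and tfidf_params arguments of TabularNLPAutoML in the original script.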