nyaggle is a utility library for Kaggle and offline competitions. It is particularly focused on experiment tracking, feature engineering, and validation.
You can install nyaggle via pip:
pip install nyaggle
run_experiment() is a high-level API for experiments with cross validation.
It outputs parameters, metrics, out-of-fold predictions, test predictions, feature importance, and submission.csv under the specified directory.
To enable mlflow tracking, pass the optional with_mlflow=True parameter.
from sklearn.model_selection import train_test_split
from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df
X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)
params = {
'n_estimators': 1000,
'max_depth': 8
}
result = run_experiment(params,
X_train,
y_train,
X_test)
# A single API call returns the outputs typically needed in data science competitions
print(result.test_prediction) # Test prediction in numpy array
print(result.oof_prediction) # Out-of-fold prediction in numpy array
print(result.models) # Trained models for each fold
print(result.importance) # Feature importance for each fold
print(result.metrics) # Evaluation metrics for each fold
print(result.time) # Elapsed time
print(result.submission_df) # The output dataframe saved as submission.csv
# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).
# You can use it with mlflow and track your experiments through mlflow-ui
result = run_experiment(params,
X_train,
y_train,
X_test,
with_mlflow=True)
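To make the term "out-of-fold prediction" concrete, here is a minimal NumPy-only sketch. It is not how run_experiment() is implemented: the helper name out_of_fold_predictions, the least-squares model, and the synthetic data are all assumptions made for illustration. The point is only that every row is predicted by a model that never saw it during fitting.

```python
import numpy as np


def out_of_fold_predictions(X, y, n_splits=5, seed=0):
    """Fit a least-squares model per fold and predict only the held-out rows.

    Hypothetical helper for illustration; not part of nyaggle.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.full(len(X), np.nan)
    for valid_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[valid_idx] = False
        # Fit on the training part of this fold only.
        coef, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        # Rows in valid_idx were not used to fit coef.
        oof[valid_idx] = X[valid_idx] @ coef
    return oof


# Synthetic regression data (assumed, for demonstration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

oof = out_of_fold_predictions(X, y)
print(oof.shape)                 # one prediction per training row
print(np.mean((oof - y) ** 2))   # an honest estimate of generalization error
```

Because the error is computed on rows each model did not see, it can serve as a cross-validation score without a separate hold-out set.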
nyaggle also has a low-level API with an interface similar to mlflow tracking and wandb.
from nyaggle.experiment import Experiment
# `predicted` (numpy ndarray) and `sub` (pandas dataframe) are assumed to be defined beforehand
with Experiment(logging_directory='./output/') as exp:
    # log key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')
    # log text
    exp.log('blah blah blah')
    # log metric
    exp.log_metric('CV', 0.85)
    # log numpy ndarray, pandas dataframe and any artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
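The context-manager style used above can be sketched with the standard library alone. The class below is a hypothetical stand-in, not nyaggle's Experiment: the name MiniExperiment and the file names params.json and metrics.json are assumptions chosen for illustration. It only shows the general idea of collecting values inside a `with` block and writing them to the logging directory on exit.

```python
import json
import tempfile
from pathlib import Path


class MiniExperiment:
    """Hypothetical tracking context for illustration; not nyaggle's Experiment."""

    def __init__(self, logging_directory):
        self.dir = Path(logging_directory)
        self.params = {}
        self.metrics = {}

    def __enter__(self):
        self.dir.mkdir(parents=True, exist_ok=True)
        return self

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, name, score):
        self.metrics[name] = score

    def __exit__(self, exc_type, exc_value, traceback):
        # Write everything on exit, even if the body raised.
        (self.dir / "params.json").write_text(json.dumps(self.params, indent=2))
        (self.dir / "metrics.json").write_text(json.dumps(self.metrics, indent=2))
        return False  # do not swallow exceptions


out_dir = Path(tempfile.mkdtemp()) / "exp"
with MiniExperiment(out_dir) as exp:
    exp.log_param("lr", 0.01)
    exp.log_metric("CV", 0.85)

saved_params = json.loads((out_dir / "params.json").read_text())
saved_metrics = json.loads((out_dir / "metrics.json").read_text())
print(saved_params, saved_metrics)
```

Writing in `__exit__` means a run that crashes halfway still leaves its parameters and partial metrics on disk.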
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from nyaggle.feature.category_encoder import TargetEncoder
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all = pd.concat([train, test]).copy()
cat_cols = [c for c in train.columns if train[c].dtype == object]  # np.object was removed in NumPy 1.24
target_col = 'y'
kf = KFold(5)
# Target encoding with K-fold
te = TargetEncoder(kf.split(train))
# use fit/fit_transform to train data, then apply transform to test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])
# ... or just call fit_transform to concatenated data
all.loc[:, cat_cols] = te.fit_transform(all[cat_cols], all[target_col])
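For readers unfamiliar with the technique, the idea behind K-fold target encoding can be sketched in plain pandas. This is not nyaggle's TargetEncoder: the function kfold_target_encode, the toy columns city and y, and the unshuffled contiguous folds are assumptions made to keep the example short. Each row's category is replaced by the target mean computed on the other folds, so a row's own target never leaks into its feature.

```python
import numpy as np
import pandas as pd


def kfold_target_encode(df, cat_col, target_col, n_splits=5):
    """Replace each category with the target mean computed on the other folds.

    Hypothetical helper for illustration; not part of nyaggle.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    # Contiguous, unshuffled folds keep the sketch deterministic.
    folds = np.array_split(np.arange(len(df)), n_splits)
    for valid_pos in folds:
        train_mask = np.ones(len(df), dtype=bool)
        train_mask[valid_pos] = False
        # Category means from rows outside the current fold only.
        means = df.loc[train_mask].groupby(cat_col)[target_col].mean()
        # Categories unseen in the other folds fall back to the global mean.
        encoded.iloc[valid_pos] = (
            df[cat_col].iloc[valid_pos].map(means).fillna(global_mean).to_numpy()
        )
    return encoded


df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b"],
    "y": [1, 0, 1, 1, 1, 0],
})
df["city_te"] = kfold_target_encode(df, "city", "y", n_splits=3)
print(df)
```

In practice the folds would usually be shuffled or stratified, which is what passing a splitter such as KFold to the encoder allows.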
You need to install PyTorch in your virtual environment to use BertSentenceVectorizer. MeCab and mecab-python3 are also required if you use the Japanese BERT model.
import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all = pd.concat([train, test]).copy()
text_cols = ['body']
target_col = 'y'
group_col = 'user_id'
# extract BERT-based sentence vector
bv = BertSentenceVectorizer(text_columns=text_cols)
text_vector = bv.fit_transform(train)
# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)
text_vector_svd = bv.fit_transform(train)
# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')
japanese_text_vector = bv.fit_transform(train)
import pandas as pd
from nyaggle.validation import adversarial_validate
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
auc, importance = adversarial_validate(train, test, importance_type='gain')
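The idea behind adversarial validation can be sketched with scikit-learn alone. This is not nyaggle's adversarial_validate: the helper adversarial_auc, the logistic regression classifier, and the synthetic data are assumptions for illustration. A classifier is trained to tell train rows from test rows; an AUC near 0.5 suggests the two sets look alike, while an AUC near 1.0 suggests a distribution shift.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(train, test, seed=0):
    """AUC of a classifier that tries to tell train rows from test rows.

    Hypothetical helper for illustration; not part of nyaggle.
    """
    X = pd.concat([train, test], ignore_index=True)
    # Label 0 for train rows, 1 for test rows.
    is_test = np.r_[np.zeros(len(train), dtype=int), np.ones(len(test), dtype=int)]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Out-of-fold probabilities, so the AUC is not inflated by overfitting.
    proba = cross_val_predict(
        LogisticRegression(), X, is_test, cv=cv, method="predict_proba"
    )[:, 1]
    return roc_auc_score(is_test, proba)


# Synthetic data (assumed): one test set matches train, the other is shifted.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.normal(0.0, 1.0, 300)})
same = pd.DataFrame({"x": rng.normal(0.0, 1.0, 300)})
shifted = pd.DataFrame({"x": rng.normal(3.0, 1.0, 300)})

auc_same = adversarial_auc(train, same)
auc_shifted = adversarial_auc(train, shifted)
print(auc_same, auc_shifted)
```

When the AUC is high, the features that drive the classifier are the ones to inspect first, which is the role of the importance value in the call above.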
nyaggle provides a set of validation splitters that are compatible with sklearn.
import pandas as pd
from sklearn.model_selection import cross_validate, KFold
from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth
train = pd.read_csv('train.csv', parse_dates=['dt'])  # parse_dates expects a list of column names
# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))
cross_validate(..., cv=ts)
# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))
# skip the first 3 folds, and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))
# evaluate 1st fold
cross_validate(..., cv=Nth(1, ts))
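A wrapper that limits the number of folds can be sketched in a few lines. The classes below are hypothetical and are not nyaggle's Take or sklearn's KFold: ContiguousFolds is a tiny stand-in base splitter and TakeFirst is an assumed name for the wrapper. Both expose the split() and get_n_splits() methods that scikit-learn looks for on a cv object.

```python
from itertools import islice


class ContiguousFolds:
    """Tiny stand-in for a base splitter (contiguous folds, no shuffling)."""

    def __init__(self, n_splits):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n = len(X)
        bounds = [n * i // self.n_splits for i in range(self.n_splits + 1)]
        for start, stop in zip(bounds, bounds[1:]):
            test = list(range(start, stop))
            train = [i for i in range(n) if i < start or i >= stop]
            yield train, test


class TakeFirst:
    """Hypothetical wrapper: expose only the first n folds of a base splitter."""

    def __init__(self, n, base):
        self.n = n
        self.base = base

    def get_n_splits(self, X=None, y=None, groups=None):
        return min(self.n, self.base.get_n_splits(X, y, groups))

    def split(self, X, y=None, groups=None):
        # islice stops consuming the base generator after n folds.
        return islice(self.base.split(X, y, groups), self.n)


data = list(range(10))
folds = list(TakeFirst(2, ContiguousFolds(5)).split(data))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Evaluating only a few folds of a larger split trades some variance in the score for a proportionally shorter run, which is useful for quick iterations.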
Here is a list of awesome repositories that provide general utility functions for data science competitions. Please let me know if you have another one :)