
:no_entry: [DEPRECATED]

oo-learning

oo-learning is a Python machine learning library based on Object Oriented design principles.

The goal of the project is to allow users to quickly explore data and search for top machine learning algorithm candidates for a given dataset.

More specifically, this library implements common workflows for exploring data and trying various feature engineering techniques and machine learning algorithms.

Because of the object-oriented design, the user can easily interchange the various objects (e.g. transformations, evaluators, resampling techniques), and even extend them or build their own.

After model selection, users implementing the chosen model in a production system may want to switch to a more mature library such as scikit-learn.

Installing

pip install oolearning

Mac M1

brew install cmake libomp
conda install lightgbm
conda install py-xgboost

References:

Conventions / Definitions

Conventions

Class Terminology

Examples

https://github.com/shane-kercheval/oo-learning/tree/master/examples/classification-titanic

ModelTrainer Snippet

# define how we want to split the training/holding datasets
splitter = ClassificationStratifiedDataSplitter(holdout_ratio=0.2)

# define the transformations
transformations = [RemoveColumnsTransformer(['PassengerId', 'Name', 'Ticket', 'Cabin']),
                   CategoricConverterTransformer(['Pclass', 'SibSp', 'Parch']),
                   ImputationTransformer(),
                   DummyEncodeTransformer(CategoricalEncoding.DUMMY)]

# Define how we want to evaluate (and convert the probabilities DataFrame to predicted classes)
evaluator = TwoClassProbabilityEvaluator(...)

# give the objects, which encapsulate the behavior of everything involved with training the model, to our ModelTrainer
trainer = ModelTrainer(model=LogisticClassifier(),
                       model_transformations=transformations,
                       splitter=splitter,
                       evaluator=evaluator)
trainer.train(data=data, target_variable='Survived', hyper_params=LogisticClassifierHP())

# access the holdout metrics
trainer.holdout_evaluator.all_quality_metrics

Code Snippet from Training a model notebook.
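
Because the ModelTrainer only interacts with the shared interfaces of the objects passed to it (the interchangeability described above), swapping in a different algorithm is a small change. A minimal sketch, reusing the splitter, transformations, and evaluator defined above; RandomForestClassifier and RandomForestHP appear in later snippets in this README, and using their default constructor arguments here is an assumption:

# same workflow, different algorithm: only the model and its hyper-parameter object change
rf_trainer = ModelTrainer(model=RandomForestClassifier(),
                          model_transformations=transformations,
                          splitter=splitter,
                          evaluator=evaluator)
rf_trainer.train(data=data, target_variable='Survived', hyper_params=RandomForestHP())

# access the holdout metrics, exactly as with the logistic model above
rf_trainer.holdout_evaluator.all_quality_metrics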

GridSearchModelTuner Snippet

# define the transformations
transformations = [RemoveColumnsTransformer(['PassengerId', 'Name', 'Ticket', 'Cabin']),
                   CategoricConverterTransformer(['Pclass', 'SibSp', 'Parch']),
                   ImputationTransformer(),
                   DummyEncodeTransformer(CategoricalEncoding.ONE_HOT)]

# define the scores, which will be used to compare the performance across hyper-param combinations
# the scores need a Converter, which contains the logic necessary to convert the predicted values to a predicted class.
score_list = [AucRocScore(positive_class='lived'),
              SensitivityScore(...),
              SpecificityScore(...),
              ErrorRateScore(...)]

# define/configure the resampler
resampler = RepeatedCrossValidationResampler(model=RandomForestClassifier(),  # using a Random Forest model
                                             transformations=transformations,
                                             scores=score_list,
                                             folds=5,
                                             repeats=5)
# define/configure the GridSearchModelTuner
tuner = GridSearchModelTuner(resampler,
                             hyper_param_object=RandomForestHP())  # hyper-parameter object specific to Random Forest

# define the parameter values (and, therefore, combinations) we want to try 
params_dict = dict(criterion='gini',
                   max_features=[1, 5, 10],
                   n_estimators=[10, 100, 500],
                   min_samples_leaf=[1, 50, 100])
grid = HyperParamsGrid(params_dict=params_dict)

tuner.tune(data_x=training_x, data_y=training_y, params_grid=grid)
tuner.results.get_heatmap()

Code Snippet from Tuning notebook.
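
The Score constructors above are shown elided with (...). The ModelSearcher snippet below passes a TwoClassThresholdConverter to SensitivityScore, so the converter-based scores here might be filled in along the following lines; this is a sketch, and the assumption that SpecificityScore and ErrorRateScore accept the same converter argument (and the chosen threshold/positive class) should be checked against the library:

# hypothetical configuration of the scores elided above, mirroring the Converter usage in the ModelSearcher snippet
score_list = [AucRocScore(positive_class='lived'),
              SensitivityScore(converter=TwoClassThresholdConverter(threshold=0.5, positive_class='lived')),
              SpecificityScore(converter=TwoClassThresholdConverter(threshold=0.5, positive_class='lived')),
              ErrorRateScore(converter=TwoClassThresholdConverter(threshold=0.5, positive_class='lived'))]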

ModelSearcher Snippet

# Logistic Regression Hyper-Param Grid
log_grid = HyperParamsGrid(params_dict=dict(penalty=['l1', 'l2'],
                                            regularization_inverse=[0.001, 0.01, 0.1, 1, 100, 1000]))

# get the expected columns at the time we do the training, based on the transformations 
columns = TransformerPipeline.get_expected_columns(transformations=global_transformations,
                                                   data=explore.dataset.drop(columns=[target_variable]))
# Random Forest Hyper-Param Grid
rm_grid = HyperParamsGrid(params_dict=dict(criterion='gini',
                                           max_features=[int(round(len(columns) ** (1 / 2.0))),
                                                         int(round(len(columns) / 2)),
                                                         len(columns) - 1],
                                           n_estimators=[10, 100, 500],
                                           min_samples_leaf=[1, 50, 100]))

# define the models and hyper-parameters that we want to search through
infos = [ModelInfo(description='dummy_stratified',
                   model=DummyClassifier(DummyClassifierStrategy.STRATIFIED),
                   transformations=None,
                   hyper_params=None,
                   hyper_params_grid=None),
         ModelInfo(description='dummy_frequent',
                   model=DummyClassifier(DummyClassifierStrategy.MOST_FREQUENT),
                   transformations=None,
                   hyper_params=None,
                   hyper_params_grid=None),
         ModelInfo(description='Logistic Regression',
                   model=LogisticClassifier(),
                   # transformations specific to this model
                   transformations=[CenterScaleTransformer(),
                                    RemoveCorrelationsTransformer()],
                   hyper_params=LogisticClassifierHP(),
                   hyper_params_grid=log_grid),
         ModelInfo(description='Random Forest',
                   model=RandomForestClassifier(),
                   transformations=None,
                   hyper_params=RandomForestHP(),
                   hyper_params_grid=rm_grid)]

# define the transformations that will be applied to ALL models
global_transformations = [RemoveColumnsTransformer(['PassengerId', 'Name', 'Ticket', 'Cabin']),
                          CategoricConverterTransformer(['Pclass', 'SibSp', 'Parch']),
                          ImputationTransformer(),
                          DummyEncodeTransformer(CategoricalEncoding.ONE_HOT)]

# define the Score objects, which will be used to choose the "best" hyper-parameters for a particular model
# and to compare the performance across models/hyper-params
score_list = [AucRocScore(positive_class='lived'),
              # the SensitivityScore needs a Converter, which contains the logic necessary
              # to convert the predicted values to a predicted class
              SensitivityScore(converter=TwoClassThresholdConverter(threshold=0.5, positive_class='lived'))]

# create the ModelSearcher object
searcher = ModelSearcher(global_transformations=global_transformations,
                         model_infos=infos,
                         splitter=ClassificationStratifiedDataSplitter(holdout_ratio=0.25),
                         resampler_function=lambda m, mt: RepeatedCrossValidationResampler(
                             model=m,
                             transformations=mt,
                             scores=score_list,
                             folds=5,
                             repeats=3))
searcher.search(data=explore.dataset, target_variable='Survived')

Code Snippet from Searcher notebook.
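
The resampler_function above acts as a factory: the searcher evidently calls it for each ModelInfo, passing that model and its model-specific transformations, so every candidate gets its own resampler. If the lambda is hard to read, the same factory can be written as a named function; a sketch using only the objects already defined in the snippet above:

# equivalent to the lambda passed as resampler_function above
def build_resampler(model, model_transformations):
    return RepeatedCrossValidationResampler(model=model,
                                            transformations=model_transformations,
                                            scores=score_list,
                                            folds=5,
                                            repeats=3)

searcher = ModelSearcher(global_transformations=global_transformations,
                         model_infos=infos,
                         splitter=ClassificationStratifiedDataSplitter(holdout_ratio=0.25),
                         resampler_function=build_resampler)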

Known Issues

Available Models

R = Regression; 2C = Two-class Classification; MC = Multi-class Classification

Future (TBD)

Unit Tests

The unit tests in this project are all found in the tests directory.

In the terminal, run the following in the root project directory:

python -m unittest discover ./tests
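
To run the suite with verbose output, or to limit discovery to a single test file, the standard unittest CLI flags apply (the file name below is hypothetical; substitute a real file from the tests directory):

# verbose run of the full suite
python -m unittest discover ./tests -v

# run a single test file by name pattern (hypothetical file name)
python -m unittest discover ./tests -p "test_transformers.py"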