quaquel / EMAworkbench

workbench for performing exploratory modeling and analysis
BSD 3-Clause "New" or "Revised" License

Add connectors for common machine-learning models #351

Open EwoutH opened 8 months ago

EwoutH commented 8 months ago

Motivation

Machine learning models are generally black boxes: their internal workings are difficult to inspect or explain. Meanwhile, they often provide state-of-the-art performance in predictive tasks. There could be opportunities to understand these black boxes better by doing, for example:

And basically most other features of the workbench.

Problem

Currently there are no connectors to use machine-learned models in the EMAworkbench.

One notable thing is that both the input nodes (uncertainties / policy levers) and the output nodes (outcomes) need to be nameable, since most of the useful analyses in the workbench depend on named variables.
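To illustrate the naming requirement, here is a minimal sketch in plain Python, independent of the workbench. `StubModel` and `predict_named` are hypothetical names introduced only for this example; the sole assumption about the model is a scikit-learn-style `predict` method that takes a list of rows and returns one row of outputs per input row.

```python
class StubModel:
    """Hypothetical stand-in for a trained model with a scikit-learn-style predict()."""

    def predict(self, rows):
        # One output row per input row: [sum of the features, product of the first two].
        return [[sum(row), row[0] * row[1]] for row in rows]


def predict_named(model, input_features, output_variables, experiment):
    """Map a dict of named inputs to a dict of named outputs."""
    # Order the inputs the way the model expects; the names are supplied by the analyst.
    row = [experiment[name] for name in input_features]
    outputs = model.predict([row])[0]
    if len(outputs) != len(output_variables):
        raise ValueError(
            f"model returned {len(outputs)} outputs, "
            f"but {len(output_variables)} names were given"
        )
    return dict(zip(output_variables, outputs))


result = predict_named(
    StubModel(),
    input_features=["rainfall", "demand"],
    output_variables=["total", "interaction"],
    experiment={"demand": 3.0, "rainfall": 2.0},
)
print(result)
```

The order of `input_features` decides the column order seen by the model, so the order of keys in the experiment itself does not matter.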

Proposal

Create said connectors for the most common formats.

Sample implementation

These sample implementations support the most common model storage formats (Pickle/Joblib, HDF5, and ONNX). Given these formats serve different types of models (e.g., scikit-learn models, TensorFlow/Keras deep learning models, and cross-platform ONNX models), there will be a generic MLModelConnector base class. This class will provide the foundational structure and methods, which specialized connectors can then extend and customize based on the specific requirements of each format.

Base Connector Class

The base class will define the common interface and utilities for initializing models, running experiments (making predictions), and resetting models if necessary. It assumes models are used for prediction in a policy analysis context, focusing on loading the model and making predictions based on input parameters provided by the EMA Workbench.

from ema_workbench.em_framework.model import SingleReplication


class MLModelConnector(SingleReplication):
    def __init__(self, name, model_path=None, input_features=None, output_variables=None):
        super().__init__(name)
        self.model_path = model_path
        # Copy into fresh lists so that instances never share a mutable default.
        self.input_features = list(input_features or [])
        self.output_variables = list(output_variables or [])
        self.model = None

    def model_init(self, policy, **kwargs):
        # Defer to the workbench's own initialization. Subclasses call this via
        # super() and then load the model from self.model_path, so it must not raise.
        super().model_init(policy, **kwargs)

    def run_experiment(self, experiment):
        raise NotImplementedError("Experiment execution must be implemented by subclasses")

    def reset_model(self):
        # Reset logic here if applicable. Some models may not require reset.
        pass

Pickle/Joblib Connector

This connector is tailored for loading and executing scikit-learn models (or any other model object that exposes a predict method) serialized with pickle or joblib.

import joblib
import numpy as np


class PickleJoblibConnector(MLModelConnector):
    def model_init(self, policy, **kwargs):
        super().model_init(policy)
        # Load the model from a pickle or joblib file.
        self.model = joblib.load(self.model_path)

    def run_experiment(self, experiment):
        # Create input array based on named input_features
        X = [experiment[feature] for feature in self.input_features]
        # predict returns one row per sample; flatten the single row so that
        # multi-output models yield one value per output variable.
        predictions = np.asarray(self.model.predict([X])).ravel()

        # Map predictions to named output_variables
        return dict(zip(self.output_variables, predictions))
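The load-once, predict-many lifecycle of this connector can be tried out without the workbench or scikit-learn. The following is a rough sketch using only the standard library's `pickle`; `LinearModel` and `PickledModelRunner` are hypothetical stand-ins invented for this example, not part of the workbench or of the proposal above.

```python
import os
import pickle
import tempfile


class LinearModel:
    """Hypothetical stand-in for a trained estimator with a predict() method."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, rows):
        # One prediction per input row.
        return [sum(w * x for w, x in zip(self.weights, row)) + self.bias for row in rows]


class PickledModelRunner:
    """Loads the model once, then answers one experiment at a time."""

    def __init__(self, model_path, input_features, output_variable):
        self.model_path = model_path
        self.input_features = list(input_features)
        self.output_variable = output_variable
        self.model = None

    def model_init(self):
        # Only unpickle files from a trusted source: unpickling can run arbitrary code.
        with open(self.model_path, "rb") as fh:
            self.model = pickle.load(fh)

    def run_experiment(self, experiment):
        if self.model is None:
            self.model_init()
        row = [experiment[name] for name in self.input_features]
        return {self.output_variable: self.model.predict([row])[0]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as fh:
        pickle.dump(LinearModel(weights=[2.0, 0.5], bias=1.0), fh)

    runner = PickledModelRunner(path, ["x1", "x2"], "y")
    outcome = runner.run_experiment({"x1": 3.0, "x2": 4.0})

print(outcome)
```

Two practical points follow from how pickle works: the class of the pickled object has to be importable in the process that loads it, and model files should only be loaded from sources you trust.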

HDF5 Connector for Keras/TensorFlow Models

This connector focuses on deep learning models saved in the HDF5 format by TensorFlow/Keras.

import numpy as np
from tensorflow.keras.models import load_model

class HDF5Connector(MLModelConnector):
    def model_init(self, policy, **kwargs):
        super().model_init(policy)
        # Load the TensorFlow/Keras model from an HDF5 file.
        self.model = load_model(self.model_path)

    def run_experiment(self, experiment):
        # Create input array based on named input_features, ensuring correct shape
        X = np.array([[experiment[feature] for feature in self.input_features]])
        predictions = self.model.predict(X)

        # Assuming the model has a single output. Adjust for models with multiple outputs.
        # Map predictions to named output_variables
        return {self.output_variables[i]: prediction for i, prediction in enumerate(predictions.flatten())}

ONNX Connector

The ONNX connector is for models exported in the ONNX format, enabling cross-platform interoperability.

import numpy as np
import onnxruntime as ort


class ONNXConnector(MLModelConnector):
    def model_init(self, policy, **kwargs):
        super().model_init(policy)
        # Initialize ONNX runtime session for the model.
        self.session = ort.InferenceSession(self.model_path)
        self.input_name = self.session.get_inputs()[0].name  # Assumes a single input. Adjust as necessary.

    def run_experiment(self, experiment):
        # Create input array based on named input_features, correctly shaped for ONNX
        X = np.array([[experiment[feature] for feature in self.input_features]], dtype=np.float32)
        input_dict = {self.input_name: X}
        outputs = [node.name for node in self.session.get_outputs()]  # Get output node names

        predictions = self.session.run(outputs, input_dict)

        # Map predictions to named output_variables. This assumes a direct mapping and
        # that the length of predictions matches the number of output_variables.
        # Adjust as necessary for complex models with multiple outputs.
        return {self.output_variables[i]: predictions[0][0, i] for i in range(len(self.output_variables))}
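The mapping above reads only the first output tensor. ONNX Runtime's `InferenceSession.run` returns a list with one array per requested output node, so a model with several output nodes would need each entry flattened in turn. A minimal sketch of that step follows, using nested lists as stand-ins for the arrays so that it runs without ONNX Runtime; `flatten_outputs` is a hypothetical helper, and it assumes a single input row with every output shaped `(1, n_values)`.

```python
def flatten_outputs(predictions, output_variables):
    """Flatten the per-node outputs for a single sample into named outcomes.

    `predictions` mimics the list returned for one input row: one entry per
    output node, each shaped (1, n_values_for_that_node).
    """
    values = []
    for node_output in predictions:
        # Take the only row in the batch and append its values in order.
        values.extend(node_output[0])
    if len(values) != len(output_variables):
        raise ValueError(
            f"got {len(values)} output values for {len(output_variables)} names"
        )
    return dict(zip(output_variables, values))


# Stand-in for the result of a run with two output nodes.
fake_predictions = [[[0.25, 0.75]], [[12.0]]]
named = flatten_outputs(fake_predictions, ["p_low", "p_high", "cost"])
print(named)
```

Raising on a length mismatch, rather than silently truncating, makes a wrong `output_variables` list visible at the first experiment.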
EwoutH commented 8 months ago

Possible example models: