google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Converting from Scikit-Learn or ONNX models #25

Closed jiazou-bigdata closed 2 years ago

jiazou-bigdata commented 2 years ago

We have trained tree ensemble models using Scikit-Learn (converted to ONNX) and XGBoost. Is there a way to load these models into Yggdrasil so that we can leverage its fast serving engine for inference?

achoum commented 2 years ago

Yes.

The Scikit-Learn Converter is a Python script that converts tree-based Scikit-Learn models into TensorFlow Decision Forests (TF-DF) models.

Since TF-DF is built on top of Yggdrasil, you can use the entire Yggdrasil framework (including the inference code) on TF-DF models.

jiazou-bigdata commented 2 years ago

That works! Thanks. Is there any tfdf converter for XGBoost models? If not, is it easy for us to develop such a converter? Any suggestions?

achoum commented 2 years ago

Great to hear the Scikit-learn converter works for your case :).

I am not aware of a converter from XGBoost→TF-DF or XGBoost→Scikit-learn. It seems the only way to access the tree structure of an XGBoost model programmatically in Python is to parse the string returned by dump_model (see this post).

Once you figure out a way to obtain the structure of the trees, you can use the TF-DF model builder (example) to create the TF-DF model. This is what the Scikit-learn→TF-DF converter does.

Please keep us posted if you find a better alternative, or if you make progress in this direction!
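
As a rough illustration of what "obtaining the structure of the trees" can look like, here is a minimal sketch that walks one tree in the JSON shape produced by booster.get_dump(dump_format="json"). The sample tree and the summarize helper are hypothetical and hand-written for illustration; the sketch also assumes XGBoost's default feature names of the form "f<index>".

```python
import json

# Hand-written sample in the shape produced by
# booster.get_dump(dump_format="json"); all values are made up.
SAMPLE_DUMP = """
{"nodeid": 0, "depth": 0, "split": "f2", "split_condition": 0.5,
 "yes": 1, "no": 2, "missing": 1,
 "children": [
   {"nodeid": 1, "depth": 1, "split": "f0", "split_condition": -1.25,
    "yes": 3, "no": 4, "missing": 3,
    "children": [{"nodeid": 3, "leaf": -0.4}, {"nodeid": 4, "leaf": 0.1}]},
   {"nodeid": 2, "leaf": 0.3}
 ]}
"""


def summarize(node, depth=0):
    """Returns (leaf_values, used_feature_indices, max_depth) for a subtree."""
    if "leaf" in node:
        return [node["leaf"]], set(), depth
    # Assumes default feature names such as "f2"; strip the leading "f".
    features = {int(node["split"][1:])}
    leaves, max_depth = [], depth
    for child in node["children"]:
        child_leaves, child_features, child_depth = summarize(child, depth + 1)
        leaves += child_leaves
        features |= child_features
        max_depth = max(max_depth, child_depth)
    return leaves, features, max_depth


tree = json.loads(SAMPLE_DUMP)
leaves, features, max_depth = summarize(tree)
print(leaves, sorted(features), max_depth)
```

The same recursive pattern (leaf versus node with "children") is what a converter would use to emit builder nodes instead of summary statistics.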

jiazou-bigdata commented 2 years ago

Thank you so much. I found a way to call get_dump on an XGBoost model to dump each tree as a JSON string. I've also written some code like the following, but I got "ValueError: A GBT model should only have leaf with regressive value. Got -0.564044952 instead."

I am investigating this. Any suggestions are welcome!

def getTrees(xgboost_model):
    params = xgboost_model.get_xgb_params()
    objective = params["objective"]
    base_score = params["base_score"]
    if base_score is None:
        base_score = 0.5
    booster = xgboost_model.get_booster()
    # The json format was available in October 2017.
    # XGBoost 0.7 was the first version released with it.
    js_tree_list = booster.get_dump(with_stats=True, dump_format='json')
    js_trees = [json.loads(s) for s in js_tree_list]
    return objective, base_score, js_trees

def build_tfdf_model(
    xgboost_model: XGBClassifier,
    path: os.PathLike,
) -> tf.keras.Model:
  """Converts a XGBoost model into a TFDF model."""
  bias = 0.0

  gbt_builder = tfdf.builder.GradientBoostedTreeBuilder(
      path=path,
      objective=tfdf.py_tree.objective.ClassificationObjective(label="label",
          classes=[str(c) for c in xgboost_model.classes_],
      ),
      bias=bias,
  )

  objective, base_score, js_trees = getTrees(xgboost_model)

  params = xgboost_model.get_xgb_params()

  for jstree in js_trees:
    gbt_builder.add_tree(convert_xgboost_tree_to_tfdf_pytree(
        jstree
    ))
  gbt_builder.close()
  return tf.keras.models.load_model(path)

def convert_xgboost_tree_to_tfdf_pytree(
    xgboost_tree: str,
) -> tfdf.py_tree.tree.Tree:
  """Converts a XGBoost decision tree into a TFDF pytree.

  Args:
    xgboost_tree: a XGBoost decision tree in JSON format.

  Returns:
    a TFDF pytree that has the same structure as the xgboost tree.
  """

  root_node = _convert_xgboost_node_to_tfdf_node(
      xgboost_tree
  )
  return tfdf.py_tree.tree.Tree(root_node)

def _convert_xgboost_node_to_tfdf_node(
    jsnode: str,
) -> tfdf.py_tree.node.AbstractNode:
  """Converts a node within a xgboost tree into a TFDF node."""

  if 'children' in jsnode:
      feature = tfdf.py_tree.dataspec.SimpleColumnSpec(
              name = f"feature_{jsnode['split']}",
              type = tfdf.py_tree.dataspec.ColumnType.NUMERICAL,
              col_idx = jsnode['split'],)
      neg_child = _convert_xgboost_node_to_tfdf_node(
              jsnode['children'][1],
              )
      pos_child = _convert_xgboost_node_to_tfdf_node(
              jsnode['children'][0],
              )
      return tfdf.py_tree.node.NonLeafNode(
              condition = tfdf.py_tree.condition.NumericalHigherThanCondition(
                  feature = feature,
                  threshold = jsnode['split_condition'],
                  missing_evaluation = False,
                  ),
              pos_child = pos_child,
              neg_child = neg_child,
              )
  else:
      return tfdf.py_tree.node.LeafNode(value = jsnode['leaf'])

jiazou-bigdata commented 2 years ago

The fix below seems to work now:

feature = tfdf.py_tree.dataspec.SimpleColumnSpec(
              name = f"feature_{jsnode['split'][1:]}",
              type = tfdf.py_tree.dataspec.ColumnType.NUMERICAL,
              col_idx = jsnode['split'],)
def _convert_xgboost_node_to_tfdf_node(
    jsnode: str,
) -> tfdf.py_tree.node.AbstractNode:
  """Converts a node within a xgboost tree into a TFDF node."""

  if 'children' in jsnode:
      feature = tfdf.py_tree.dataspec.SimpleColumnSpec(
              name = f"feature_{jsnode['split'][1:]}",
              type = tfdf.py_tree.dataspec.ColumnType.NUMERICAL,
              col_idx = jsnode['split'],)
      neg_child = _convert_xgboost_node_to_tfdf_node(
              jsnode['children'][1],
              )
      pos_child = _convert_xgboost_node_to_tfdf_node(
              jsnode['children'][0],
              )
      return tfdf.py_tree.node.NonLeafNode(
              condition = tfdf.py_tree.condition.NumericalHigherThanCondition(
                  feature = feature,
                  threshold = jsnode['split_condition'],
                  missing_evaluation = False,
                  ),
              pos_child = pos_child,
              neg_child = neg_child,
              )
  else:
      target_value = jsnode['leaf']
      node_value = tfdf.py_tree.value.RegressionValue(target_value)
      return tfdf.py_tree.node.LeafNode(value = node_value)

rstz commented 2 years ago

Thank you very much for keeping us updated!

If there are any details you can share about your project / findings (either here or through email), we'd be happy to learn more.

janpfeifer commented 2 years ago

Thanks for posting your solution, @jiazou-bigdata!

Would you be interested in contributing it as another converter in TF Decision Forests?

Also, let us know if you need anything from the Yggdrasil Decision Forests library. When possible we collaborate with research teams (we have both problems and sometimes solutions that may be research worthy), and the library, while production ready, was made to be relatively easily extended, to facilitate research and experimentation.

jiazou-bigdata commented 2 years ago

@rstz and @janpfeifer

I ran more tests and found that the test accuracy dropped with the converted XGBoost model trained on Higgs. I noticed that the XGBoost model uses Branch_LT (less than) mode for node conditions, whereas the TF-DF model uses tfdf.py_tree.condition.NumericalHigherThanCondition. I am looking into this, and any comments or suggestions are highly appreciated. I am interested in contributing this as a converter if this issue can be resolved.

We are mainly interested in the performance of the QuickScorer algorithm for XGBoost model inference. We think that TF-DF may be something we can leverage. By the way, while figuring out TF-DF, we made some observations on the performance of TFDF regarding CopyVerticalDatasetToAbstractExampleSet, multi-threading, etc. We will be happy to summarize and share these findings with the development team later.

jiazou-bigdata commented 2 years ago

@rstz and @janpfeifer

I want to report progress here. The accuracy is now consistent before and after the conversion. The trick is to swap the true and false nodes, so that the true/false node in the XGBoost model becomes the false/true node in TFDF.
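
To make the swap concrete, here is a minimal pure-Python sketch. It assumes that XGBoost sends an example to the "yes" child when feature < split_condition, and that a "higher than" condition sends an example to the positive child when feature >= threshold. The dictionary layout of the converted tree and all function names are hypothetical, and missing values are ignored.

```python
def predict_xgboost(node, x):
    """Follows the XGBoost rule: take the "yes" child when value < threshold."""
    while "leaf" not in node:
        by_id = {child["nodeid"]: child for child in node["children"]}
        value = x[int(node["split"][1:])]
        took_yes = value < node["split_condition"]
        node = by_id[node["yes"] if took_yes else node["no"]]
    return node["leaf"]


def to_higher_than(node):
    """Rewrites a dumped node for a ">= threshold" condition."""
    if "leaf" in node:
        return {"leaf": node["leaf"]}
    by_id = {child["nodeid"]: child for child in node["children"]}
    return {
        "feature": int(node["split"][1:]),
        "threshold": node["split_condition"],
        # The XGBoost "no" branch (value >= threshold) becomes the positive child.
        "pos_child": to_higher_than(by_id[node["no"]]),
        # The XGBoost "yes" branch (value < threshold) becomes the negative child.
        "neg_child": to_higher_than(by_id[node["yes"]]),
    }


def predict_higher_than(node, x):
    """Evaluates the rewritten tree: positive child when value >= threshold."""
    while "leaf" not in node:
        go_positive = x[node["feature"]] >= node["threshold"]
        node = node["pos_child"] if go_positive else node["neg_child"]
    return node["leaf"]


# Hand-written sample tree; all values are made up.
xgb_tree = {
    "nodeid": 0, "split": "f1", "split_condition": 2.0,
    "yes": 1, "no": 2, "missing": 1,
    "children": [{"nodeid": 1, "leaf": -0.5}, {"nodeid": 2, "leaf": 0.7}],
}
converted = to_higher_than(xgb_tree)
examples = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
for x in examples:
    assert predict_xgboost(xgb_tree, x) == predict_higher_than(converted, x)
```

Looking children up by the "yes" and "no" node ids, rather than by position in the "children" list, avoids depending on the order in which the dump lists them.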

The complete conversion code is pasted below. Please let me know if there are any issues, and what the process is for contributing the code, if possible. Thanks.

"""Utilities for converting XGBoost models into TensorFlow models."""

import contextlib
import json
import numpy as np
import os
import tempfile
from typing import Any, Dict, List, Optional, TypeVar, Union

from xgboost import XGBClassifier
import tensorflow as tf
import tensorflow_decision_forests as tfdf

def get_trees(xgboost_model):
    booster = xgboost_model.get_booster()
    # The json format was available in October 2017.
    # XGBoost 0.7 was the first version released with it.
    js_tree_list = booster.get_dump(with_stats=True, dump_format='json')
    js_trees = [json.loads(s) for s in js_tree_list]
    return js_trees

def convert(
    xgboost_model: XGBClassifier,
    intermediate_write_path: Optional[os.PathLike] = None,
) -> tf.keras.Model:
  """Converts a tree-based XGBoost model to a TensorFlow model.

  Args:
    xgboost_model: the XGBoost model to be converted.
    intermediate_write_path: path to a directory. As part of the conversion
      process, a TFDF model is written to disk. If intermediate_write_path is
      specified, the TFDF model is written to this directory. Otherwise, a
      temporary directory is created that is immediately removed after this
      function executes. Note that in order to save the converted model and
      load it again later, this argument must be provided.

  Returns:
    a keras Model that emulates the provided XGBoost model.
  """
  if not intermediate_write_path:
    # No intermediate directory was provided, so this creates one using the
    # TemporaryDirectory context manager, which handles teardown.
    intermediate_write_directory = tempfile.TemporaryDirectory()
    path = intermediate_write_directory.name
  else:
    # Uses the provided write path, and creates a null context manager as a
    # stand-in for TemporaryDirectory.
    intermediate_write_directory = contextlib.nullcontext()
    path = intermediate_write_path
  with intermediate_write_directory:
    tfdf_model = build_tfdf_model(xgboost_model, path)
  # The resultant tfdf model only receives the features that are used
  # to split samples in nodes in the trees as input. But we want to pass the
  # full design matrix as an input to match the scikit-learn API, thus we
  # create another tf.keras.Model with the desired call signature.
  template_input = tf.keras.Input(shape=(xgboost_model.n_features_in_,))
  # Extracts the indices of the features that are used by the TFDF model.
  # The features have names with the format "feature_<index-of-feature>".
  feature_names = tfdf_model.signatures[
      "serving_default"].structured_input_signature[1].keys()
  template_output = tfdf_model(
      {i: template_input[:, int(i.split("_")[1])] for i in feature_names})
  return tf.keras.Model(inputs=template_input, outputs=template_output)

def build_tfdf_model(
    xgboost_model: XGBClassifier,
    path: os.PathLike,
) -> tf.keras.Model:
  """Converts a XGBoost model into a TFDF model."""
  bias = 0.0

  gbt_builder = tfdf.builder.GradientBoostedTreeBuilder(
      path=path,
      objective=tfdf.py_tree.objective.ClassificationObjective(label="label",
          classes=[str(c) for c in xgboost_model.classes_],
      ),
      bias=bias,
  )

  js_trees = get_trees(xgboost_model)

  params = xgboost_model.get_xgb_params()

  print("XGBOOST_MODEL LEARNING RATE:")
  print(xgboost_model.learning_rate)

  for jstree in js_trees:
    gbt_builder.add_tree(convert_xgboost_tree_to_tfdf_pytree(
        jstree,
        xgboost_model.learning_rate,
    ))
  gbt_builder.close()
  return tf.keras.models.load_model(path)

def convert_xgboost_tree_to_tfdf_pytree(
    xgboost_tree: str,
    weight: Optional[float] =None,
) -> tfdf.py_tree.tree.Tree:
  """Converts a XGBoost decision tree into a TFDF pytree.

  Args:
    xgboost_tree: a XGBoost decision tree in JSON format.

  Returns:
    a TFDF pytree that has the same structure as the xgboost tree.
  """

  root_node = _convert_xgboost_node_to_tfdf_node(
      xgboost_tree,
      weight,
  )
  return tfdf.py_tree.tree.Tree(root_node)

def _convert_xgboost_node_to_tfdf_node(
    jsnode: str,
    weight: Optional[float] =None,
) -> tfdf.py_tree.node.AbstractNode:
  """Converts a node within a xgboost tree into a TFDF node."""

  if 'children' in jsnode:
      feature = tfdf.py_tree.dataspec.SimpleColumnSpec(
              name = f"feature_{jsnode['split'][1:]}",
              type = tfdf.py_tree.dataspec.ColumnType.NUMERICAL,
              col_idx = jsnode['split'],)
      neg_child = _convert_xgboost_node_to_tfdf_node(
              jsnode['children'][0],
              )
      pos_child = _convert_xgboost_node_to_tfdf_node(
              jsnode['children'][1],
              )
      return tfdf.py_tree.node.NonLeafNode(
              condition = tfdf.py_tree.condition.NumericalHigherThanCondition(
                  feature = feature,
                  threshold = jsnode['split_condition'],
                  missing_evaluation = False,
                  ),
              pos_child = pos_child,
              neg_child = neg_child,
              )
  else:
      target_value = jsnode['leaf']
      scale_factor = 1.0
      if weight:
          scale_factor = weight
      node_value = tfdf.py_tree.value.RegressionValue(target_value*scale_factor)
      return tfdf.py_tree.node.LeafNode(value = node_value)
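
One way to sanity-check a converted model is to recompute a prediction by hand from the dumped leaf values. The sketch below assumes a binary:logistic objective, where the raw margin is logit(base_score) plus the sum of the leaf values reached in each tree, followed by a sigmoid; with base_score = 0.5 the offset is zero, which is consistent with bias = 0.0 in the code above. The leaf values are made up and the function names are hypothetical. To my understanding the dumped leaf values already reflect the learning rate, so comparing hand-computed, original, and converted predictions on a few examples is worthwhile.

```python
import math


def sigmoid(z):
    """Maps a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of sigmoid; maps a probability to a raw margin."""
    return math.log(p / (1.0 - p))


def predict_proba(leaf_values, base_score=0.5):
    """Positive-class probability from the leaf reached in each tree."""
    margin = logit(base_score) + sum(leaf_values)
    return sigmoid(margin)


# Made-up leaf values, one per tree, for a single example.
probability = predict_proba([0.3, -0.1, 0.2])
print(probability)
```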

rstz commented 2 years ago

@jiazou-bigdata Thank you very much, this looks great and we would love to add this converter to TF-DF.

The easiest way to officially contribute this is to put the above code into tensorflow_decision_forests/contrib/xgboost_model_converter/xgboost_model_converter.py and open a PR with this file. A bot will then prompt you to sign the Contributor License Agreement, see https://github.com/google/yggdrasil-decision-forests/blob/main/CONTRIBUTING.md. You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project.

Please let us know if there is any issue with this process.

jiazou-bigdata commented 2 years ago

@rstz Thank you so much for the guidance. Will do!

jiazou-bigdata commented 1 year ago

@achoum @rstz @janpfeifer,

Hi, I hope this message finds you well.

Recently we finished a benchmark study of more than ten decision forest inference platforms on eight diverse datasets, using different numbers of trees (10, 500, 1600) and different forest algorithms (RandomForest, XGBoost, LightGBM), on CPU and GPU. You can find the study here.

We also included TFDF and Yggdrasil in the single-thread comparison (Table 10 in the paper). It showed that the single-thread performance of Yggdrasil (using the QuickScorer algorithm) outperformed most of the other platforms!

(We didn't include TFDF and Yggdrasil in the multi-threading comparisons because, as far as we can tell, there is currently no knob to fully parallelize the inference process in Yggdrasil, which would be needed for a fair comparison.)

Please do not hesitate to let us know if you have any questions or suggestions for our study.

Thank you!

janpfeifer commented 1 year ago

Hi @jiazou-bigdata, thanks for sharing the results. The paper is super interesting, and I'm looking forward to learning more about netsDB -- as I understand it, it's a great thing when computation moves closer to the data whenever possible.

About multi-threading: implementing model parallelism is a long-standing TODO for us -- data parallelism is more easily achieved by the client, since our library is re-entrant.

This is in part because our clients are either (1) very latency conscious -- in which case they usually use small models (<100 trees, usually inference < 1 microsecond) and the overhead of parallelism may not pay off(?) -- or (2) less concerned with latency and more interested in throughput, in which case data parallelism (separate threads doing inference on separate batches of examples) is simpler and efficient.
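
Client-side data parallelism of the kind described here can be sketched as follows. This is only an illustration: predict_batch is a hypothetical stand-in for a re-entrant inference call, not an Yggdrasil or TF-DF API, and the batch size and thread count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_batch(batch):
    """Hypothetical stand-in for a re-entrant model's batch inference call."""
    return [sum(example) for example in batch]


def predict_data_parallel(examples, batch_size, num_threads):
    """Splits examples into batches and scores each batch on a worker thread."""
    batches = [examples[i:i + batch_size]
               for i in range(0, len(examples), batch_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves batch order, so predictions line up with examples.
        results = pool.map(predict_batch, batches)
        return [p for batch_result in results for p in batch_result]


examples = [[float(i), 1.0] for i in range(10)]
predictions = predict_data_parallel(examples, batch_size=3, num_threads=4)
print(predictions)
```

With a pure-Python stand-in, threads give no speed-up because of the interpreter lock; the pattern pays off only when the underlying inference call runs in native code that can execute concurrently.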

GPU is also something we have run some experiments with ... we haven't invested in it yet because we haven't seen the need from clients. I'm assuming netsDB would run on servers with GPUs and would benefit from it, is that correct?

Btw, one suggestion for the study: what about an additional table on throughput (as opposed to latency)? Especially for the larger models ... Given fixed hardware, how many inferences/second can one get through ... For our use cases this is often more relevant (and the same may be true for many use cases of netsDB, no?)
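
A throughput measurement along these lines could be sketched as below. The predict_batch function is again a hypothetical stand-in for the model under test, and the batch size and number of runs are arbitrary choices; a real benchmark would also pin the hardware and repeat the measurement.

```python
import time


def predict_batch(batch):
    """Hypothetical stand-in for the model's batch inference call."""
    return [sum(example) for example in batch]


def measure_throughput(predict, batch, num_runs=20):
    """Returns examples scored per second, averaged over num_runs calls."""
    predict(batch)  # Warm-up run, excluded from timing.
    start = time.perf_counter()
    for _ in range(num_runs):
        predict(batch)
    elapsed = time.perf_counter() - start
    return num_runs * len(batch) / elapsed


batch = [[float(i), 1.0] for i in range(1000)]
throughput = measure_throughput(predict_batch, batch)
print(f"{throughput:.0f} examples/second")
```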

cheers, and again thanks for sharing! This motivates us to improve it even further.

ps.: And btw, let us know if you find special cases that could be really important (help lots of folks); that would guide our optimization efforts.