google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Simple Model problem #22

Closed Arnold1 closed 6 months ago

Arnold1 commented 2 years ago

Hi,

I have the following example. How do I rewrite the code so I can use the model's Predict for a given input? There is no example in the docs...

#include "yggdrasil_decision_forests/dataset/data_spec.h"
#include "yggdrasil_decision_forests/dataset/data_spec.pb.h"
#include "yggdrasil_decision_forests/dataset/data_spec_inference.h"
#include "yggdrasil_decision_forests/dataset/vertical_dataset_io.h"
#include "yggdrasil_decision_forests/learner/learner_library.h"
#include "yggdrasil_decision_forests/metric/metric.h"
#include "yggdrasil_decision_forests/metric/report.h"
#include "yggdrasil_decision_forests/model/model_library.h"
#include "yggdrasil_decision_forests/utils/filesystem.h"
#include "yggdrasil_decision_forests/utils/logging.h"
#include "yggdrasil_decision_forests/serving/decision_forest/decision_forest.h"
#include <chrono>

namespace ygg = yggdrasil_decision_forests;

int main(int argc, char** argv) {
  // Enable the logging. Optional in most cases.
  InitLogging(argv[0], &argc, &argv, true);

  // Import the model.
  LOG(INFO) << "Import the model";
  const std::string model_path = "/tmp/my_saved_model/1/assets";
  std::unique_ptr<ygg::model::AbstractModel> model;
  QCHECK_OK(ygg::model::LoadModel(model_path, &model));

  // Show information about the model.
  // Like :show_model, but without the list of compatible engines.
  std::string model_description;
  model->AppendDescriptionAndStatistics(/*full_definition=*/false,
                                        &model_description);
  LOG(INFO) << "Model:\n" << model_description;

  auto start = std::chrono::high_resolution_clock::now();

  // Compile the model for fast inference.
  const std::unique_ptr<ygg::serving::FastEngine> serving_engine =
      model->BuildFastEngine().value();
  const auto& features = serving_engine->features();

  // Handle to two features.
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature =
      features.GetCategoricalFeatureId("sex").value();

  // Allocate a batch of one example.
  std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
      serving_engine->AllocateExamples(1);

  // Set all the values as missing. This is only necessary if you don't set all
  // the feature values manually e.g. SetNumerical.
  //examples->FillMissing(features);

  // Set the values of "age" and "sex" for the first example.
  examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
                           features);

  // Run the prediction on the first example.
  std::vector<float> batch_of_predictions;
  serving_engine->Predict(*examples, 1, &batch_of_predictions);

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration =
      std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

  // duration.count() returns the elapsed time in milliseconds.
  LOG(INFO) << duration.count();

  LOG(INFO) << "Predictions:";
  for (const float prediction : batch_of_predictions) {
    LOG(INFO) << "\t" << prediction;
  }

  return 0;
}

Output:

[INFO beginner4.cc:31] Import the model
[INFO beginner4.cc:41] Model:
Type: "RANDOM_FOREST"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (2):
    age
    sex

No weights

Variable Importance: MEAN_MIN_DEPTH:
    1. "__LABEL"  8.480250 ################
    2.     "sex"  1.313142 ##
    3.     "age"  0.000000 

Variable Importance: NUM_AS_ROOT:
    1. "age" 300.000000 

Variable Importance: NUM_NODES:
    1. "age" 34255.000000 ################
    2. "sex" 1584.000000 

Variable Importance: SUM_SCORE:
    1. "age" 516361.208020 ################
    2. "sex" 148174.885377 

Winner take all: true
Out-of-bag evaluation: accuracy:0.756613 logloss:5.68258
Number of trees: 300
Total number of nodes: 71978

Number of nodes by tree:
Count: 300 Average: 239.927 StdDev: 4.94044
Min: 223 Max: 251 Ignored: 0
----------------------------------------------
[ 223, 224)  1   0.33%   0.33%
[ 224, 225)  0   0.00%   0.33%
[ 225, 227)  1   0.33%   0.67%
[ 227, 228)  3   1.00%   1.67% #
[ 228, 230)  3   1.00%   2.67% #
[ 230, 231)  0   0.00%   2.67%
[ 231, 233)  9   3.00%   5.67% ##
[ 233, 234) 17   5.67%  11.33% ###
[ 234, 236) 33  11.00%  22.33% ######
[ 236, 237)  0   0.00%  22.33%
[ 237, 238) 28   9.33%  31.67% #####
[ 238, 240) 51  17.00%  48.67% ##########
[ 240, 241)  0   0.00%  48.67%
[ 241, 243) 46  15.33%  64.00% #########
[ 243, 244) 47  15.67%  79.67% #########
[ 244, 246) 34  11.33%  91.00% #######
[ 246, 247)  0   0.00%  91.00%
[ 247, 249) 13   4.33%  95.33% ###
[ 249, 250) 10   3.33%  98.67% ##
[ 250, 251]  4   1.33% 100.00% #

Depth by leafs:
Count: 36139 Average: 8.48037 StdDev: 2.34049
Min: 3 Max: 15 Ignored: 0
----------------------------------------------
[  3,  4)   70   0.19%   0.19%
[  4,  5)  662   1.83%   2.03% #
[  5,  6) 3633  10.05%  12.08% ######
[  6,  7) 3980  11.01%  23.09% #######
[  7,  8) 4520  12.51%  35.60% ########
[  8,  9) 5143  14.23%  49.83% #########
[  9, 10) 5817  16.10%  65.93% ##########
[ 10, 11) 5152  14.26%  80.18% #########
[ 11, 12) 3519   9.74%  89.92% ######
[ 12, 13) 2003   5.54%  95.46% ###
[ 13, 14)  975   2.70%  98.16% ##
[ 14, 15)  483   1.34%  99.50% #
[ 15, 15]  182   0.50% 100.00%

Number of training obs by leaf:
Count: 36139 Average: 190.182 StdDev: 179.903
Min: 5 Max: 2370 Ignored: 0
----------------------------------------------
[    5,  123) 14849  41.09%  41.09% ##########
[  123,  241) 10680  29.55%  70.64% #######
[  241,  359)  3917  10.84%  81.48% ###
[  359,  478)  6234  17.25%  98.73% ####
[  478,  596)   137   0.38%  99.11%
[  596,  714)     7   0.02%  99.13%
[  714,  833)    17   0.05%  99.18%
[  833,  951)    19   0.05%  99.23%
[  951, 1069)     6   0.02%  99.24%
[ 1069, 1188)   135   0.37%  99.62%
[ 1188, 1306)    33   0.09%  99.71%
[ 1306, 1424)     0   0.00%  99.71%
[ 1424, 1542)     0   0.00%  99.71%
[ 1542, 1661)     8   0.02%  99.73%
[ 1661, 1779)    53   0.15%  99.88%
[ 1779, 1897)     6   0.02%  99.89%
[ 1897, 2016)     0   0.00%  99.89%
[ 2016, 2134)     2   0.01%  99.90%
[ 2134, 2252)    28   0.08%  99.98%
[ 2252, 2370]     8   0.02% 100.00%

Attribute in nodes:
    34255 : age [NUMERICAL]
    1584 : sex [CATEGORICAL]

Attribute in nodes with depth <= 0:
    300 : age [NUMERICAL]

Attribute in nodes with depth <= 1:
    600 : age [NUMERICAL]
    300 : sex [CATEGORICAL]

Attribute in nodes with depth <= 2:
    1721 : age [NUMERICAL]
    379 : sex [CATEGORICAL]

Attribute in nodes with depth <= 3:
    3328 : age [NUMERICAL]
    1102 : sex [CATEGORICAL]

Attribute in nodes with depth <= 5:
    11208 : age [NUMERICAL]
    1583 : sex [CATEGORICAL]

Condition type in nodes:
    34255 : HigherCondition
    1584 : ContainsBitmapCondition
Condition type in nodes with depth <= 0:
    300 : HigherCondition
Condition type in nodes with depth <= 1:
    600 : HigherCondition
    300 : ContainsBitmapCondition
Condition type in nodes with depth <= 2:
    1721 : HigherCondition
    379 : ContainsBitmapCondition
Condition type in nodes with depth <= 3:
    3328 : HigherCondition
    1102 : ContainsBitmapCondition
Condition type in nodes with depth <= 5:
    11208 : HigherCondition
    1583 : ContainsBitmapCondition
Node format: BLOB_SEQUENCE

Training OOB:
    trees: 1, Out-of-bag evaluation: accuracy:0.750237 logloss:9.00239
    trees: 13, Out-of-bag evaluation: accuracy:0.754722 logloss:7.09704
    trees: 23, Out-of-bag evaluation: accuracy:0.753863 logloss:6.3117
    trees: 33, Out-of-bag evaluation: accuracy:0.75395 logloss:6.19856
    trees: 43, Out-of-bag evaluation: accuracy:0.754299 logloss:6.0429
    trees: 53, Out-of-bag evaluation: accuracy:0.753165 logloss:5.9747
    trees: 63, Out-of-bag evaluation: accuracy:0.754867 logloss:5.96594
    trees: 73, Out-of-bag evaluation: accuracy:0.75443 logloss:5.92934
    trees: 83, Out-of-bag evaluation: accuracy:0.754954 logloss:5.92356
    trees: 93, Out-of-bag evaluation: accuracy:0.756307 logloss:5.89293
    trees: 103, Out-of-bag evaluation: accuracy:0.756569 logloss:5.89233
    trees: 113, Out-of-bag evaluation: accuracy:0.755696 logloss:5.81216
    trees: 123, Out-of-bag evaluation: accuracy:0.755653 logloss:5.81
    trees: 133, Out-of-bag evaluation: accuracy:0.755347 logloss:5.80588
    trees: 143, Out-of-bag evaluation: accuracy:0.755914 logloss:5.77847
    trees: 153, Out-of-bag evaluation: accuracy:0.755783 logloss:5.77828
    trees: 163, Out-of-bag evaluation: accuracy:0.755522 logloss:5.74301
    trees: 173, Out-of-bag evaluation: accuracy:0.756264 logloss:5.73794
    trees: 183, Out-of-bag evaluation: accuracy:0.756176 logloss:5.7384
    trees: 193, Out-of-bag evaluation: accuracy:0.756831 logloss:5.73852
    trees: 203, Out-of-bag evaluation: accuracy:0.756613 logloss:5.73565
    trees: 213, Out-of-bag evaluation: accuracy:0.757268 logloss:5.73545
    trees: 223, Out-of-bag evaluation: accuracy:0.757486 logloss:5.7286
    trees: 233, Out-of-bag evaluation: accuracy:0.757093 logloss:5.72556
    trees: 243, Out-of-bag evaluation: accuracy:0.757093 logloss:5.7087
    trees: 253, Out-of-bag evaluation: accuracy:0.757224 logloss:5.70704
    trees: 263, Out-of-bag evaluation: accuracy:0.757006 logloss:5.69758
    trees: 273, Out-of-bag evaluation: accuracy:0.756831 logloss:5.69816
    trees: 283, Out-of-bag evaluation: accuracy:0.756526 logloss:5.69813
    trees: 293, Out-of-bag evaluation: accuracy:0.756526 logloss:5.68217
    trees: 300, Out-of-bag evaluation: accuracy:0.756613 logloss:5.68258

[INFO decision_forest.cc:639] Model loaded with 300 root(s), 71978 node(s), and 2 input feature(s).
[INFO abstract_model.cc:1158] Engine "RandomForestOptPred" built
[INFO beginner4.cc:79] 18
[INFO beginner4.cc:81] Predictions:
[INFO beginner4.cc:83]  0.649999

Let's say I want to continuously read the inference input data from stdin - I only need to load the model and initialize the serving_engine once?

Every time stdin has new data, do I just need to run the following code to get a prediction? Does anything change when I switch to HTTP inference - anything to consider regarding thread safety? Can I share the model and serving_engine across multiple threads?

Is there anything I should change when I don't need batching? Do I still need to use std::unique_ptr<ygg::serving::AbstractExampleSet> examples = serving_engine->AllocateExamples(1)?

  // Handle to two features.
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature =
      features.GetCategoricalFeatureId("sex").value();

  // Allocate a batch of one example.
  std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
      serving_engine->AllocateExamples(1);

  // Set all the values as missing. This is only necessary if you don't set all
  // the feature values manually e.g. SetNumerical.
  //examples->FillMissing(features);

  // Set the values of "age" and "sex" for the first example.
  examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
                           features);

  // Run the prediction on the first example.
  std::vector<float> batch_of_predictions;
  serving_engine->Predict(*examples, 1, &batch_of_predictions);

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration =
      std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

  // duration.count() returns the elapsed time in milliseconds.
  LOG(INFO) << duration.count();

  LOG(INFO) << "Predictions:";
  for (const float prediction : batch_of_predictions) {
    LOG(INFO) << "\t" << prediction;
  }

I also tried to use the C API here, but I always get 1.0 as the result tensor. Any idea why?

// gcc -I/usr/local/include -L/usr/local/lib main.c -ltensorflow -o main
#include <stdio.h>
#include <tensorflow/c/c_api.h>

void NoOpDeallocator(void* data, size_t a, void* b) {}

int main() {
  TF_Graph *Graph = TF_NewGraph();
  TF_Status *Status = TF_NewStatus();
  TF_SessionOptions *SessionOpts = TF_NewSessionOptions();
  TF_Buffer *RunOpts = NULL;
  TF_Library *library;

  library = TF_LoadLibrary("/usr/local/lib/python3.10/dist-packages/tensorflow_decision_forests/tensorflow/ops/inference/inference.so",
                              Status);

  const char *saved_model_dir = "/tmp/my_saved_model/1/";
  const char *tags = "serve";
  int ntags = 1;

  TF_Session *Session = TF_LoadSessionFromSavedModel(
      SessionOpts, RunOpts, saved_model_dir, &tags, ntags, Graph, NULL, Status);

  printf("status: %s\n", TF_Message(Status));

  if(TF_GetCode(Status) == TF_OK) {
    printf("loaded\n");
  }else{
    printf("not loaded\n");
  }

  /* Get Input Tensor */
  int NumInputs = 2;

  TF_Output* Input = malloc(sizeof(TF_Output) * NumInputs);
  TF_Output t0 = {TF_GraphOperationByName(Graph, "serving_default_age"), 0};

  if(t0.oper == NULL)
    printf("ERROR: Failed TF_GraphOperationByName serving_default_input_1\n");
  else
    printf("TF_GraphOperationByName serving_default_input_1 is OK\n");

  TF_Output t1 = {TF_GraphOperationByName(Graph, "serving_default_sex"), 0};

  if(t1.oper == NULL)
    printf("ERROR: Failed TF_GraphOperationByName serving_default_input_2\n");
  else
    printf("TF_GraphOperationByName serving_default_input_2 is OK\n");

  Input[0] = t0;
  Input[1] = t1;

  // Get Output tensor
  int NumOutputs = 1;
  TF_Output* Output = malloc(sizeof(TF_Output) * NumOutputs);
  TF_Output tout = {TF_GraphOperationByName(Graph, "StatefulPartitionedCall_1"), 0};

  if(tout.oper == NULL)
      printf("ERROR: Failed TF_GraphOperationByName StatefulPartitionedCall\n");
  else
    printf("TF_GraphOperationByName StatefulPartitionedCall is OK\n");

  Output[0] = tout;

  /* Allocate data for inputs and outputs */
  TF_Tensor** InputValues  = (TF_Tensor**)malloc(sizeof(TF_Tensor*)*NumInputs);
  TF_Tensor** OutputValues = (TF_Tensor**)malloc(sizeof(TF_Tensor*)*NumOutputs);

  int ndims = 1;
  int64_t dims[] = {1};
  int64_t data[] = {50};

  int ndata = sizeof(int64_t);
  TF_Tensor* int_tensor0 = TF_NewTensor(TF_INT64, dims, ndims, data, ndata, &NoOpDeallocator, 0);

  if (int_tensor0 != NULL)
    printf("TF_NewTensor is OK\n");
  else
    printf("ERROR: Failed TF_NewTensor\n");

  const char test_string[] = "Male";
  TF_TString tstr[1];
  TF_TString_Init(&tstr[0]);
  TF_TString_Copy(&tstr[0], test_string, sizeof(test_string)-1);
  TF_Tensor* int_tensor1 = TF_NewTensor(TF_STRING, NULL, 0, &tstr[0], sizeof(tstr), &NoOpDeallocator, 0);

  if (int_tensor1 != NULL)
    printf("TF_NewTensor is OK\n");
  else
    printf("ERROR: Failed TF_NewTensor\n");

  InputValues[0] = int_tensor0;
  InputValues[1] = int_tensor1;

  // Run the Session
  TF_SessionRun(Session,
                NULL, // Run options.
                Input, InputValues, NumInputs, // Input tensors name, input tensor values, number of inputs.
                Output, OutputValues, NumOutputs, // Output tensors name, output tensor values, number of outputs.
                NULL, 0, // Target operations, number of targets.
                NULL,
                Status); // Output status.

  if(TF_GetCode(Status) == TF_OK)
    printf("Session is OK\n");
  else
    printf("%s",TF_Message(Status));

  // Free memory
  TF_DeleteGraph(Graph);
  TF_DeleteSession(Session, Status);
  TF_DeleteSessionOptions(SessionOpts);
  TF_DeleteStatus(Status);

  /* Get Output Result */
  void* buff = TF_TensorData(OutputValues[0]);
  float* offsets = (float*)buff;
  printf("Result Tensor :\n");
  printf("%f\n",offsets[0]);
  return 0;
}

Output:

# gcc -I/usr/local/include -L/usr/local/lib main.c -ltensorflow -o main; ./main
2022-06-16 21:05:24.070748: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/my_saved_model/1/
2022-06-16 21:05:24.072148: I tensorflow/cc/saved_model/reader.cc:81] Reading meta graph with tags { serve }
2022-06-16 21:05:24.072208: I tensorflow/cc/saved_model/reader.cc:122] Reading SavedModel debug info (if present) from: /tmp/my_saved_model/1/
2022-06-16 21:05:24.072280: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-16 21:05:24.085806: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-06-16 21:05:24.086985: I tensorflow/cc/saved_model/loader.cc:228] Restoring SavedModel bundle.
2022-06-16 21:05:24.116570: I tensorflow/cc/saved_model/loader.cc:212] Running initialization op on SavedModel bundle at path: /tmp/my_saved_model/1/
[INFO kernel.cc:1176] Loading model from path /tmp/my_saved_model/1/assets/ with prefix cf8326335a66430a
[INFO decision_forest.cc:639] Model loaded with 300 root(s), 71978 node(s), and 2 input feature(s).
[INFO abstract_model.cc:1246] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine
2022-06-16 21:05:24.373058: I tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 302321 microseconds.
status: 
loaded
TF_GraphOperationByName serving_default_input_1 is OK
TF_GraphOperationByName serving_default_input_2 is OK
TF_GraphOperationByName StatefulPartitionedCall is OK
TF_NewTensor is OK
TF_NewTensor is OK
Session is OK
Result Tensor :
1.000000
achoum commented 2 years ago

Hi,

Let me answer the individual questions.

Let's say I want to continuously read the inference input data from stdin - I only need to load the model and initialize the serving_engine once?

Yes.
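For illustration, a rough sketch of that pattern (model loaded and engine built once at startup, then one prediction per stdin line, reusing the same calls as your snippet). The "age sex" line format and the paths are just placeholders:

#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

#include "yggdrasil_decision_forests/model/model_library.h"
#include "yggdrasil_decision_forests/serving/decision_forest/decision_forest.h"
#include "yggdrasil_decision_forests/utils/logging.h"

namespace ygg = yggdrasil_decision_forests;

int main() {
  // Done once at startup: load the model and build the engine.
  std::unique_ptr<ygg::model::AbstractModel> model;
  QCHECK_OK(ygg::model::LoadModel("/tmp/my_saved_model/1/assets", &model));
  const auto engine = model->BuildFastEngine().value();
  const auto& features = engine->features();
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature = features.GetCategoricalFeatureId("sex").value();

  // Reused for every prediction.
  auto examples = engine->AllocateExamples(1);
  std::vector<float> predictions;

  // One prediction per stdin line, e.g. "50 Male".
  std::string line;
  while (std::getline(std::cin, line)) {
    float age;
    std::string sex;
    std::istringstream iss(line);
    iss >> age >> sex;

    examples->FillMissing(features);  // Reset the values of the previous example.
    examples->SetNumerical(/*example_idx=*/0, age_feature, age, features);
    examples->SetCategorical(/*example_idx=*/0, sex_feature, sex, features);
    engine->Predict(*examples, /*num_examples=*/1, &predictions);
    std::cout << predictions[0] << std::endl;
  }
  return 0;
}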

anything to consider regarding thread safety

The inference engines are thread-safe. However, the example buffers (the result of AllocateExamples) are not. You need to create one of them for each thread.

For example, here is the multi-threaded code used to evaluate a model.

In your case (inference from a stream of data), I would create a StreamProcessor or logic based on Channels.
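As an illustration of the sharing pattern (a sketch reusing the feature names from your example, not the evaluation code linked above):

#include <thread>
#include <vector>

#include "yggdrasil_decision_forests/serving/decision_forest/decision_forest.h"

namespace ygg = yggdrasil_decision_forests;

// The engine is shared (read-only) by all threads; each thread owns its
// example buffer since the result of AllocateExamples is not thread-safe.
void PredictInThread(const ygg::serving::FastEngine& engine) {
  const auto& features = engine.features();
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature = features.GetCategoricalFeatureId("sex").value();

  auto examples = engine.AllocateExamples(1);  // Per-thread buffer.
  std::vector<float> predictions;
  examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male", features);
  engine.Predict(*examples, /*num_examples=*/1, &predictions);
}

// Usage: the same engine instance is passed to all the threads.
// std::vector<std::thread> workers;
// for (int i = 0; i < 4; ++i)
//   workers.emplace_back(PredictInThread, std::cref(*serving_engine));
// for (auto& w : workers) w.join();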

Can I share the model and serving_engine across multiple threads?

Yes.

Note that you don't need to keep the model once you have created the engine, i.e. you can release the model.
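For example (a small sketch reusing the variables from your snippet):

  std::unique_ptr<ygg::model::AbstractModel> model;
  QCHECK_OK(ygg::model::LoadModel(model_path, &model));
  const auto serving_engine = model->BuildFastEngine().value();
  model.reset();  // The engine is self-contained; the model memory can be freed here.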

Is there anything I should change when I don't need batching?

Make sure to call AllocateExamples with enough examples for your batch.
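For example, a sketch of a larger batch, reusing the feature handles (age_feature, sex_feature) and the serving_engine from your snippet; the values are only for illustration:

  // Predict a whole batch with a single Predict() call.
  constexpr int kBatchSize = 32;
  auto examples = serving_engine->AllocateExamples(kBatchSize);
  examples->FillMissing(features);
  for (int i = 0; i < kBatchSize; ++i) {
    examples->SetNumerical(/*example_idx=*/i, age_feature, 20.f + i, features);
    examples->SetCategorical(/*example_idx=*/i, sex_feature, "Male", features);
  }
  std::vector<float> predictions;
  serving_engine->Predict(*examples, kBatchSize, &predictions);
  // For this binary classifier, predictions holds one probability per example.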

Is there anything I should change when I don't need batching? Do I still need to use std::unique_ptr<ygg::serving::AbstractExampleSet> examples = serving_engine->AllocateExamples(1)?

This is correct. Note that you can reuse your examples across different predictions (instead of reallocating them every time).

I also tried to use the C API here, but I always get 1.0 as the result tensor. Any idea why?

The TF C API is a bit more complex. Let's try to figure out the solution with the YDF API, and come back to the TF API if this does not work.

Arnold1 commented 2 years ago

@achoum thanks for all the info - do you have any examples using StreamProcessor / Channels with the YDF C++ API?

I compared different ways to do model inference and checked the latencies:

For the simple model with 2 features - are those measurements expected? Yggdrasil Decision Forests is so much faster... does Yggdrasil's speed come from the fast engine, while TF Serving and the TF C API use a slow/generic engine?

TF-DF Python: ~ 50 ms
TF Serving: ~ 20 ms
TF C API: ~ 19 ms
Yggdrasil Decision Forests (YDF) C++: ~ 100 μs

The Yggdrasil code does not measure this part, because compiling the model for fast inference only needs to be done once per model:

  const std::unique_ptr<ygg::serving::FastEngine> serving_engine =
      model->BuildFastEngine().value();
  const auto& features = serving_engine->features();

Here is how I measure latency in Yggdrasil:

  // Compile the model for fast inference.
  const std::unique_ptr<ygg::serving::FastEngine> serving_engine =
      model->BuildFastEngine().value();
  const auto& features = serving_engine->features();

  auto start = std::chrono::high_resolution_clock::now();

  // Handle to two features.
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature =
      features.GetCategoricalFeatureId("sex").value();

  // Allocate a batch of one example.
  std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
      serving_engine->AllocateExamples(1);

  // Set all the values as missing. This is only necessary if you don't set all
  // the feature values manually e.g. SetNumerical.
  //examples->FillMissing(features);

  // Set the values of "age" and "sex" for the first example.
  examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
                           features);

  // Run the prediction on the first example.
  std::vector<float> batch_of_predictions;
  serving_engine->Predict(*examples, 1, &batch_of_predictions);

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration =
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start);

  // duration.count() returns the elapsed time in microseconds.
  LOG(INFO) << "duration: " << duration.count();

  LOG(INFO) << "Predictions:";
  for (const float prediction : batch_of_predictions) {
    LOG(INFO) << "\t" << prediction;
  }

I also checked the TF Serving code; it also uses tf.load_op_library(resource_loader.get_path_to_datafile("inference.so")).

Regarding the stream processor - can I use it as follows? Any HTTP server library you recommend? Is there a good way to load the fast_engine directly in the stream processor and only pass the modelPath to the stream processor? If the model has been used before and a fast_engine already exists, then there is no need to recreate that fast_engine for it.

std::unique_ptr<ygg::serving::FastEngine> initModelServingEngine(const std::string& modelPath) {
  // Import the model.
  LOG(INFO) << "Import the model";
  const std::string model_path = modelPath;
  std::unique_ptr<ygg::model::AbstractModel> model;
  QCHECK_OK(ygg::model::LoadModel(model_path, &model));

  // Show information about the model.
  // Like :show_model, but without the list of compatible engines.
  std::string model_description;
  model->AppendDescriptionAndStatistics(/*full_definition=*/false,
                                        &model_description);
  LOG(INFO) << "Model:\n" << model_description;

  // Compile the model for fast inference.
  std::unique_ptr<ygg::serving::FastEngine> serving_engine = model->BuildFastEngine().value();

  return serving_engine;
}

float predict(const std::unique_ptr<ygg::serving::FastEngine>& serving_engine, float age, const std::string& sex) {
  const auto& features = serving_engine->features();

  // Handle to two features.
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature = features.GetCategoricalFeatureId("sex").value();

  // Allocate a batch of one example.
  std::unique_ptr<ygg::serving::AbstractExampleSet> examples = serving_engine->AllocateExamples(1);

  // Set all the values as missing. This is only necessary if you don't set all
  // the feature values manually e.g. SetNumerical.
  //examples->FillMissing(features);

  // Set the values of "age" and "sex" for the first example.
  examples->SetNumerical(/*example_idx=*/0, age_feature, age, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, sex, features);

  // Run the prediction on the first example.
  std::vector<float> batch_of_predictions;
  serving_engine->Predict(*examples, 1, &batch_of_predictions);

  float prediction_res = 0.0;
  LOG(INFO) << "Predictions:";
  for (const float prediction : batch_of_predictions) {
    LOG(INFO) << "\t" << prediction;
    prediction_res = prediction;
  }

  return prediction_res;
}

struct features {
    float age;
    std::string sex;
    std::unique_ptr<ygg::serving::FastEngine> fast_engine;
};

int main() {
    // create 1 stream process, with multiple threads... in this case 5
    features input;
    input.age = 55.0f;
    input.sex = "Male";
    input.fast_engine = initModelServingEngine("/tmp/my_saved_model/1/assets"); // how to deal with multiple models - can I load them in StreamProcessor when they are needed?

    using Features = std::unique_ptr<features>;
    using Prediction = std::unique_ptr<float>;

    StreamProcessor<Features, Prediction> processor("MyPipe", 5,
                                            [](Features x) -> Prediction {
        return absl::make_unique<float>(predict(x->fast_engine, x->age, x->sex));
    });

    processor.StartWorkers();

    // when i receive a http request, i submit that data to the processor
    // the processor runs the inference and returns the data back
    // the returned data is available with processor.GetResult().value()
    processor.Submit(absl::make_unique<features>(std::move(input)));
    auto result = processor.GetResult().value();

    // create the http response with the result
    return 0;
}
achoum commented 2 years ago

thanks for all the info - do you have any examples using StreamProcessor / Channels with the YDF C++ API?

Happy to help :).

Here is an example.

Yggdrasil Decision Forests is so much faster... does Yggdrasil's speed come from the fast engine, while TF Serving and the TF C API use a slow/generic engine?

The inference of TF-DF models uses the Yggdrasil Decision Forests fast engine underneath. So, all 4 options are equivalent in this regard. However, as you observed, TensorFlow always adds a significant overhead to both training and inference. AFAIK the speed difference comes from both the memory management and the amount of executed logic.

If you want fast inference, you need to use Yggdrasil Decision Forests directly.

100µs is already a slow model for Yggdrasil Decision Forests. An inference time of ~1µs is common in my work (of course, it depends on the number of features and the complexity of the model). If you care about inference speed, make sure to train a Gradient Boosted Trees model (instead of a Random Forest).

The Yggdrasil code does not measure this part, because compiling the model for fast inference only needs to be done once per model:

Yes. The model compilation / creation of the inference engine should not be taken into account in the benchmark, as it is only done once. When you load a TF-DF model in TensorFlow, the engine is also created at loading time.

Here is how I measure latency in Yggdrasil:

Note that there is a model benchmark tool available here. You can use it, or draw inspiration from it.
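If you keep your own measurement, a more stable pattern is to warm up the engine and average over many calls. Here is a sketch; it reuses the serving_engine and examples variables from your snippet, and the iteration counts are arbitrary:

#include <chrono>
#include <vector>

#include "yggdrasil_decision_forests/serving/decision_forest/decision_forest.h"
#include "yggdrasil_decision_forests/utils/logging.h"

namespace ygg = yggdrasil_decision_forests;

// Averages the single-example latency over many calls, after a short warm-up.
// `examples` must already contain the feature values.
void BenchmarkSingleExample(const ygg::serving::FastEngine& engine,
                            const ygg::serving::AbstractExampleSet& examples) {
  std::vector<float> predictions;
  for (int i = 0; i < 100; ++i) {  // Warm-up.
    engine.Predict(examples, 1, &predictions);
  }
  constexpr int kRuns = 10000;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kRuns; ++i) {
    engine.Predict(examples, 1, &predictions);
  }
  const auto stop = std::chrono::steady_clock::now();
  const double total_us =
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  LOG(INFO) << "Average latency: " << total_us / kRuns << " us/example";
}

// Usage: BenchmarkSingleExample(*serving_engine, *examples);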

In your benchmark:

Regarding the stream processor - can I use it as follows?

It depends on how your HTTP server works and the number of examples per request. Can you tell me more about your case?

For example, if each request is executed in a separate thread (maybe from a pool of threads) and each request only has a few examples, you don't need a stream processor. However, if you have few concurrent requests and each request contains many examples, a stream processor makes sense. Note that you should not create a new processor for each request.

For some time, we have had a plan to publish a basic HTTP+gRPC model inference server. Tell me if you are interested so we can prioritize accordingly.

To be sure: what is the reason for using an HTTP server instead of running the model in process? Note that for offline inference, there are better solutions.

Cheers, M.

Arnold1 commented 2 years ago

Hi @achoum

thanks for all the info.

It depends on how your HTTP server works and the number of examples per request. Can you tell me more about your case? To be sure: what is the reason for using an HTTP server instead of running the model in process? Note that for offline inference, there are better solutions.

I have strict latency requirements to run the online model inference:

There are 3 options I see:
Option 1: use the Go TF library -> measured inference latency is very slow, so not an option
Option 2: call yggdrasil-decision-forests C++ from Go
Option 3: add a C++ model inference server (yggdrasil-decision-forests + HTTP/gRPC) for the Go service to communicate with over HTTP or gRPC

I tried Option 2 and ran into some issues - I would like to get your feedback - any idea what goes wrong here? https://github.com/Arnold1/yggdrasil-decision-forests

For some time, we had the plan to publish a basic http+grpc model inference server. Tell me if you are interested so we can balance prioritization efficiently.

I would appreciate it if you have something you can already share. I think an HTTP server that calls the yggdrasil-decision-forests predict C++ code would be enough to start with.

achoum commented 2 years ago

Hi Arnold1,

So, in the worst case scenario, you want 50*10 predictions in 20ms. This is 10µs per example. This is likely possible for a GBT model without multi-threading (it depends on the input features though). For a Random Forest, you will likely need multi-threading or to limit the depth/number of trees.
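For illustration, limiting the size of a Random Forest at training time could look like the following with the C++ API. This is only a sketch: the proto extension and field names (random_forest_config, num_trees, decision_tree.max_depth) and the label name are assumptions to check against the random_forest learner configuration; in TF-DF the equivalent hyper-parameters are num_trees and max_depth.

#include "yggdrasil_decision_forests/learner/abstract_learner.pb.h"
#include "yggdrasil_decision_forests/learner/random_forest/random_forest.pb.h"

namespace ygg = yggdrasil_decision_forests;

// Sketch: a smaller / shallower Random Forest for faster inference.
ygg::model::proto::TrainingConfig MakeSmallRandomForestConfig() {
  ygg::model::proto::TrainingConfig train_config;
  train_config.set_learner("RANDOM_FOREST");
  train_config.set_task(ygg::model::proto::Task::CLASSIFICATION);
  train_config.set_label("my_label");  // Hypothetical label column.
  auto* rf_config = train_config.MutableExtension(
      ygg::model::random_forest::proto::random_forest_config);
  rf_config->set_num_trees(100);  // Instead of the default 300.
  rf_config->mutable_decision_tree()->set_max_depth(12);  // Limit the tree depth.
  return train_config;
}
// The config is then passed to ygg::model::GetLearner() as in the beginner example.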

Regarding your link, the error INVALID_ARGUMENT: Unknown item RANDOM_FOREST in class pool N26yggdrasil_decision_forests5model13AbstractModelE. Registered elements are exit status 1 indicates that the Random Forest model is not linked. Make sure to add a dependency to this rule or to this rule.

Regarding Go, a fourth option is to wait for us to publicly release the Go port of Yggdrasil. This is a partial re-implementation in Go of the Yggdrasil inference code. Our benchmark indicates that it is ~2x slower.

Regarding the HTTP+gRPC model inference server, I don't have anything production-ready available. This is the gRPC distribute manager. It is significantly more complex than what you need, but it shows how to create a gRPC server. Regarding the serialization format for examples, this is a possible generic solution. Note that example serialization/network/deserialization can take a significant amount of time.

I'll update this post when either the Go port is published or the http+grpc model inference server is available.

achoum commented 1 year ago

The Go port is now available :).

Arnold1 commented 1 year ago

@achoum amazing work, thank you! I think we can close this issue then :)

janpfeifer commented 1 year ago

Btw, let us know if the Go version matches your latency requirements. The Go version doesn't implement the fastest algorithm -- it would be much more complicated to implement (we welcome contributions, btw; if you are interested we can send some pointers) -- but it should still be fast.

Plus, in Go it's easily parallelizable; you can run each model in a different goroutine, no issues.