google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

performance issue with random forests #18

Closed: amir2040 closed this issue 8 months ago

amir2040 commented 2 years ago

Hi, I'm running a RANDOM_FOREST model trained with TF-DF through the Yggdrasil C++ API, and inference takes about 50 μs per example. But as you said, it probably shouldn't take more than 10 μs.

Also, running with large batches (vs. batch size = 1) or compiling with --copt=-mavx2 doesn't make a difference at all. I've used the benchmark_inference tool and the result was the same. Another interesting observation is the difference between min & max execution time per instance. Execution times (in ns) for 10 runs:

run   max       min     avg
0     2133059   17293   52250
1     1054634   16696   52982
2     1038110   14949   45468
3     1068611   16752   53064
4     1657415   16790   54514
5     1125537   16432   53145
6     1939590   17591   74354
7     2997816   17284   70325
8     1064365   16554   56063
9     1044182   16145   51429

Even if I ignore some of the first execution times (cache misses), the variance between execution times is still high:

run   max       min     avg
0     1488318   15841   51085
1     955384    16501   45567
2     928377    16370   44606
3     1018261   15124   44204
4     1429345   17299   79810
5     1628887   17539   80997
6     2126679   16487   67346
7     1058939   16616   53941
8     1098449   16242   48047
9     1103341   16659   53750

I'm wondering if there is a problem with the model or the inference setup, or if this is the best performance I can get.

Model spec: RANDOM_FOREST, 300 root(s), 618972 node(s), and 28 input feature(s). Engine: RandomForestOptPred.

Compiled with these flags: --config=linux_cpp17 --config=linux_avx2 --repo_env=CC=gcc-9 --copt=-mavx2

System spec: Ubuntu 18.04.4, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, running on an ESXi virtual machine.

achoum commented 2 years ago

Hi Amir,

Thanks for the report. Here are some thoughts.

The inference speed of the model depends on multiple factors:

  1. The most important factor is the type of model. Generally, Random Forest models are slower than Gradient Boosted Trees models. In your case, and if the quality is similar, the main solution to speed up the inference is to use a Gradient Boosted Trees model instead (see the sketch at the end of this comment). The impact of --copt=-mavx2 should be visible then.

  2. The second important factor is the size of the model. This size depends on the size of the dataset (number of examples, number of features, types of features) and on the hyper-parameters. For example, increasing "num_trees" of the Random Forest model will make it more accurate but also bigger. In this post, the dataset was small (a few hundred examples).

  3. The speed and load of your machine. Since the benchmark is not multi-threaded, the reported value is roughly the inference speed per core. Those figures should be relatively stable. Unstable results could be explained by other programs running concurrently with the benchmark, or by variance in the CPU speed (for example, if your machine is configured in power-saving mode; make sure to run benchmarks in "performance" mode).

How many examples are there in the benchmark? If this value is small, the stability of the results can be improved using more rounds (i.e. increase the value of --num_runs).
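
For illustration, here is a minimal sketch of what switching to a gradient boosted trees learner could look like with the YDF C++ training API, following the user manual example. The dataset path, the label name, and the include paths are placeholders, and exact signatures may differ between versions.

#include "yggdrasil_decision_forests/dataset/data_spec_inference.h"
#include "yggdrasil_decision_forests/dataset/vertical_dataset_io.h"
#include "yggdrasil_decision_forests/learner/learner_library.h"

namespace ydf = yggdrasil_decision_forests;

// Infer the dataspec and load the training data (placeholder path).
ydf::dataset::proto::DataSpecification data_spec;
ydf::dataset::CreateDataSpec("csv:/path/to/train.csv", false, {}, &data_spec);
ydf::dataset::VerticalDataset train_ds;
QCHECK_OK(ydf::dataset::LoadVerticalDataset("csv:/path/to/train.csv",
                                             data_spec, &train_ds));

// Same training setup as for a Random Forest, but with the
// GRADIENT_BOOSTED_TREES learner instead of RANDOM_FOREST.
ydf::model::proto::TrainingConfig config;
config.set_learner("GRADIENT_BOOSTED_TREES");
config.set_task(ydf::model::proto::Task::CLASSIFICATION);
config.set_label("my_label");  // Placeholder label column.

std::unique_ptr<ydf::model::AbstractLearner> learner;
QCHECK_OK(ydf::model::GetLearner(config, &learner));
const auto model = learner->TrainWithStatus(train_ds).value();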

amir2040 commented 2 years ago

Thanks for the reply and the suggestions. There are about 19k examples in the benchmark, and using more rounds didn't make a difference. I'm using performance mode and an isolated core to run the benchmark. The number of records in the dataspec is about 75k. So the question about the huge difference between min, max, and average still remains. If I could get the average execution time near the min (17 μs), that would be ideal. Maybe some of the longer execution times are due to cache misses (especially for the first examples), but the following runs still take way more than the min.


Update: after some observation, I've noticed that the execution times for a specific example are quite stable, but some examples take more time than others. It should be because of differences in leaf depth. A file containing the execution times (in nanoseconds) is attached.

sbs.txt

achoum commented 2 years ago

Can you do the following:

  1. Train and measure the speed of a gradient boosted trees model.
  2. Report the node depth histogram of your Random Forest model. It is printed in the model description (C++, CLI).
  3. Since you are not using the benchmark_inference tool, can you share details about how you implemented your benchmark? Notably, the file you shared seems to contain the timing of individual model calls (instead of an average over multiple calls). Make sure the timing code itself does not contaminate your benchmark (see the sketch after this list).
  4. For completeness, can you also share the performance reported by the benchmark_inference tool?
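
For example, a lower-noise measurement (a sketch, assuming an engine, a pre-filled examples batch, and a predictions vector built with the C++ serving API) times one large block of calls and divides, so the clock calls and any logging stay outside the measured region:

#include <ctime>
#include <iostream>

// Placeholder number of inference calls to average over.
constexpr int kNumCalls = 10000;

struct timespec begin, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &begin);
for (int i = 0; i < kNumCalls; ++i) {
  // One example per call, matching the batch-size-1 scenario.
  engine->Predict(*examples, /*num_examples=*/1, &predictions);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

const long total_ns = (end.tv_sec - begin.tv_sec) * 1000000000L +
                      (end.tv_nsec - begin.tv_nsec);
std::cout << "avg ns / call: " << total_ns / kNumCalls << std::endl;
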
amir2040 commented 2 years ago

Thanks for the suggestions.

1. I'll be working on it.

2. Depth by leafs:
Count: 309636 Average: 12.1423 StdDev: 2.35022
Min: 3 Max: 15 Ignored: 0
[  3,  4)    15   0.00%   0.00%
[  4,  5)   128   0.04%   0.05%
[  5,  6)   754   0.24%   0.29%
[  6,  7)  2882   0.93%   1.22%
[  7,  8)  6945   2.24%   3.46% #
[  8,  9) 13921   4.50%   7.96% ##
[  9, 10) 22572   7.29%  15.25% ###
[ 10, 11) 30770   9.94%  25.19% #####
[ 11, 12) 38115  12.31%  37.50% ######
[ 12, 13) 42086  13.59%  51.09% ######
[ 13, 14) 43319  13.99%  65.08% #######
[ 14, 15) 42147  13.61%  78.69% ######
[ 15, 15] 65982  21.31% 100.00% ##########

3. I've followed the example in the user manual: loading the model, building the fast engine, getting the features by name, allocating and setting the example, and finally calling Predict (the part I measure); a sketch of this setup is included after point 4 below. I can run it with different batch sizes, but using a different batch size (num_examples) doesn't change the execution time. I've done another benchmark where I set one example and then call Predict about 10 times, something like this:

// Assumes engine, examples, and predictions are set up as in the user manual
// serving example, and ITERATION is the number of timed calls.
struct timespec begin, end;
long min = LONG_MAX, max = 0, avg = 0;
for (int iter = 0; iter < ITERATION; iter++) {
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &begin);
  engine->Predict(*examples, /*num_examples=*/1, &predictions);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
  const long seconds = end.tv_sec - begin.tv_sec;
  const long nanoseconds = end.tv_nsec - begin.tv_nsec;
  const long exec_time = seconds * 1000000000L + nanoseconds;
  if (iter != 0) {  // Skip the first (warm-up) call.
    avg += exec_time;  // Accumulated sum; divided by (ITERATION - 1) afterwards.
    min = exec_time < min ? exec_time : min;
    max = exec_time > max ? exec_time : max;
  }
}

With this, the execution time comes down to about 35 μs.

4. Results with these options (changing them doesn't have much impact on the execution time): --generic=false --batch_size=32 --num_runs=40 --warmup_runs=5

time/example(us)   time/batch(us)   method
47.389             1516.4           RandomForestOptPred [virtual interface]
63.123             2019.9           RandomForestGeneric [virtual interface]
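
For reference, a rough sketch of the setup described in point 3, following the C++ serving example in the YDF user manual (the model path and the feature name are placeholders, and exact signatures may differ between versions):

#include "yggdrasil_decision_forests/model/model_library.h"
#include "yggdrasil_decision_forests/serving/example_set.h"

namespace ydf = yggdrasil_decision_forests;

// Load the model and compile it into the fastest compatible serving engine
// (RandomForestOptPred in this case).
std::unique_ptr<ydf::model::AbstractModel> model;
QCHECK_OK(ydf::model::LoadModel("/path/to/model", &model));
const auto engine = model->BuildFastEngine().value();
const auto& features = engine->features();

// Resolve feature handles once, outside of the timed region.
const auto feature_a = features.GetNumericalFeatureId("feature_a").value();

// Allocate a batch, fill default values, then set the feature values.
auto examples = engine->AllocateExamples(/*num_examples=*/1);
examples->FillMissing(features);
examples->SetNumerical(/*example_idx=*/0, feature_a, 0.5f, features);

// The call being timed in the loop above.
std::vector<float> predictions;
engine->Predict(*examples, /*num_examples=*/1, &predictions);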

achoum commented 8 months ago

Thanks for the details and the patience :).

1. 50 μs / example / core is slow, but it is plausible for a large model. As I mentioned before, using gradient boosted trees would lead to significantly faster models.

2. The histogram you shared indicates that most of the forest paths go to full depth (15). While this is common on large datasets, on small datasets it can indicate that the model is spending a lot of its capacity on features that do not generalize. This type of problem is typically observed when feeding string IDs (e.g. example ID, user ID) to the model: example IDs are not generalizable. Looking at the other histograms in the model description can help you identify which feature is at fault (a sketch of printing the description from C++ is at the end of this comment).

3. This benchmark code has a few issues.

A simple and accurate way to run a benchmark is to use the model.benchmark() method in the Python API: https://ydf.readthedocs.io/en/latest/tutorial/getting_started/#benchmark-model-speed
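
Regarding point 2, here is a sketch of how the model description (including the depth and feature histograms) can be printed from C++; the method name is taken from the AbstractModel API and may differ between versions:

// Assumes model is a loaded ydf::model::AbstractModel, as in the earlier
// sketches. The show_model CLI tool prints similar information.
std::string description;
model->AppendDescriptionAndStatistics(/*full_definition=*/false, &description);
std::cout << description;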