google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Can Yggdrasil run on 8-bit? #20

Closed JoseAF closed 9 months ago

JoseAF commented 2 years ago

Hi

I am using a GBT model with the quickscorer algorithm. For training, I create a yggdrasil_decision_forests::dataset::VerticalDataset and use the AppendExample interface for which I have to convert the features to string (very slow!). Then I train an AbstractLearner using the dataset. For prediction, I cast the abstract model to GradientBoostedTreesBinaryClassificationQuickScorerExtended and create an ExampleSet which I then use to predict. So far I'm using the SetNumerical interface, which takes float, for creating the example set.

Since all my features come as uint8, I was wondering whether it is possible to use the Yggdrasil library (both training and prediction) in 8-bit mode directly, without having to convert either to string or to float? I haven't found interfaces to do this. If they are not available, is there a way to change/template the code to make this possible? Has anyone done this already? Do you envision problems with this?

Many thanks.

Arnold1 commented 2 years ago

@JoseAF do you have some code to show? there is https://github.com/achoum/ardwino-tensorflow-decision-forests btw

JoseAF commented 2 years ago

@Arnold1 thanks for the reply but I'm currently using the C++ library directly without intermediaries

achoum commented 2 years ago

Hi,

Following is a copy of the email exchange and the follow-up.

Email answer:

YDF supports float32 (called NUMERICAL) and int16 (called DISCRETIZED_NUMERICAL) representations for numerical features. While DISCRETIZED_NUMERICAL is faster and consumes less memory, it also has some usage caveats (i.e. float32 is easier to handle).

Would casting your uint8 values to float32 or int16 work in your case?

Regarding the AppendExample interface (for training), you have two faster solutions:

1) Use the version of AppendExample that takes as input a proto::Example.

2) Allocate the VerticalDataset and directly fill the std::vector that contains the numerical values of your feature (accessor 1 and 2); see the sketch after this email excerpt.

End of email answer:
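
For reference, here is a rough sketch of option 2) above: filling a numerical column of the VerticalDataset directly instead of going through string conversion. The accessor names (MutableColumnWithCast, NumericalColumn::mutable_values) are from memory and may differ slightly from the current API, so treat this as an illustration rather than exact code.

#include <cstdint>
#include <vector>

#include "yggdrasil_decision_forests/dataset/vertical_dataset.h"

namespace ydf = yggdrasil_decision_forests;

// Fills one NUMERICAL column of a VerticalDataset directly. Assumes the
// dataspec is already set and CreateColumnsFromDataspec() has been called.
void FillNumericalColumn(ydf::dataset::VerticalDataset* dataset, int column_idx,
                         const std::vector<uint8_t>& raw_values) {
  auto* column = dataset->MutableColumnWithCast<
      ydf::dataset::VerticalDataset::NumericalColumn>(column_idx);
  auto* values = column->mutable_values();
  values->reserve(values->size() + raw_values.size());
  for (const uint8_t v : raw_values) {
    values->push_back(static_cast<float>(v));
  }
}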

For context, are you trying to increase the speed of training, inference or both (and if so, which one is currently the bottleneck)?

  1. Example of discretized_numerical

This is done by manually specifying the type of a feature as DISCRETIZED_NUMERICAL in the DataspecGuide. This guide is an argument of the CreateDataSpec function (C++ API, example) or the infer_dataspec command (CLI API, example). In the guide, you can set individual feature types, or force all the features to be detected as discretized.

In C++, this looks as follows:

// Force numerical features to be detected as DISCRETIZED_NUMERICAL.
proto::DataSpecificationGuide guide;
guide.set_detect_numerical_as_discretized_numerical(true);
ygg::dataset::CreateDataSpec(train_dataset_path, false, /*guide=*/guide, &dataspec);

The precision of discretized features is controlled by DiscretizedNumericalIndex. In your case, you can probably change it to 8 bits.

Discretized numerical features will speed up training, but they currently have no impact on inference.

  2. Inference speed-up

The inference code is here.

Using 8-bit numerical features would likely speed up the inference code (to be benchmarked), but this is currently not implemented (i.e. this is a feature request). If you are interested, we can discuss it, but it is unlikely we will be able to implement it in the short term.

Note: The model inference code is complex because it supports all the types of models and conditions. The code for a specific model using only numerical features would be significantly simpler. Also, code specialized for a specific model / data distribution could possibly be even faster. Maybe we could do that for your case (I can provide some guidance).
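
To give an idea of how much simpler a specialized routine can be, here is a hypothetical standalone sketch (not the YDF data structures) of inference for a GBT whose conditions are all of the form "feature >= threshold" on uint8 numerical features:

#include <cstdint>
#include <vector>

// Hypothetical flat node layout for a single tree: internal nodes test
// "feature_value >= threshold"; the "false" child is the next node in the
// array and the "true" child index is stored explicitly.
struct Node {
  bool is_leaf;
  uint8_t feature;      // Index of the tested feature.
  float threshold;      // Condition: feature_value >= threshold.
  float value;          // Leaf output (only used if is_leaf).
  uint32_t true_child;  // Index of the node taken when the condition holds.
};

// Sums the leaf outputs of all trees for one example described by its uint8
// feature values. The activation (e.g. sigmoid) is left to the caller.
float PredictExample(const std::vector<std::vector<Node>>& trees,
                     const std::vector<uint8_t>& features) {
  float accumulator = 0.f;
  for (const auto& tree : trees) {
    uint32_t node_idx = 0;
    while (!tree[node_idx].is_leaf) {
      const Node& node = tree[node_idx];
      node_idx = (features[node.feature] >= node.threshold) ? node.true_child
                                                            : node_idx + 1;
    }
    accumulator += tree[node_idx].value;
  }
  return accumulator;
}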

JoseAF commented 2 years ago

Hi

Thanks a lot for the quick reply. Your comments are very helpful.

I'm interested in speeding up both training and inference. I have yet to look into training performance but your comments about using discretized_numerical and also loading the data in bulk without string conversion are good ideas that will surely help. I'll give those a go soon.

Regarding inference, it's a pity that an 8-bit option is not available. I would be interested in pursuing the options you mention at the end of the message if you have the time to provide some guidance. Let me know!

Thanks again for your time.

achoum commented 2 years ago

Hi Jose,

I'm replying here with what might be valuable to other users. I'll reply to your email with specifics.

Here are some initial thoughts on how to speed up the inference. Happy to discuss / detail them further afterwards.

1. The first and simplest approach to speed up the model inference is to multi-thread the inference code. We don't do it in the inference code directly as it is generally more efficient to do it in the caller library. Note that the inference code is thread safe. Here is how the model evaluation code uses multi-threading to speed up the model inference (a minimal sketch follows after point 3).

2. Inference on GPU of decision forests is very fast (but it requires a GPU). In some old experiments I did, I observed a ~30x speed-up when comparing the CPU version (single-threaded) with a basic GPU implementation.

3. There is often a negative relation between the model size and the inference speed, and a positive relation between the model size and the model quality: larger models are generally slower but more accurate. If the model speed is important to you, you need to take it into account in your modeling, i.e. don't just optimize for quality.
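
Regarding point 1, a minimal caller-side multi-threading sketch (plain C++; the actual YDF engine call is left as a placeholder, not a YDF API) could look like this:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Splits the examples into contiguous blocks and runs the (thread-safe)
// inference on each block in its own thread. "predict_block" stands for
// whatever function runs the YDF engine on examples [begin, end).
void ParallelPredict(
    std::size_t num_examples, std::size_t num_threads,
    const std::function<void(std::size_t, std::size_t)>& predict_block) {
  std::vector<std::thread> workers;
  const std::size_t block_size = (num_examples + num_threads - 1) / num_threads;
  for (std::size_t begin = 0; begin < num_examples; begin += block_size) {
    const std::size_t end = std::min(begin + block_size, num_examples);
    workers.emplace_back([=] { predict_block(begin, end); });
  }
  for (auto& worker : workers) {
    worker.join();
  }
}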

Let's focus on the CPU implementation from the simplest to the most complex solutions:

First, make sure to profile the inference code on your dataset to have an idea of where the time is spent. There are likely specific optimizations for your dataset/model. For example, the logic for oblique conditions is very different from the logic for axis-aligned conditions.

4. If I understand correctly, you apply your model on a set of overlapping image patches (like the Kinect? :)). In this case, optimizing the data IO might be important. Note that if the inference is done on GPU, the patch extraction logic can be incorporated into the same GPU code that performs the inference.

5. In the documentation of the inference engine, the example values are fed one value at a time (e.g. using SetNumerical). This is generally not a large cost. However, if you have a lot of feature values, it will be significantly more efficient to feed feature values directly into the InternalCategoricalAndNumericalValues buffer.

This is probably the inference code used by your model (verify with profiling). The "PredictQuickScorerSequential" function is equivalent but less efficient (though simpler to understand).

6. If your dataset only contains numerical features, you can remove the "Dense contains conditions." section. If you don't train your model with oblique splits, you can remove the "Sparse contains conditions." section.

Let's assume that your model uses the "Is higher conditions." section i.e. your model only contains conditions of the type "x >= threshold".

The loop "for (const auto& item : is_higher_condition.items) {" checks all the unique threshold values (4 at a time) in order until one of them is greater than the feature value in the example. See the QS papers for more details.
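
To make the "4 at a time" part concrete, here is an illustrative SSE snippet (a simplification, not the actual YDF code):

#include <xmmintrin.h>  // SSE intrinsics.

// Compares one feature value against 4 thresholds in a single instruction and
// returns a 4-bit mask whose i-th bit is set if feature_value >= thresholds[i].
// Illustration only: the real quick-scorer code applies a precomputed leaf
// mask per passed threshold instead of returning a comparison mask.
inline int CompareAgainstFourThresholds(const float* thresholds,
                                        float feature_value) {
  const __m128 t = _mm_loadu_ps(thresholds);    // Load 4 thresholds.
  const __m128 v = _mm_set1_ps(feature_value);  // Broadcast the feature value.
  return _mm_movemask_ps(_mm_cmpge_ps(v, t));   // 4 comparisons at once.
}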

7. If your CPUs have support for AVX-512, you can probably check 8 thresholds at a time.

8. If the number of possible feature values is small (256 in your case), it could be (always profile to make sure) more efficient to have an array with 256 values that contains the end index of this for loop.

So instead of doing:

for threshold in thresholds
    if threshold > feature_value
        stop
    apply mask

You would do:

end_threshold_idx = NEW_TABLE[feature_value]
for i in 0...end_threshold_idx
    apply mask
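
A hypothetical standalone sketch of how such a table could be built for an 8-bit feature, assuming the thresholds of one "is higher" condition are available as a sorted vector (none of the names below are YDF APIs):

#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Precomputes, for every possible uint8 feature value, how many thresholds of
// one "is higher" condition it passes. "thresholds" must be sorted ascending.
std::array<uint16_t, 256> BuildEndIndexTable(const std::vector<float>& thresholds) {
  std::array<uint16_t, 256> table;
  for (int value = 0; value < 256; ++value) {
    // Number of thresholds t with t <= value, i.e. conditions "value >= t"
    // that are true.
    table[value] = static_cast<uint16_t>(
        std::upper_bound(thresholds.begin(), thresholds.end(),
                         static_cast<float>(value)) -
        thresholds.begin());
  }
  return table;
}

// At inference time, the per-threshold comparison disappears:
//   const uint16_t end_threshold_idx = table[feature_value];
//   for (uint16_t i = 0; i < end_threshold_idx; ++i) ApplyMask(i);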

janpfeifer commented 2 years ago

Just to add one extra obvious approach:

9. Your problem seems to be highly parallelizable: if running in multiple threads (point 1 above) is not enough, you may want to use something like MPI to further parallelize inference across different CPUs.
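
A minimal sketch of that idea with MPI (hypothetical: it only shows how examples could be partitioned across processes; the inference call itself is a placeholder):

#include <mpi.h>

#include <algorithm>

// Placeholder for the local (possibly multi-threaded) YDF inference on
// examples [begin, end); not a YDF API.
void RunInferenceOnSlice(int begin, int end) { /* ... call the engine ... */ }

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, world_size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  const int num_examples = 1000000;  // Replace with the real example count.
  const int block = (num_examples + world_size - 1) / world_size;
  const int begin = rank * block;
  const int end = std::min(num_examples, begin + block);
  RunInferenceOnSlice(begin, end);
  // Gather or write out the per-process predictions as needed.

  MPI_Finalize();
  return 0;
}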

janpfeifer commented 2 years ago

Also:

10. Very out there, but: we have never braved doing inference on FPGAs, though we suspect it could be made very fast and efficient, in case you are aiming at dedicated hardware.

JoseAF commented 2 years ago

Hi

First of all, thanks a lot for all the comments and suggestions. I have followed some of them and am happy enough for now with inference so I've moved on to improving the training speeds.

Currently I'm a bit stuck on trying to implement an 8-bit solution for training and would appreciate some direction. I'm following the example of discretized numerical mentioned above. I have already changed DiscretizedNumericalIndex to 8 bits. However, since I create my dataset directly in code instead of importing it from a file, it seems I cannot really use the CreateDataSpec interface and I have been trying to work on an alternative.

At the moment, I create a dataspec, add the columns and set their type to DISCRETIZED_NUMERICAL using the yggdrasil_decision_forests::dataset::AddColumn method call. Then, it seems I need to set the boundaries per column using each column's histogram (its values and counts). For this, I have been trying to use the GenDiscretizedBoundaries interface but I'm unclear about whether I'm using it properly. Say I have 3 columns with values:

Column1: { 9, 11, 9, 9, 23, 50, 10 }
Column2: { 11, 23, 12, 15, 10, 55, 12 }
Column3: { 8, 12, 12, 12, 8, 8, 8 }

Histogram1: { (9, 3), (10, 1), (11, 1), (23, 1), (50, 1) }
Histogram2: { (10, 1), (11, 1), (12, 2), (15, 1), (23, 1), (55, 1) }
Histogram3: { (8, 4), (12, 3) }

Assuming all columns will always contain 8-bit values, how should I call this method for each column? Is this correct?

For Column 1: auto Bounds1 = GenDiscretizedBoundaries(Histogram1, /*maximum_num_bins*/ 5, /*min_obs_in_bins*/ 1, {});
For Column 2: auto Bounds2 = GenDiscretizedBoundaries(Histogram2, 6, 1, {});
For Column 3: auto Bounds3 = GenDiscretizedBoundaries(Histogram3, 2, 3, {});

After getting the boundaries, I add them to each column using this interface:

*ColumnX->mutable_discretized_numerical()->mutable_boundaries() = { vBoundsX.begin(), vBoundsX.end() };

Then, I'm not certain if I need to initialize the max number of bins per column and the original number of unique values per column. Assuming that I do, I call (following the toy example above):

Column1->mutable_discretized_numerical()->set_maximum_num_bins(5);
Column1->mutable_discretized_numerical()->set_original_num_unique_values(5);

Column2->mutable_discretized_numerical()->set_maximum_num_bins(6);
Column2->mutable_discretized_numerical()->set_original_num_unique_values(6);

Column3->mutable_discretized_numerical()->set_maximum_num_bins(2);
Column3->mutable_discretized_numerical()->set_original_num_unique_values(2);

This dataspec is then added to the dataset:

Dataset.set_data_spec(DataSpec);
Dataset.CreateColumnsFromDataspec();

And the dataset is then populated directly.

Following the above algorithm, however, causes TrainWithStatus to break with no messages. I'm pretty sure the problem is with the creation of the dataspec so I assume I've done something wrong somewhere. Any suggestions would be greatly appreciated.

Sorry for the long message.

Arnold1 commented 2 years ago

@JoseAF do you have a GitHub with code to share?

JoseAF commented 2 years ago

@Arnold1 I'm afraid I don't - sorry

JoseAF commented 2 years ago

A short follow up:

The only way I have managed to get this to run through is by setting all boundaries the same for all features, choosing the one with maximum resolution and applying it to the rest (e.g. in the example above, applying the boundaries of column 2 to all of them). In this case, as I say, the training runs fine with no errors. However, the results are worse than with float (i.e. higher errors and worse visually). I would not expect any changes to the errors, though, as the discretized feature values are exactly the same as the original ones. Could I perhaps be dealing with a bug here?

JoseAF commented 2 years ago

Ok - reverting to the 16-bit implementation fixes all the issues. It seems the code is not yet ready to handle 8-bit.

Thanks for listening :)

achoum commented 9 months ago

Support for fast inference with 8-bit precision has been implemented. Here is an example.