tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

TF 2.7.0 and TFDF 0.2.1 not compatible warning messages #67

Closed fbadine closed 1 year ago

fbadine commented 2 years ago

Hi,

I was working on a notebook in Colab using TF 2.6 and TFDF 0.1.9. When Colab switched to TF 2.7.0 and TFDF 0.2.1 was released, I started getting the warning below, although those two versions are compatible according to https://www.tensorflow.org/decision_forests/known_issues

The warning messages received in Colab are:

WARNING:root:Failure to load the custom c++ tensorflow ops. This error is likely caused the version of TensorFlow and TensorFlow Decision Forests are not compatible.
WARNING:root:TF Parameter Server distributed training not available.

The versions are:

TensorFlow Version: 2.7.0
TensorFlow Decision Forests: 0.2.1
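As a side note, the pairing from the known_issues page can be checked mechanically. A minimal sketch, hardcoding only the two version pairs that appear in this thread (the table on that page is the source of truth):

```python
# Illustrative TF / TF-DF compatibility lookup, limited to the pairs
# discussed in this issue; the known_issues page is authoritative.
COMPATIBLE = {
    "0.1.9": "2.6",  # TF-DF 0.1.9 pairs with TF 2.6.x
    "0.2.1": "2.7",  # TF-DF 0.2.1 pairs with TF 2.7.x
}

def is_compatible(tfdf_version: str, tf_version: str) -> bool:
    """True if the installed TF major.minor matches the TF-DF pairing."""
    expected = COMPATIBLE.get(tfdf_version)
    if expected is None:
        return False
    return tf_version.startswith(expected + ".")

print(is_compatible("0.2.1", "2.7.0"))  # True: the versions in this report
print(is_compatible("0.1.9", "2.7.0"))  # False: mismatched pairing
```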

achoum commented 2 years ago

Hi Fadi,

Thanks for the report. Do you see an error message after those warnings? If so, can you share it? Colab instances need to be restarted when libraries (e.g. TF or TF-DF) are updated. Can you check that this effectively happened?
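For completeness, an upgrade cell of the kind used in Colab (a sketch; the version pins are the pairing from this thread, and the runtime must then be restarted via Runtime > Restart runtime for the new wheels to actually be loaded):

```shell
# Install the matching TF / TF-DF pair, then restart the runtime
# before re-running any imports.
pip install -q --upgrade tensorflow==2.7.0 tensorflow_decision_forests==0.2.1
```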

Cheers, M.

fbadine commented 2 years ago

No error message. Just the warnings.

In fact, I noticed them while running the following notebook and found that training takes much longer on the newer TF and TFDF releases, so I wondered whether the warning had something to do with it.

I re-ran the notebook using TF 2.6 and TFDF 0.1.9 and then again with TF 2.7 and TFDF 0.2.1.

Here are the results:

TF 2.6 / TFDF 0.1.9 run:
CPU times: user 51min 31s, sys: 19.1 s, total: 51min 50s
Wall time: 27min 47s

TF 2.7 / TFDF 0.2.1 run:
CPU times: user 2h 14min 21s, sys: 24.8 s, total: 2h 14min 46s
Wall time: 1h 11min 33s
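From the wall times above, that is roughly a 2.6x slowdown; checking the arithmetic:

```python
def to_seconds(h=0, m=0, s=0):
    """Convert an h/m/s wall time into seconds."""
    return h * 3600 + m * 60 + s

old = to_seconds(m=27, s=47)       # TF 2.6 / TFDF 0.1.9 wall time
new = to_seconds(h=1, m=11, s=33)  # TF 2.7 / TFDF 0.2.1 wall time
print(f"{new / old:.2f}x slower")  # → 2.58x slower
```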

hongkahjun commented 2 years ago

Similar issue for me using TF 2.7.0 with TFDF 0.2.1, but running in Docker on my own machine (the tensorflow/tensorflow:latest image).

and I got

WARNING:root:Failure to load the custom c++ tensorflow ops. This error is likely caused the version of TensorFlow and TensorFlow Decision Forests are not compatible.

and

[INFO kernel.cc:852] Use slow generic engine when loading my model

Also, no errors, only warnings.

achoum commented 2 years ago

Thanks both for the details! :)

Warning message

It seems the "Failure to load the custom c++ tensorflow ops" warning is a false positive: TF-DF fails to load the custom op for distributed training, which is currently expected in the pre-built release. This has no impact on non-distributed training.

This warning will be removed in the next version.
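Until that version ships, the false positive can be muted. A sketch that assumes the message is emitted through Python's root logger, as the WARNING:root: prefix suggests:

```python
import logging

class DropTfdfOpsWarning(logging.Filter):
    """Drop the known false-positive TF-DF custom-ops warning."""
    def filter(self, record: logging.LogRecord) -> bool:
        # Return False (suppress) only for this specific message.
        return "Failure to load the custom c++ tensorflow ops" not in record.getMessage()

# The message is logged via the root logger ("WARNING:root:..."), so
# attach the filter there before importing tensorflow_decision_forests.
logging.getLogger().addFilter(DropTfdfOpsWarning())
```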

fbadine : Increase in training time.

This is not great and somewhat unexpected. On the other hand, it should not be related to the distributed-training op.

Could you share both Colab runs on this dataset, with "with sys_pipes():" around the "model.fit" call, as well as the model summary?
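For reference, the requested pattern comes from the TF-DF tutorials, where sys_pipes is wurlitzer's context manager that forwards the C++ training logs into the notebook output. Sketched here with stand-ins for sys_pipes and the model so the shape of the call is runnable without TF installed:

```python
from contextlib import contextmanager

@contextmanager
def sys_pipes():
    # Stand-in for wurlitzer.sys_pipes(), which redirects C-level
    # stdout/stderr (the TF-DF training logs) into the cell output.
    yield

class StubModel:
    # Stand-in for a tfdf.keras model; only the call shape matters here.
    def fit(self, dataset):
        print("training on", dataset)

model, train_ds = StubModel(), "train_ds"

# The requested pattern: wrap model.fit so the full training log
# is captured in the cell output.
with sys_pipes():
    model.fit(train_ds)
```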

hongkahjun

Background: once the model is trained, TF-DF looks for the fastest algorithm available to run it. Multiple implementations are available, with different speeds and coverage. If the only compatible one is the generic engine (100% coverage but relatively slow), you get this warning message. Note that the model is still perfectly usable, but inference will be slower than with the other engines. This has no impact on training speed.

This situation should be relatively rare, e.g. when you create the model by hand or use some specific hyper-parameter combinations.

This could also be an error. Could you share the model summary with us?

achoum commented 2 years ago

A bit of extra details about the training speed.

In a public Colab, I trained the exact same model 8 times (checking equality of the final model structure) and obtained the following wall times (in seconds): 19.5, 7.35, 19.1, 16.0, 7.9, 10.6, 7.32, 7.41.
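Summarizing those eight runs:

```python
import statistics

# Wall times (seconds) of 8 identical trainings on a public Colab, from above.
times = [19.5, 7.35, 19.1, 16.0, 7.9, 10.6, 7.32, 7.41]

print(f"min {min(times)}s, max {max(times)}s")                              # → min 7.32s, max 19.5s
print(f"mean {statistics.mean(times):.1f}s, stdev {statistics.stdev(times):.1f}s")  # → mean 11.9s, stdev 5.4s
```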

Because the machines are shared among multiple users, training time on public colabs seems to have a lot of variance.

hongkahjun commented 2 years ago

Hi achoum, thanks for the reply and clarifications.

Apologies, I didn't mention before that I wasn't using Colab; I was running this on my local machine via Docker. I actually created the model by hand using the builder example from one of the notebooks, so I do not have a model.fit call, and the [INFO kernel.cc:852] Use slow generic engine warning appears when I initialize the builder and load it from disk.

Is there a way to speed up inference by choosing the algorithm in some way?

fbadine commented 2 years ago

tfdf_0.1.9_training.txt tfdf_0.2.1_training.txt

Thanks @achoum!!

Attached are the model training logs (with sys_pipes() around model.fit, as well as the model.summary) for both the 0.1.9 and 0.2.1 runs.

As for the training-time variance on public Colab, it might be the cause, but my results so far show that with 0.1.9 it is always under an hour: the attached run took 40 minutes, versus 27 minutes for the previous run. With 0.2.1 it is always over an hour; the fastest I got in my tests was 1h 10min.

achoum commented 2 years ago

Thanks @fbadine

There might be an issue, and your logs will be useful. I'll keep you posted.

@hongkahjun

This part of the API is not very clear :|.

The most likely reason the fast engine is not used with manually created models is that the structure of the model does not look as if it was trained with global imputation to handle missing values.

Essentially, this means that certain relations between the default evaluation values of the conditions, the threshold values, and the dataspec (the optional constructor argument of the model builder) must hold.

However, instead of me explaining how to best optimize it, let's just wait for the next release. I've updated the model builder: now, if the dataspec is not set, it will be automatically configured so that the model runs fast (in most situations).

hongkahjun commented 2 years ago

Thanks achoum, I will watch for it in future releases!

achoum commented 2 years ago

The optimization was integrated in the release. The "Use slow generic engine" message should no longer be printed for a model built with the builder API, unless you explicitly specify the default evaluation value.

hongkahjun commented 2 years ago

thanks!

rstz commented 1 year ago

Closing this as obsolete