intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

[Orca] 2.2.0 doesn't converge for tensorflow2 estimator evaluate #7386

Open hkvision opened 1 year ago

hkvision commented 1 year ago

Reported by the customer: after upgrading the BigDL package from 2.1.0b20220519 to 2.2.0, the results of tf2.Estimator.evaluate() look incorrect.

2.1.0b20220519 (looks correct: all values are less than one)
[{'validation_ndcg@5': 0.19312456250190735, 'validation_ndcg@10': 0.27864694595336914, 'validation_ndcg': 0.39072680473327637, 'validation_MAP@5': 0.1501678079366684, 'validation_MAP@10': 0.18642768263816833, 'validation_MAP': 0.2198168933391571}]

2.2.0 (incorrect: values exceed one)
{'validation_ndcg@5': 4.35667610168457, 'validation_ndcg@10': 6.832152843475342, 'validation_ndcg': 11.26124095916748, 'validation_MAP@5': 3.294984817504883, 'validation_MAP@10': 4.330305099487305, 'validation_MAP': 5.608747959136963}

@lalalapotter @sgwhat Could you check whether we changed anything between 20220519 and 2.2.0?

sgwhat commented 1 year ago

Does this happen after direct training, or after loading a model and then training?

hkvision commented 1 year ago

> Does this happen after direct training, or after loading a model and then training?

Still waiting for more information from the customer.

hkvision commented 1 year ago

A possible suspect: when converting a Spark DataFrame to XShards, if we use batch_size as the shard size, each sub-partition is smaller. Could this affect the calculation of order-sensitive metrics such as NDCG and MAP? Related PR: https://github.com/intel-analytics/BigDL/pull/6879


Update: this won't affect the result, since whether the data is one large partition or several sub-partitions, we eventually combine all the data on each worker into a single ndarray -> TFDataset and pass it to model.evaluate. Using batch_size as the shard size mainly benefits predict, since only one sub-partition of size batch_size is predicted at a time.
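For context on why partitioning was a plausible suspect: NDCG is sensitive to the order of items within each ranked list, so splitting a list across sub-partitions could in principle change the score. A minimal pure-Python sketch of NDCG@k (illustrative only; not the tensorflow_ranking implementation):

```python
import math

def dcg_at_k(relevances, k):
    # relevances: graded relevance scores, in the order the model ranked the items
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a correct NDCG always lies in [0, 1].
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
```

A perfectly ordered list scores 1.0 and any valid NDCG is at most 1.0, which is why the 2.2.0 values above (e.g. 4.35 for NDCG@5) are clearly wrong rather than merely different.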

nyamashi commented 1 year ago

@hkvision Thanks for adding the issue!

Here is the case when loading the trained model before evaluate(); the code is as follows:

# est.load(model_path) # for < v2.2.0
est.load_checkpoint(model_path) # for v2.2.0

stats = est.evaluate(
    sdf_valid,
    batch_size=batch_size,
    feature_cols=feature_cols,
    label_cols=label_cols,
    num_steps=1, 
    verbose=False
)
hkvision commented 1 year ago

From our tests, defining a model in pure tf.keras with NDCG does not hit this problem. Test example: https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/NCF/tf_train_spark_dataframe.py Result: (it seems the metric values are accumulated during evaluate, but at the very end TF automatically divides by the number of workers)

95/97 [============================>.] - ETA: 0s - loss: 7.0944 - accuracy: 2.0461 - auc: 4.9222 - precision: 2.0317 - recall: 9.9705 - NDCG@5: 2.0325
96/97 [============================>.] - ETA: 0s - loss: 7.0945 - accuracy: 2.0454 - auc: 4.9190 - precision: 2.0310 - recall: 9.9706 - NDCG@5: 2.0318
97/97 [==============================] - ETA: 0s - loss: 7.0945 - accuracy: 2.0455 - auc: 4.9188 - precision: 2.0310 - recall: 9.9708 - NDCG@5: 2.0318
97/97 [==============================] - 15s 99ms/step - loss: 0.7094 - accuracy: 0.2045 - auc: 0.4919 - precision: 0.2031 - recall: 0.9971 - NDCG@5: 0.2032
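The log above is consistent with per-worker metric values being summed during evaluation and only divided by the worker count at the very end (note 7.0945 / 0.7094 ≈ 10). A toy sketch of that hypothesis; the worker count of 10 is an assumption inferred from the ratio, not confirmed from the code:

```python
num_workers = 10            # assumed: inferred from 7.0945 / 0.7094 ≈ 10
per_worker_loss = 0.70945   # the value each worker computes locally

# During evaluation, the progress bar shows the running sum across workers...
accumulated = per_worker_loss * num_workers

# ...and only the final summary line divides by the worker count.
final_reported = accumulated / num_workers

print(accumulated)      # matches the inflated in-progress log lines
print(final_reported)   # matches the final log line
```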

But the issue persists when using tfrs; we are still looking into it. Test example: https://github.com/intel-analytics/BigDL/blob/main/python/friesian/example/listwise_ranking/listwise_ranking.py

Possible directions for this issue: roll back to the BigDL release from 2022/05 to see whether the issue exists there; check the tf/tfrs/tfr versions.

hkvision commented 1 year ago

Hi @nyamashi, are you still using the package versions below?

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_ranking as tfr

print(tf.__version__)
print(tfr.__version__)
print(tfrs.__version__)

2.7.0
0.5.0.dev
v0.6.0
nyamashi commented 1 year ago

@hkvision Thanks for your support!

I've confirmed the tf-related versions:

print(tf.__version__)
print(tfr.__version__)
print(tfrs.__version__)

- current:

2.11.0
0.5.1.dev
v0.7.2

- 2022/05:

2.9.0
0.5.0.dev
v0.6.0


I had trained the same model in a single-node, multi-GPU environment with

2.11.0
0.5.1.dev
v0.7.2

so I hadn't considered this possibility. I'll roll back these packages with bigdl==2.2.0 and check the outcome.
hkvision commented 1 year ago

We verified that this issue is actually caused by TensorFlow 2.11.0; older versions such as 2.9.0 don't have this problem.

Since 2.11.0 behaves differently from 2.9.0, we may need further testing before we can support 2.11.0. To avoid the dependency issue discussed above, a workaround is to first install tensorflow 2.9.0 together with tensorflow-serving-api 2.9.0; then installing tensorflow_ranking won't force you onto tensorflow 2.11.0.
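As a sketch of that install order (hypothetical command sequence; adjust package index and environment as needed):

```shell
# Pin TensorFlow 2.9.0 and the matching serving API first...
pip install tensorflow==2.9.0 tensorflow-serving-api==2.9.0

# ...then tensorflow_ranking will not force an upgrade to TensorFlow 2.11.0.
pip install tensorflow_ranking
```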

@nyamashi Are the above explanations clear to you? Any further input from your side?

nyamashi commented 1 year ago

I've confirmed with the rolled-back package versions. The outcome is:

{'validation_ndcg@5': 0.32869571447372437, 'validation_ndcg@10': 0.40685272216796875, 'validation_ndcg': 0.47980841994285583, 'validation_MAP@5': 0.27017077803611755, 'validation_MAP@10': 0.3045196235179901, 'validation_MAP': 0.3269171416759491}

The values are quite similar to the 2022/05 result in https://github.com/intel-analytics/BigDL/issues/7386#issuecomment-1411618802, so this issue is definitely caused by tf and/or tf-related packages.

hkvision commented 1 year ago

@nyamashi Thanks for confirming! You can keep using tensorflow 2.9.0 for now, and we have opened a separate issue (https://github.com/intel-analytics/BigDL/issues/7402) to track our support for tensorflow 2.11.0. Is that OK on your side? 😄

nyamashi commented 1 year ago

@hkvision
👍 Yes, I'm OK! I'll use tensorflow==2.9.0 for my project with bigdl==2.2.0 and keep watching https://github.com/intel-analytics/BigDL/issues/7402. I really appreciate your support!

hkvision commented 1 year ago

Sure, no problem! Thank you for pointing this out; we actually weren't aware of it before 😂 We'll notify you once we're done with TensorFlow 2.11 support :)