intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

[Orca] 2.2.0 doesn't converge for tensorflow2 estimator evaluate #7386

Open hkvision opened 1 year ago

hkvision commented 1 year ago

Reported by the customer: after upgrading the BigDL package from 2.1.0b20220519 to 2.2.0, the results of tf2.Estimator.evaluate() look incorrect.

2.1.0b20220519 (looks correct: all values are less than one)
[{'validation_ndcg@5': 0.19312456250190735, 'validation_ndcg@10': 0.27864694595336914, 'validation_ndcg': 0.39072680473327637, 'validation_MAP@5': 0.1501678079366684, 'validation_MAP@10': 0.18642768263816833, 'validation_MAP': 0.2198168933391571}]

2.2.0 (incorrect: values exceed one)
{'validation_ndcg@5': 4.35667610168457, 'validation_ndcg@10': 6.832152843475342, 'validation_ndcg': 11.26124095916748, 'validation_MAP@5': 3.294984817504883, 'validation_MAP@10': 4.330305099487305, 'validation_MAP': 5.608747959136963}

@lalalapotter @sgwhat Could you check whether we changed anything between 20220519 and 2.2.0?

sgwhat commented 1 year ago

Does this happen after direct training, or after loading a model and then training?

hkvision commented 1 year ago

> Does this happen after direct training, or after loading a model and then training?

Still waiting for more information from the customer.

hkvision commented 1 year ago

A possible suspect: when converting a Spark DataFrame to XShards, if we use batch_size as the shard size, each sub-partition is smaller. Could this affect the calculation of order-sensitive metrics such as NDCG and MAP? Related PR: https://github.com/intel-analytics/BigDL/pull/6879


Update: this won't affect the result, since whether the data is one large partition or several sub-partitions, we eventually combine all the data on each worker into a single ndarray -> TFDataset and pass it to model.evaluate. Using batch_size as the shard size mainly benefits predict, since only one sub-partition of size batch_size is predicted at a time.
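For context on why partitioning was a plausible suspect: NDCG is sensitive to the order of items within each ranked list, so splitting a list across sub-partitions could in principle change the score. A minimal pure-Python sketch of NDCG@k (illustrative only; not the tensorflow_ranking implementation):

```python
import math

def dcg_at_k(relevances, k):
    # relevances: graded relevance scores, in the order the model ranked the items
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a correct NDCG always lies in [0, 1].
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
```

A perfectly ordered list scores 1.0 and any valid NDCG is at most 1.0, which is why the 2.2.0 values above (e.g. 4.35 for NDCG@5) are clearly wrong rather than merely different.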

nyamashi commented 1 year ago

@hkvision Thanks for adding the issue!

Here is the case when loading the trained model before evaluate(); the code is as follows:

# est.load(model_path) # for < v2.2.0
est.load_checkpoint(model_path) # for v2.2.0

stats = est.evaluate(
    sdf_valid,
    batch_size=batch_size,
    feature_cols=feature_cols,
    label_cols=label_cols,
    num_steps=1, 
    verbose=False
)
hkvision commented 1 year ago

From our tests, defining a model in pure tf.keras with NDCG does not hit this problem. Test example: https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/NCF/tf_train_spark_dataframe.py Result: (it seems the metric values are accumulated during evaluate, but at the very end TF automatically divides by the number of workers)

95/97 [============================>.] - ETA: 0s - loss: 7.0944 - accuracy: 2.0461 - auc: 4.9222 - precision: 2.0317 - recall: 9.9705 - NDCG@5: 2.0325
96/97 [============================>.] - ETA: 0s - loss: 7.0945 - accuracy: 2.0454 - auc: 4.9190 - precision: 2.0310 - recall: 9.9706 - NDCG@5: 2.0318
97/97 [==============================] - ETA: 0s - loss: 7.0945 - accuracy: 2.0455 - auc: 4.9188 - precision: 2.0310 - recall: 9.9708 - NDCG@5: 2.0318
97/97 [==============================] - 15s 99ms/step - loss: 0.7094 - accuracy: 0.2045 - auc: 0.4919 - precision: 0.2031 - recall: 0.9971 - NDCG@5: 0.2032
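The log above is consistent with per-worker metric values being summed during evaluation and only divided by the worker count at the very end (note 7.0945 / 0.7094 ≈ 10). A toy sketch of that hypothesis; the worker count of 10 is an assumption inferred from the ratio, not confirmed from the code:

```python
num_workers = 10            # assumed: inferred from 7.0945 / 0.7094 ≈ 10
per_worker_loss = 0.70945   # the value each worker computes locally

# During evaluation, the progress bar shows the running sum across workers...
accumulated = per_worker_loss * num_workers

# ...and only the final summary line divides by the worker count.
final_reported = accumulated / num_workers

print(accumulated)      # matches the inflated in-progress log lines
print(final_reported)   # matches the final log line
```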

But the issue persists when using tfrs; we are still looking into it. Test example: https://github.com/intel-analytics/BigDL/blob/main/python/friesian/example/listwise_ranking/listwise_ranking.py

Possible directions for this issue: roll back to the BigDL release from 2022/05 to see whether the issue exists there; check the tf/tfrs/tfr versions.

hkvision commented 1 year ago

Hi @nyamashi, are you still using the package versions below?

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_ranking as tfr

print(tf.__version__)
print(tfr.__version__)
print(tfrs.__version__)

2.7.0
0.5.0.dev
v0.6.0
nyamashi commented 1 year ago

@hkvision Thanks for your support!

I've confirmed the tf-related versions:

print(tf.__version__)
print(tfr.__version__)
print(tfrs.__version__)

- current:

2.11.0
0.5.1.dev
v0.7.2

- 2022/05:

2.9.0
0.5.0.dev
v0.6.0


I had trained the same model in a single-node, multi-GPU environment with

2.11.0
0.5.1.dev
v0.7.2

so I hadn't considered this possibility. I'll roll back these packages with bigdl==2.2.0 and check the outcome.
hkvision commented 1 year ago

We verified that this issue is actually caused by TensorFlow 2.11.0; older versions such as 2.9.0 don't have this problem.

Since 2.11.0 behaves differently from 2.9.0, we may need further testing before we can support 2.11.0. To avoid the dependency issue discussed above, a workaround is to first install tensorflow 2.9.0 together with tensorflow-serving-api 2.9.0; then installing tensorflow_ranking won't force you onto tensorflow 2.11.0.
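As a sketch of that install order (hypothetical command sequence; adjust package index and environment as needed):

```shell
# Pin TensorFlow 2.9.0 and the matching serving API first...
pip install tensorflow==2.9.0 tensorflow-serving-api==2.9.0

# ...then tensorflow_ranking will not force an upgrade to TensorFlow 2.11.0.
pip install tensorflow_ranking
```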

@nyamashi Are the above explanations clear to you? Any further input from your side?

nyamashi commented 1 year ago

I've confirmed with the rolled-back package versions. The outcome is:

{'validation_ndcg@5': 0.32869571447372437, 'validation_ndcg@10': 0.40685272216796875, 'validation_ndcg': 0.47980841994285583, 'validation_MAP@5': 0.27017077803611755, 'validation_MAP@10': 0.3045196235179901, 'validation_MAP': 0.3269171416759491}

The values are quite similar to the 2022/05 result in https://github.com/intel-analytics/BigDL/issues/7386#issuecomment-1411618802, so this issue is definitely caused by tf and/or tf-related packages.

hkvision commented 1 year ago

@nyamashi Thanks for confirming! You can keep using tensorflow 2.9.0 for now, and we have opened a separate issue (https://github.com/intel-analytics/BigDL/issues/7402) to track our support for tensorflow 2.11.0. Is that OK on your side? 😄

nyamashi commented 1 year ago

@hkvision
👍 Yes, I'm OK! I'll use tensorflow==2.9.0 for my project with bigdl==2.2.0 and keep watching https://github.com/intel-analytics/BigDL/issues/7402. I really appreciate your support!

hkvision commented 1 year ago

Sure, no problem! Thank you for pointing this out; we actually weren't aware of it before 😂 We'll notify you once we're done with TensorFlow 2.11 support :)