hkvision opened 1 year ago
Does this happen after direct training, or after loading a model and then training?
Still asking for more info.
A possible doubt: when converting a Spark DataFrame to XShards, if we use batch_size as the shard size, each sub-partition becomes smaller; could this affect the calculation of order-sensitive metrics such as NDCG and MAP? Related PR: https://github.com/intel-analytics/BigDL/pull/6879
Update: this won't affect the result, since whether the data is one large partition or several sub-partitions, we eventually combine all the data on each worker into a single ndarray -> TFDataset before passing it to model.evaluate. Using batch_size as the shard size mainly benefits predict, since only one sub-partition of batch size is predicted at a time.
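For illustration, here is a rough sketch (not BigDL's actual implementation) of that idea: however many sub-partitions a worker holds, they are concatenated into one ndarray before being wrapped as a tf.data.Dataset for model.evaluate, so order-sensitive metrics still see the worker's full data.
import numpy as np
import tensorflow as tf

def evaluate_on_worker(model, sub_partitions, batch_size):
    # sub_partitions: list of (features, labels) ndarray pairs held by this worker.
    # Merging them first means the shard size does not change the data the
    # metrics are computed on; it only changes how the data was stored.
    x = np.concatenate([features for features, _ in sub_partitions], axis=0)
    y = np.concatenate([labels for _, labels in sub_partitions], axis=0)
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)
    return model.evaluate(dataset, return_dict=True)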
@hkvision Thanks for adding the issue!
Here is the case when loading the trained model before calling evaluate(); the results are:
[{'validation_ndcg@5': 0.3261723816394806, 'validation_ndcg@10': 0.4041267931461334, 'validation_ndcg': 0.4779358208179474, 'validation_MAP@5': 0.26819130778312683, 'validation_MAP@10': 0.30234959721565247, 'validation_MAP': 0.32502391934394836}]
{'validation_ndcg@5': 10.573517799377441, 'validation_ndcg@10': 13.0294771194458, 'validation_ndcg': 15.378532409667969, 'validation_MAP@5': 8.696903228759766, 'validation_MAP@10': 9.779326438903809, 'validation_MAP': 10.503388404846191}
However, the results divided by the number of nodes look close:
print(NUM_NODES)
for k, v in stats.items():
    print(k, v / NUM_NODES)
32
validation_ndcg@5 0.33042243123054504
validation_ndcg@10 0.4071711599826813
validation_ndcg 0.480579137802124
validation_MAP@5 0.2717782258987427
validation_MAP@10 0.305603951215744
validation_MAP 0.3282308876514435
The code is just as follows:
# est.load(model_path)  # for < v2.2.0
est.load_checkpoint(model_path)  # for v2.2.0
stats = est.evaluate(
    sdf_valid,
    batch_size=batch_size,
    feature_cols=feature_cols,
    label_cols=label_cols,
    num_steps=1,
    verbose=False
)
From our tests, it seems that a model defined in pure tf.keras with NDCG doesn't hit this problem. Test example: https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/NCF/tf_train_spark_dataframe.py Result (it seems that during evaluate the value is accumulated, but at the very end TF automatically divides it by the number of workers):
95/97 [============================>.] - ETA: 0s - loss: 7.0944 - accuracy: 2.0461 - auc: 4.9222 - precision: 2.0317 - recall: 9.9705 - NDCG@5: 2.0325
96/97 [============================>.] - ETA: 0s - loss: 7.0945 - accuracy: 2.0454 - auc: 4.9190 - precision: 2.0310 - recall: 9.9706 - NDCG@5: 2.0318
97/97 [==============================] - ETA: 0s - loss: 7.0945 - accuracy: 2.0455 - auc: 4.9188 - precision: 2.0310 - recall: 9.9708 - NDCG@5: 2.0318
97/97 [==============================] - 15s 99ms/step - loss: 0.7094 - accuracy: 0.2045 - auc: 0.4919 - precision: 0.2031 - recall: 0.9971 - NDCG@5: 0.2032
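A toy back-of-the-envelope check of that observation (the numbers are read off the log above; this is not a description of TF internals): the running progress-bar values look like sums across the workers, and the final summary line matches them divided by the worker count.
num_workers = 10                # assumed from the log: 2.0318 / 10 ≈ 0.2032
accumulated_ndcg5 = 2.0318      # NDCG@5 shown while evaluation is still running
final_ndcg5 = accumulated_ndcg5 / num_workers
print(round(final_ndcg5, 4))    # 0.2032, the value on the final summary line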
But the issue exists when using tfrs; we are still looking into this. Test example: https://github.com/intel-analytics/BigDL/blob/main/python/friesian/example/listwise_ranking/listwise_ranking.py
Possible directions for this issue: roll back to the BigDL version from 2022/05 to see whether the issue already existed; check the tf/tfrs/tfr versions.
Hi @nyamashi, are you still using the package versions below?
import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_ranking as tfr
print(tf.__version__)
print(tfr.__version__)
print(tfrs.__version__)
2.7.0
0.5.0.dev
v0.6.0
@hkvision Thanks for your support!
I've confirmed the tf-related versions:
import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_ranking as tfr
print(tf.__version__)
print(tfr.__version__)
print(tfrs.__version__)
2.11.0
0.5.1.dev
v0.7.2
- 2022/05:
2.9.0
0.5.0.dev
v0.6.0
I had trained the same model in a single-node, multi-GPU environment with
2.11.0 0.5.1.dev v0.7.2
so I didn't realize this possibility. I'll roll back these packages while keeping bigdl==2.2.0 and check the outcome.
We verified that this issue is actually caused by TensorFlow 2.11.0; older versions like 2.9.0 don't have this problem. The behavior is as follows:
95/97 [============================>.] - ETA: 0s - loss: 7.0944 - accuracy: 2.0461 - auc: 4.9222 - precision: 2.0317 - recall: 9.9705 - NDCG@5: 2.0325
96/97 [============================>.] - ETA: 0s - loss: 7.0945 - accuracy: 2.0454 - auc: 4.9190 - precision: 2.0310 - recall: 9.9706 - NDCG@5: 2.0318
97/97 [==============================] - ETA: 0s - loss: 7.0945 - accuracy: 2.0455 - auc: 4.9188 - precision: 2.0310 - recall: 9.9708 - NDCG@5: 2.0318
97/97 [==============================] - 15s 99ms/step - loss: 0.7094 - accuracy: 0.2045 - auc: 0.4919 - precision: 0.2031 - recall: 0.9971 - NDCG@5: 0.2032
Another important issue we detected: when you install tensorflow_ranking (e.g. 0.5.0), it depends on tensorflow-serving-api, which in turn relies on tensorflow. If tensorflow-serving-api is not installed in the environment, pip will automatically install its latest version 2.11.0, which forces you to upgrade tensorflow to 2.11.0. This leads to the unreasonable behavior that installing tensorflow_ranking 0.5.0 requires tensorflow 2.11.0, even though tensorflow 2.11.0 had not even been released when tensorflow_ranking 0.5.0 came out... It looks like a tensorflow issue of not restricting the upper version bound. I previously encountered a similar issue when installing tensorflow 2.6.0: https://github.com/intel-analytics/BigDL/pull/7106 So I believe you were forced to upgrade tensorflow to 2.11.0, right?
Since 2.11.0 behaves differently from 2.9.0, we may need further testing to support 2.11.0. To avoid the dependency issue discussed above, as a workaround, we can first install tensorflow 2.9.0 together with tensorflow-serving-api 2.9.0; then installing tensorflow_ranking won't force an upgrade to tensorflow 2.11.0.
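As a quick sanity check for that workaround (this snippet is only a suggestion, not part of the original report), you can verify after installation that tensorflow_ranking did not drag tensorflow up to 2.11.0:
import tensorflow as tf
import tensorflow_ranking as tfr

# If tensorflow-serving-api 2.9.0 was installed before tensorflow_ranking,
# tensorflow should still be at 2.9.x here.
assert tf.__version__.startswith("2.9."), (
    "tensorflow was upgraded (probably to 2.11.0) while installing tensorflow_ranking"
)
print(tf.__version__, tfr.__version__)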
@nyamashi Are the above explanations clear to you? Any further inputs from your side?
I've confirmed with load_checkpoint() and evaluate() as in https://github.com/intel-analytics/BigDL/issues/7386#issuecomment-1411618802. The outcome is:
{'validation_ndcg@5': 0.32869571447372437, 'validation_ndcg@10': 0.40685272216796875, 'validation_ndcg': 0.47980841994285583, 'validation_MAP@5': 0.27017077803611755, 'validation_MAP@10': 0.3045196235179901, 'validation_MAP': 0.3269171416759491}
The values are quite similar to the 2022/05 result in https://github.com/intel-analytics/BigDL/issues/7386#issuecomment-1411618802, so this issue is definitely caused by tf and/or tf-related packages.
@nyamashi Thanks for confirming this! So you may keep using tensorflow 2.9.0 for the moment, and we will track support for tensorflow 2.11.0 in a separate issue (https://github.com/intel-analytics/BigDL/issues/7402). Is that OK on your side? 😄
@hkvision
👍 Yes, I'm OK! I'll use tensorflow==2.9.0 for my project with bigdl==2.2.0 and keep watching https://github.com/intel-analytics/BigDL/issues/7402.
I really appreciate your support!
Sure, no problem! Thank you for pointing out this issue; we actually weren't aware of it before 😂 We will notify you when we are done with tensorflow 2.11 :)
Reported by the customer: after upgrading the BigDL package from 2.1.0b20220519 to 2.2.0, tf2.Estimator.evaluate() does not look correct.
@lalalapotter @sgwhat Please check whether we changed anything from 20220519 to 2.2.0.