intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Can we support different batch sizes for training and validation in TFDataSet #782

Open jenniew opened 4 years ago

jenniew commented 4 years ago

Our current TFDataSet only uses the same batch size for training data and validation data, but in some cases (e.g. NCF) there is a need to set different batch sizes for training and validation. Can we support different batch_size values?

jenniew commented 4 years ago

The current Python zoo.pipeline.estimator.Estimator and the Keras-style model also support only one batch size for both training data and validation data during training.
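For illustration only, a minimal sketch of the behaviour being requested. The `val_batch_size` argument is hypothetical and does not exist in the current API, the `from_rdd` call is paraphrased, and `train_rdd` / `val_rdd` stand for RDDs built elsewhere:

```python
# Hypothetical sketch only: `val_batch_size` does not exist in the current
# TFDataset API; train_rdd / val_rdd stand for RDDs of samples built elsewhere.
from zoo.tfpark import TFDataset   # import path in recent analytics-zoo versions

test_neg = 99                      # negatives sampled per positive in NCF evaluation

dataset = TFDataset.from_rdd(
    train_rdd,
    batch_size=2048,               # batch size used for training today
    val_rdd=val_rdd,
    val_batch_size=test_neg + 1,   # requested: one user's records form one batch
)
```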

yangw1234 commented 4 years ago

@jenniew @jason-dai I do not think supporting different batch sizes alone can solve the NCF notebook evaluation problem.

In DistriOptimizer we repartition the data RDD into #nodes partitions so that we can use zipPartitions with the models to do validation. So even if we set the validation dataset's batch size, the original records are still randomly distributed across different batches.
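A small PySpark illustration of the problem (not the actual Scala DistriOptimizer code path): each user contributes test_neg + 1 contiguous records, but after repartitioning the group is split across partitions, and hence across validation batches:

```python
# PySpark demo: a user's (1 positive + test_neg negatives) group is scattered
# once the RDD is repartitioned to the number of nodes.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
test_neg = 3

# (user_id, is_positive): one positive followed by test_neg negatives per user
records = [(user, i == 0) for user in range(4) for i in range(test_neg + 1)]
rdd = sc.parallelize(records, numSlices=4)   # each partition holds one user's group

# DistriOptimizer-style repartition to the number of nodes (pretend 2 here)
for p, part in enumerate(rdd.repartition(2).glom().collect()):
    print("partition", p, part)   # each user's 4 records end up split across partitions
```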

I think it might be easier to find a way to implement the metrics (hit ratio or NDCG) without assuming each batch has exactly one positive example.
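For example, here is a plain-Python sketch of a group-based metric: predictions are grouped by user before computing HitRatio@k and NDCG@k, so nothing is assumed about how the records were batched or partitioned. The function name and the (user, score, label) record layout are illustrative, not the existing ValidationMethod API:

```python
import math
from collections import defaultdict

def hit_ratio_and_ndcg(records, k=10):
    """records: iterable of (user_id, score, label), where label is 1 for the
    positive item and 0 for sampled negatives. Grouping is done here, so the
    metric does not care how the records were batched or partitioned."""
    by_user = defaultdict(list)
    for user, score, label in records:
        by_user[user].append((score, label))

    hits, ndcgs = [], []
    for items in by_user.values():
        # rank this user's items by descending score
        ranked = sorted(items, key=lambda x: -x[0])
        rank_of_pos = next(
            (i for i, (_, label) in enumerate(ranked) if label == 1), None)
        if rank_of_pos is not None and rank_of_pos < k:
            hits.append(1.0)
            ndcgs.append(1.0 / math.log2(rank_of_pos + 2))
        else:
            hits.append(0.0)
            ndcgs.append(0.0)
    n = len(by_user)
    return sum(hits) / n, sum(ndcgs) / n
```

As long as each user still has exactly one positive in the test set, this should reproduce the original per-user HR@K / NDCG@K values regardless of how the records were batched.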

jenniew commented 4 years ago

Yes, the order cannot be kept if we coalesce the validation dataset. If we repartition to #nodes partitions and then add the negatives, is it possible for the optimizer to skip the coalesce when the RDD already has #nodes partitions? Currently we cannot do that in the optimizer. And I think the batch size still needs to be test_neg + 1 each time if we want to follow the original validation design.

We can implement metrics that handle multiple positive examples per batch, but the results will not be the same as the original ones.

For this case, it is hard to get the same metric values as before.

jason-dai commented 4 years ago

It seems that we do have distributed implementations of NDCG and HitRatio (https://github.com/intel-analytics/analytics-zoo/blob/master/zoo/src/main/scala/com/intel/analytics/zoo/models/common/Ranker.scala#L113 and https://github.com/intel-analytics/BigDL/blob/master/spark/dl/src/main/scala/com/intel/analytics/bigdl/optim/ValidationMethod.scala#L883).

jenniew commented 4 years ago

HitRatio is the same as the hit metric in the NCF case, which requires one positive and test_neg negative samples in one batch. The Zoo NDCG requires changing the test labels in the NCF test dataset.

jason-dai commented 4 years ago

> HitRatio is the same as the hit metric in the NCF case, which requires one positive and test_neg negative samples in one batch. The Zoo NDCG requires changing the test labels in the NCF test dataset.

Then how do we guarantee that HitRatio works correctly in BigDL?

jenniew commented 4 years ago

It seems there is no usage of it for validation in the BigDL/Zoo repos.

jenniew commented 4 years ago

Maybe we can pack the test_neg + 1 records as one element before validation, and then use a transformer to unpack them before validating?
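A rough sketch of that idea, assuming each record is a (features, label) numpy pair; `pack_group` / `unpack_element` are made-up helper names, not existing Zoo transformers:

```python
import numpy as np

def pack_group(group):
    """Stack one user's test_neg + 1 records into a single element so that
    repartition/coalesce moves the whole group together."""
    features = np.stack([f for f, _ in group])     # shape: (test_neg + 1, ...)
    labels = np.array([lbl for _, lbl in group])   # shape: (test_neg + 1,)
    return features, labels

def unpack_element(packed):
    """Transformer step applied right before validation: turn one packed
    element back into the original test_neg + 1 (feature, label) records."""
    features, labels = packed
    return list(zip(features, labels))

# usage sketch: grouped_rdd.map(pack_group) before handing data to the
# optimizer, then flatMap(unpack_element) (or an equivalent BigDL Transformer)
# just before the validation metric is computed.
```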