intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

A problem about TFOptimizer. #1223

Closed zzuzayu closed 5 years ago

zzuzayu commented 5 years ago

When I use TFOptimizer to train a TensorFlow model built with slim, I get an error.

x_rdd = sc.parallelize(images)
y_rdd = sc.parallelize(labels)
train_rdd = x_rdd.zip(y_rdd).map(lambda rec_tuple: [rec_tuple[0], np.array(rec_tuple[1])])

dataset = TFDataset.from_rdd(train_rdd,
                             names=["features", "label"],
                             shapes=[[SIZE_W, SIZE_H, 3], [1]],
                             types=[tf.float32, tf.int32])

data_images, data_labels = dataset.tensors
squeezed_labels = tf.squeeze(data_labels)
with slim.arg_scope(resnet_v1.resnet_arg_scope()):
    logits, end_points = resnet_v1.resnet_v1_200(data_images, num_classes=len(label_to_num), is_training=True)

loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=squeezed_labels))

from zoo.pipeline.api.net import TFOptimizer
from bigdl.optim.optimizer import MaxIteration, Adam, MaxEpoch, TrainSummary

optimizer = TFOptimizer(loss, Adam(1e-3))
optimizer.set_train_summary(TrainSummary("/tmp/resnet_v2", "train"))
optimizer.optimize(end_trigger=MaxEpoch(5))

I also ran https://github.com/intel-analytics/analytics-zoo/blob/5212eb75956965fbedc64a0f0bb563bfc0b855b6/pyzoo/zoo/examples/tensorflow/distributed_training/train_lenet.py and got the same error:

Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 44, localhost, executor driver): java.util.concurrent.ExecutionException: Layer info: TFTrainingHelper[44456754]/TFNet[5a094281]
java.lang.IllegalArgumentException: Incompatible shapes: [0] vs. [3]
     [[Node: sparse_softmax_cross_entropy_loss/xentropy/assert_equal/Equal = Equal[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_softmax_cross_entropy_loss/xentropy/Shape_1, sparse_softmax_cross_entropy_loss/xentropy/strided_slice)]]
    at org.tensorflow.Session.run(Native Method)
    at org.tensorflow.Session.access$100(Session.java:48)
    at org.tensorflow.Session$Runner.runHelper(Session.java:298)
    at org.tensorflow.Session$Runner.run(Session.java:248)
    at com.intel.analytics.zoo.pipeline.api.net.TFNet.updateOutput(TFNet.scala:252)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
    at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:264)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:264)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:264)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:202)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: Layer info: TFTrainingHelper[44456754]/TFNet[5a094281]
java.lang.IllegalArgumentException: Incompatible shapes: [0] vs. [3]
     [[Node: sparse_softmax_cross_entropy_loss/xentropy/assert_equal/Equal = Equal[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_softmax_cross_entropy_loss/xentropy/Shape_1, sparse_softmax_cross_entropy_loss/xentropy/strided_slice)]]
    at org.tensorflow.Session.run(Native Method)
    at org.tensorflow.Session.access$100(Session.java:48)
    at org.tensorflow.Session$Runner.runHelper(Session.java:298)
    at org.tensorflow.Session$Runner.run(Session.java:248)
    at com.intel.analytics.zoo.pipeline.api.net.TFNet.updateOutput(TFNet.scala:252)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
    at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:263)
    at com.intel.analytics.zoo.pipeline.api.net.TFTrainingHelper.updateOutput(TFTrainingHelper.scala:100)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:252)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:245)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$4.call(ThreadPool.scala:112)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more

Driver stacktrace:

zzuzayu commented 5 years ago

Sorry, I have already solved it. The logits passed to the loss should be tf.squeeze(logits).
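For anyone hitting the same message, here is a minimal sketch of the shape rule involved. Sparse softmax cross entropy requires the labels' shape to equal the logits' shape without its last (class) dimension. One plausible reading of "Incompatible shapes: [0] vs. [3]" is that the labels were rank 0 and the logits rank 4, but that is an inference from the trace, not something confirmed in this thread. The two helpers below are hypothetical and written in plain Python only to illustrate the rule; they are not TensorFlow or Analytics Zoo APIs, and the batch size, class count and 4-D logits shape are assumed values.

```python
# Pure-Python sketch of the shape rule behind the error; no TensorFlow needed.
# Both helpers are hypothetical illustrations, not library functions.

def squeeze_shape(shape, axis=None):
    """Return the shape that a squeeze would produce.

    With axis=None every size-1 dimension is dropped, as tf.squeeze(x) does.
    With a list of (non-negative) axes, only those dimensions are dropped.
    """
    if axis is None:
        return [d for d in shape if d != 1]
    for a in axis:
        if shape[a] != 1:
            raise ValueError("cannot squeeze axis %d of size %d" % (a, shape[a]))
    return [d for i, d in enumerate(shape) if i not in axis]


def sparse_xent_shapes_ok(labels_shape, logits_shape):
    """Sparse softmax cross entropy needs labels.shape == logits.shape[:-1]."""
    return list(labels_shape) == list(logits_shape[:-1])


batch, num_classes = 8, 10                 # assumed values for illustration
logits_4d = [batch, 1, 1, num_classes]     # assumed unsqueezed network output
labels_2d = [batch, 1]                     # matches shapes=[..., [1]] in TFDataset

labels = squeeze_shape(labels_2d, axis=[1])        # [8]
print(sparse_xent_shapes_ok(labels, logits_4d))    # False: [8] vs. [8, 1, 1]

logits = squeeze_shape(logits_4d, axis=[1, 2])     # [8, 10]
print(sparse_xent_shapes_ok(labels, logits))       # True

# With a batch of one, squeezing every size-1 axis also removes the batch axis.
print(squeeze_shape([1, 1]))                       # []
```

If the logits really are 4-D, passing explicit axes, e.g. tf.squeeze(logits, axis=[1, 2]) and tf.squeeze(data_labels, axis=1), may be safer than a bare tf.squeeze, since it keeps the batch dimension even when a batch holds a single record.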