intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

java.lang.IllegalArgumentException in image classification #1237

Closed — ZeweiChen11 closed this 5 years ago

ZeweiChen11 commented 6 years ago

We hit errors when using examples/imageclassification/Predict.scala to predict with Inception v1 on the ImageNet validation set: it reported java.lang.IllegalArgumentException for 10k images and java.lang.ArrayIndexOutOfBoundsException for 5k images. Predicting 1000 images passes.

Execution script:

#!/bin/sh
master="local[28]"
modelPath=/mnt/disk1/analytics-zoo-dataset/imageclassification/analytics-zoo_inception-v1_imagenet_0.1.0
imagePath=/mnt/disk1/analytics-zoo-dataset/imageclassification/imagenet/
ZOO_HOME=/root/analytics-zoo
ZOO_JAR_PATH=${ZOO_HOME}/dist/lib/analytics-zoo-0.1.0-SNAPSHOT-jar-with-dependencies.jar
spark-submit \
--verbose \
--master $master \
--conf spark.executor.cores=28 \
--conf spark.driver.maxResultSize=6g \
--total-executor-cores 28 \
--driver-memory 200g \
--executor-memory 40g \
--class com.intel.analytics.zoo.examples.imageclassification.Predict \
${ZOO_JAR_PATH} -f $imagePath --model $modelPath --partition 28 --topN 5

error when predicting 10000 images:

2018-05-24 15:07:37 INFO  ThreadPool$:79 - Set mkl threads to 1 on thread 1
2018-05-24 15:07:39 INFO  Engine$:103 - Auto detect executor number and executor cores number
2018-05-24 15:07:39 INFO  Engine$:105 - Executor number is 1 and executor cores number is 28
2018-05-24 15:07:39 INFO  Engine$:373 - Find existing spark context. Checking the spark conf...
[Stage 0:===============>                                          (3 + 8) / 11]2018-05-24 15:10:57 ERROR Executor:91 - Exception in task 3.0 in stage 0.0 (TID 3)
Layer info: ImageClassifier[analytics-zoo_inception-v1_imagenet_0.1.0]/SpatialConvolution[conv1/7x7_s2](3 -> 64, 7 x 7, 2, 2, 3, 3)
java.lang.IllegalArgumentException: requirement failed: input channel size 2 is not the same as nInputPlane 3
        at scala.Predef$.require(Predef.scala:224)
        at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:262)
        at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:54)
        at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:243)
        at com.intel.analytics.bigdl.nn.StaticGraph.updateOutput(StaticGraph.scala:59)
        at com.intel.analytics.zoo.models.common.ZooModel.updateOutput(ZooModel.scala:79)
        at com.intel.analytics.zoo.models.common.ZooModel.updateOutput(ZooModel.scala:79)
        at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:243)
        at com.intel.analytics.bigdl.optim.Predictor$$anonfun$predictSamples$1.apply(Predictor.scala:67)
        at com.intel.analytics.bigdl.optim.Predictor$$anonfun$predictSamples$1.apply(Predictor.scala:66)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:800)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at com.intel.analytics.bigdl.optim.Predictor$.predictImageBatch(Predictor.scala:48)

error when predicting 5000 images:

[Stage 0:>                                                          (0 + 4) / 5]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy$mcF$sp(TensorNumeric.scala:721)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:715)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:503)
        at com.intel.analytics.bigdl.dataset.MiniBatch$.copy(MiniBatch.scala:460)
        at com.intel.analytics.bigdl.dataset.MiniBatch$.copyWithPadding(MiniBatch.scala:380)
        at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:209)
        at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:111)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:348)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
jason-dai commented 6 years ago

@jenniew please take a look

BhagyasriYella commented 6 years ago

Hi, I am also facing the same issue when trying to run the object detection code with Jupyter.

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark
export ANALYTICS_ZOO_HOME=/root/Desktop/analytics-zoo/dist
MASTER=local[*] ${ANALYTICS_ZOO_HOME}/bin/jupyter-with-zoo.sh \
--master ${MASTER} \
--driver-cores 2 \
--driver-memory 8g \
--total-executor-cores 2 \
--executor-cores 2 \
--executor-memory 8g

As soon as I press enter, I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
        at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:242)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:241)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:117)
        at org.apache.spark.launcher.Main.main(Main.java:86)

What could be the possible reason?

@jason-dai Please help me out.

jason-dai commented 6 years ago

@dding3 please take a look.

dding3 commented 6 years ago

@BhagyasriYella Did you make any changes to jupyter-with-zoo.sh? If not, could you please check whether pyspark works? Run ${SPARK_HOME}/bin/pyspark, then from pyspark import SparkContext.

dding3 commented 6 years ago

Usually the pyspark does not support any application options exception is caused by the options not being passed properly.
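To make this concrete, here is a simplified model (not Spark's actual parser) of how the launcher splits a command line: it consumes recognized Spark options from the front, and everything from the first unrecognized token onward is treated as application arguments. The pyspark shell accepts no application arguments, so a single stray token (for example a literal `\` left over from a mangled multi-line paste) makes every option after it look like an application option and triggers exactly this error. The option set below is a small illustrative subset.

```python
# Hypothetical, simplified model of SparkSubmitCommandBuilder's argument split.
SPARK_OPTIONS_WITH_VALUE = {
    "--master", "--driver-cores", "--driver-memory",
    "--total-executor-cores", "--executor-cores", "--executor-memory",
}

def split_launcher_args(argv):
    """Consume recognized Spark options from the front; the rest of the
    command line falls through as application arguments."""
    spark_opts = []
    i = 0
    while i + 1 < len(argv) and argv[i] in SPARK_OPTIONS_WITH_VALUE:
        spark_opts.append((argv[i], argv[i + 1]))
        i += 2
    return spark_opts, argv[i:]  # tail = application arguments

# Clean invocation: every token is consumed as a Spark option.
_, app_args = split_launcher_args(["--master", "local[*]", "--driver-memory", "8g"])
print(app_args)  # []

# Mangled paste: the stray "\" stops option parsing, so the remaining
# options fall through as application arguments, which pyspark rejects.
_, app_args = split_launcher_args(["--master", "local[*]", "\\", "--driver-cores", "2"])
print(app_args)  # ['\\', '--driver-cores', '2']
```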

BhagyasriYella commented 6 years ago

@dding3 I did not make any changes to jupyter-with-zoo.sh. I am new to this, so please correct me if I am wrong. I tried ${SPARK_HOME}/bin/pyspark and it gave me the following:

bigdatapoc01:~ # export SPARK_HOME=/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark
bigdatapoc01:~ # ${SPARK_HOME}/bin/pyspark
Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/context.py", line 33, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "/root/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 18, in <module>
    from pydoc import pager
  File "/root/anaconda3/lib/python3.6/pydoc.py", line 59, in <module>
    import inspect
  File "/root/anaconda3/lib/python3.6/inspect.py", line 334, in <module>
    Attribute = namedtuple('Attribute', 'name kind defining_class object')
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/serializers.py", line 381, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

>>> from pyspark import SparkContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/context.py", line 33, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "/root/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 18, in <module>
    from pydoc import pager
  File "/root/anaconda3/lib/python3.6/pydoc.py", line 59, in <module>
    import inspect
  File "/root/anaconda3/lib/python3.6/inspect.py", line 334, in <module>
    Attribute = namedtuple('Attribute', 'name kind defining_class object')
  File "/opt/cloudera/parcels/CDH-5.13.2-1.cdh5.13.2.p0.3/lib/spark/python/pyspark/serializers.py", line 381, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

dding3 commented 6 years ago

I am afraid there is something wrong with your Spark environment, as there is an exception when you start pyspark. We need to fix that before running the object detection notebook.

I noticed you are using Python 3.6. What's your Spark version? It can be checked with ${SPARK_HOME}/bin/spark-submit --version. Spark <= 2.1.0 is not compatible with Python 3.6; I found a similar issue when running Spark <= 2.1 with Python 3.6: https://stackoverflow.com/questions/42349980/unable-to-run-pyspark
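The TypeError above can be reproduced without Spark at all. In Python 3.6, collections.namedtuple's verbose/rename/module parameters became keyword-only, and older PySpark's serializers.py re-created the function with types.FunctionType, copying __defaults__ but not __kwdefaults__ (the keyword-only defaults), so every plain namedtuple(...) call started failing. A minimal stand-in below uses a hypothetical make_record function, not PySpark's actual code:

```python
import types

def make_record(typename, field_names, *, verbose=False, rename=False, module=None):
    """Stand-in for Python 3.6's namedtuple signature: three keyword-only
    parameters, each with a default."""
    return (typename, field_names, verbose, rename, module)

# Copy the function the way older PySpark's _hijack_namedtuple did:
# __defaults__ is carried over, but __kwdefaults__ is not, so the
# keyword-only parameters silently become *required*.
broken_copy = types.FunctionType(
    make_record.__code__, make_record.__globals__,
    make_record.__name__, make_record.__defaults__, make_record.__closure__,
)

try:
    broken_copy("Attribute", "name kind")
except TypeError as e:
    print(e)  # ... missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

# Restoring __kwdefaults__ (roughly what the SPARK-19019 fix amounts to) repairs it:
broken_copy.__kwdefaults__ = make_record.__kwdefaults__
print(broken_copy("Attribute", "name kind")[2:])  # (False, False, None)
```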

zhichao-li commented 6 years ago

My wild guess is a Python version problem. It's worth trying Python 2.7 or 3.5.

hkvision commented 6 years ago

Hi @BhagyasriYella The error you mentioned, TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module', is due to the incompatibility of earlier versions of Spark with Python 3.6. Please see this issue for more discussion: https://issues.apache.org/jira/browse/SPARK-19019 This has been fixed in Spark 1.6.4, 2.0.3, 2.1.1 and 2.2.0. If you are using Python 3.6, it is recommended that you use Spark >= 2.2.0. Would you mind switching your Spark version and having another try? Thanks.

jenniew commented 6 years ago

The original issue is caused by a BigDL issue: https://github.com/intel-analytics/BigDL/issues/2558
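For context, the linked bug sits in the MiniBatch copy path visible in both stack traces. Below is a rough, hypothetical illustration of that class of failure, not BigDL's actual code: when flat samples are packed into a contiguous buffer sized for a fixed per-sample length, a wrong-sized sample either overruns the buffer (the ArrayIndexOutOfBoundsException) or shifts later samples' element boundaries, so the first conv layer can see a wrong channel count (the "input channel size ... is not the same as nInputPlane 3" failure).

```python
def pack_batch(samples, sample_len):
    """Pack flat per-sample arrays into one contiguous batch buffer,
    assuming every sample holds exactly sample_len elements.
    (Hypothetical sketch, not BigDL's MiniBatch code.)"""
    buf = [0.0] * (len(samples) * sample_len)
    for i, sample in enumerate(samples):
        for j, v in enumerate(sample):
            # Mirrors System.arraycopy: an oversized sample walks past the
            # region reserved for it, and off the end of the buffer when it
            # is the last sample in the batch.
            buf[i * sample_len + j] = v
    return buf

ok = pack_batch([[1.0] * 6, [2.0] * 6], sample_len=6)
print(len(ok))  # 12

try:
    pack_batch([[1.0] * 6, [2.0] * 8], sample_len=6)  # last sample too long
except IndexError as e:
    print("overflow:", e)  # the Python analogue of ArrayIndexOutOfBoundsException
```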

BhagyasriYella commented 6 years ago

@dding3 Thanks Ding, it worked after I updated the Spark version. Thanks a lot.

dding3 commented 6 years ago

You are very welcome. We are glad to help :)

ZeweiChen11 commented 6 years ago

The same issue happens in examples/nnframes/inference. With 50K ImageNet val images:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 2.0 failed 4 times, most recent failure: Lost task 19.3 in stage 2.0 (TID 27, emr-worker-4.cluster-74716, executor 1): java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy$mcF$sp(TensorNumeric.scala:721)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:715)
        at com.intel.analytics.bigdl.tensor.TensorNumericMath$TensorNumeric$NumericFloat$.arraycopy(TensorNumeric.scala:503)
        at com.intel.analytics.bigdl.dataset.MiniBatch$.copy(MiniBatch.scala:460)
        at com.intel.analytics.bigdl.dataset.MiniBatch$.copyWithPadding(MiniBatch.scala:380)
        at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:209)
        at com.intel.analytics.bigdl.dataset.ArrayTensorMiniBatch.set(MiniBatch.scala:111)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:348)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:800)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)

        at org.apache.spark.sql.catalyst.expressions.GeneratedClass

With 10K ImageNet val images:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 15, emr-worker-3.cluster-74716, executor 2): Layer info: StaticGraph[GoogleNet]/SpatialConvolution[conv1/7x7_s2](3 -> 64, 7 x 7, 2, 2, 3, 3)
java.lang.IllegalArgumentException: requirement failed: input channel size 30 is not the same as nInputPlane 3
        at scala.Predef$.require(Predef.scala:224)
        at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:262)
        at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:54)
        at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
        at com.intel.analytics.bigdl.nn.StaticGraph.updateOutput(StaticGraph.scala:59)
        at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:257)
        at com.intel.analytics.zoo.pipeline.nnframes.NNModel$$anonfun$2$$anonfun$apply$1$$anonfun$4.apply(NNEstimator.scala:531) 
jason-dai commented 5 years ago

There is a bug in BigDL 0.6 that was fixed in 0.7; please try Analytics Zoo 0.3.0 with BigDL 0.7.1 (https://analytics-zoo.github.io/master/#release-download/#release-030)