intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

'Boxed Error' & 'java.lang.OutOfMemoryError: Java heap space' when train the model in WideAndDeep example #1016

Closed: zjdx1998 closed this issue 5 years ago

zjdx1998 commented 5 years ago

I changed the data in the example and modified the column_info accordingly. The other settings are basically the same as in the Wide-And-Deep example. Why does the 'Boxed Error' appear?

The environment versions are as follows:

analytics-zoo = 0.5.1
spark & pyspark = 2.4.3
java = openjdk_1.8

I run the code in Google Colab. When training the model with the code below, the following error is thrown:

%%time
# Boot training process
wide_n_deep.fit(train_data,
        batch_size = 294,
        nb_epoch = 10,
        validation_data = test_data
        )
print("Optimization Done.")

Here is the log:

2019-08-18 15:10:02 ERROR DistriOptimizer$:894 - Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 168.0 failed 1 times, most recent failure: Lost task 0.0 in stage 168.0 (TID 178, localhost, executor driver): java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: Boxed Error
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:284)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:284)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:284)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:212)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
    at scala.concurrent.impl.Promise$.resolver(Promise.scala:59)
    at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:51)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$2.execute(ThreadPool.scala:235)
    at scala.concurrent.impl.Future$.apply(Future.scala:31)
    at scala.concurrent.Future$.apply(Future.scala:494)
    at com.intel.analytics.bigdl.utils.ThreadPool.invoke(ThreadPool.scala:200)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion.updateOutput(ClassNLLCriterion.scala:130)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion.updateOutput(ClassNLLCriterion.scala:69)
    at com.intel.analytics.zoo.pipeline.api.keras.objectives.LossFunction.updateOutput(LossFunction.scala:37)
    at com.intel.analytics.zoo.pipeline.api.keras.objectives.SparseCategoricalCrossEntropy.updateOutput(SparseCategoricalCrossEntropy.scala:64)
    at com.intel.analytics.zoo.pipeline.api.keras.objectives.SparseCategoricalCrossEntropy.updateOutput(SparseCategoricalCrossEntropy.scala:47)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractCriterion.forward(AbstractCriterion.scala:73)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:265)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:255)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:255)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$5.call(ThreadPool.scala:144)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
Caused by: java.lang.AssertionError: assertion failed: curTarget 14 is out of range 1 to 5
    at scala.Predef$.assert(Predef.scala:170)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion$$anonfun$updateOutput$5.apply(ClassNLLCriterion.scala:132)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion$$anonfun$updateOutput$5.apply(ClassNLLCriterion.scala:130)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invoke$2.apply(ThreadPool.scala:194)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    ... 18 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1035)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1017)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:342)
    at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:869)
    at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:363)
    at com.intel.analytics.zoo.pipeline.api.keras.models.KerasNet.fit(Topology.scala:421)
    at com.intel.analytics.zoo.pipeline.api.keras.python.PythonZooKeras.zooFit(PythonZooKeras.scala:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: Boxed Error
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:284)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$8.apply(DistriOptimizer.scala:284)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:284)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:212)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.util.concurrent.ExecutionException: Boxed Error
    at scala.concurrent.impl.Promise$.resolver(Promise.scala:59)
    at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:51)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at scala.concurrent.Promise$class.complete(Promise.scala:55)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$2.execute(ThreadPool.scala:235)
    at scala.concurrent.impl.Future$.apply(Future.scala:31)
    at scala.concurrent.Future$.apply(Future.scala:494)
    at com.intel.analytics.bigdl.utils.ThreadPool.invoke(ThreadPool.scala:200)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion.updateOutput(ClassNLLCriterion.scala:130)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion.updateOutput(ClassNLLCriterion.scala:69)
    at com.intel.analytics.zoo.pipeline.api.keras.objectives.LossFunction.updateOutput(LossFunction.scala:37)
    at com.intel.analytics.zoo.pipeline.api.keras.objectives.SparseCategoricalCrossEntropy.updateOutput(SparseCategoricalCrossEntropy.scala:64)
    at com.intel.analytics.zoo.pipeline.api.keras.objectives.SparseCategoricalCrossEntropy.updateOutput(SparseCategoricalCrossEntropy.scala:47)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractCriterion.forward(AbstractCriterion.scala:73)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:265)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:255)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4$$anonfun$6$$anonfun$apply$2.apply(DistriOptimizer.scala:255)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$1$$anon$5.call(ThreadPool.scala:144)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
Caused by: java.lang.AssertionError: assertion failed: curTarget 14 is out of range 1 to 5
    at scala.Predef$.assert(Predef.scala:170)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion$$anonfun$updateOutput$5.apply(ClassNLLCriterion.scala:132)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion$$anonfun$updateOutput$5.apply(ClassNLLCriterion.scala:130)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invoke$2.apply(ThreadPool.scala:194)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    ... 18 more
hkvision commented 5 years ago

Hi @zjdx1998, I think the error message that matters most is the last one:

Caused by: java.lang.AssertionError: assertion failed: curTarget 14 is out of range 1 to 5
    at scala.Predef$.assert(Predef.scala:170)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion$$anonfun$updateOutput$5.apply(ClassNLLCriterion.scala:132)
    at com.intel.analytics.bigdl.nn.ClassNLLCriterion$$anonfun$updateOutput$5.apply(ClassNLLCriterion.scala:130)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anonfun$invoke$2.apply(ThreadPool.scala:194)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    ... 18 more

In our example on the MovieLens dataset, the ratings (i.e. the labels) range from 1 to 5. It seems your data now contains a label of 14, but when you create the model you are still specifying 5 classes. Could you please check this?
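
For illustration, a minimal sketch of how to verify the label range and keep class_num consistent with it. The DataFrame name df, the "label" column, and the commented-out WideAndDeep call are assumptions based on the example notebook, not code from this issue:

from pyspark.sql import functions as F

# Inspect the label range that will be fed to SparseCategoricalCrossEntropy /
# ClassNLLCriterion; BigDL expects labels in the range 1..class_num.
label_min, label_max = df.agg(F.min("label"), F.max("label")).first()
print("labels range from %s to %s" % (label_min, label_max))

# If the labels are not a contiguous 1..N range, remap them before training.
distinct_labels = sorted(r[0] for r in df.select("label").distinct().collect())
mapping = {old: new for new, old in enumerate(distinct_labels, start=1)}
remap = F.udf(lambda x: mapping[x], "int")
df = df.withColumn("label", remap(F.col("label")))

# Then build the model with a matching number of classes, e.g.
# wide_n_deep = WideAndDeep(class_num=len(mapping), column_info=column_info,
#                           model_type="wide_n_deep")   # hypothetical call, check the API docs

Either remapping the labels or enlarging class_num would keep the criterion's assertion (labels within 1..class_num) from firing.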

zjdx1998 commented 5 years ago

Thanks @hkvision. By the way, I would like to know how to increase the driver memory on Google Colab, because I am now hitting a new error. I tried the following solutions, but none of them worked (see the sketch after the third snippet below):

1.
import os

memory = '32g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

2.
!export SPARK_DRIVER_MEMORY = 32g

3.
sc.stop()
_tconf = sc.getConf()
_tconf.set('spark.driver.memory', '32g')
sc = init_nncontext(_tconf)  # this reports an error that two SparkContexts exist
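
For reference, spark.driver.memory (and the JVM heap it controls) is only read when the driver JVM is launched, so none of the three settings above can take effect on an already-running SparkContext, and `!export` in a notebook cell runs in a throwaway shell that does not affect the Python process. Below is a minimal sketch of setting the memory before anything Spark-related is created; the 8g value is illustrative and has to fit within the Colab VM's actual RAM, and the import path assumes the pip-installed analytics-zoo:

import os

# Run this in the very first cell, before any SparkContext / JVM exists.
# Setting the variables from Python (not via `!export`) makes them visible
# to this notebook process and to the JVM it will launch.
os.environ["SPARK_DRIVER_MEMORY"] = "8g"                               # read by the pip-installed bigdl/zoo launcher
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 8g pyspark-shell"

from zoo.common.nncontext import init_nncontext
sc = init_nncontext("WideAndDeep JobRecommendation")

# Changing spark.driver.memory on a running context, as in snippet 3 above,
# has no effect because the driver JVM heap is fixed at launch time.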

Could you please help me with this? Thanks very much!

This is the error info:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:34893)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

Py4JNetworkErrorTraceback (most recent call last)
<ipython-input-59-0a9cdafefb39> in <module>()
----> 1 get_ipython().run_cell_magic(u'time', u'', u'# Boot training process\nwide_n_deep.fit(train_data,\n        batch_size = 294,\n        nb_epoch = 8,\n        validation_data = test_data\n        )\nprint("Optimization Done.")')

5 frames
</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-60> in time(self, line, cell, local_ns)

<timed exec> in <module>()

/usr/local/lib/python2.7/dist-packages/bigdl/util/common.pyc in callBigDlFunc(bigdl_type, name, *args)
    587             error = e
    588             if "does not exist" not in str(e):
--> 589                 raise e
    590         else:
    591             return result

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:34893)

And the bigdl.log :

*********
2019-08-19 02:42:10 INFO  DistriOptimizer$:181 - [Epoch 7 588/295][Iteration 14][Wall Clock 4.787957025s] Top1Accuracy is Accuracy(correct: 12, count: 103, accuracy: 0.11650485436893204)
2019-08-19 02:42:10 INFO  DistriOptimizer$:408 - [Epoch 8 294/295][Iteration 15][Wall Clock 5.095077726s] Trained 294 records in 0.256534343 seconds. Throughput is 1146.0454 records/second. Loss is 16.97959. 
2019-08-19 02:42:10 INFO  DistriOptimizer$:408 - [Epoch 8 294/295][Iteration 15][Wall Clock 5.095077726s] Trained 294 records in 0.256534343 seconds. Throughput is 1146.0454 records/second. Loss is 16.97959. 
2019-08-19 02:42:11 INFO  DistriOptimizer$:408 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Trained 294 records in 0.24451292 seconds. Throughput is 1202.3905 records/second. Loss is 16.97959. 
2019-08-19 02:42:11 INFO  DistriOptimizer$:408 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Trained 294 records in 0.24451292 seconds. Throughput is 1202.3905 records/second. Loss is 16.97959. 
2019-08-19 02:42:11 INFO  DistriOptimizer$:452 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Epoch finished. Wall clock time is 5391.449468 ms
2019-08-19 02:42:11 INFO  DistriOptimizer$:452 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Epoch finished. Wall clock time is 5391.449468 ms
2019-08-19 02:42:11 INFO  DistriOptimizer$:111 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Validate model...
2019-08-19 02:42:11 INFO  DistriOptimizer$:111 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Validate model...
2019-08-19 02:42:11 INFO  DistriOptimizer$:178 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] validate model throughput is 48.515327 records/second
2019-08-19 02:42:11 INFO  DistriOptimizer$:178 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] validate model throughput is 48.515327 records/second
2019-08-19 02:42:11 INFO  DistriOptimizer$:181 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Loss is (Loss: 32.541798, count: 2, Average Loss: 16.270899)
2019-08-19 02:42:11 INFO  DistriOptimizer$:181 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Loss is (Loss: 32.541798, count: 2, Average Loss: 16.270899)
2019-08-19 02:42:11 INFO  DistriOptimizer$:181 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Top1Accuracy is Accuracy(correct: 12, count: 103, accuracy: 0.11650485436893204)
2019-08-19 02:42:11 INFO  DistriOptimizer$:181 - [Epoch 8 588/295][Iteration 16][Wall Clock 5.339590646s] Top1Accuracy is Accuracy(correct: 12, count: 103, accuracy: 0.11650485436893204)
2019-08-19 02:42:12 ERROR Executor:91 - Exception in task 0.0 in stage 149.0 (TID 59)
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:260)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:50)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:48)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:517)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-08-19 02:42:12 ERROR Executor:91 - Exception in task 0.0 in stage 149.0 (TID 59)
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:260)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:50)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:48)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:517)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-08-19 02:42:12 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[Executor task launch worker for task 59,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:260)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:50)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:48)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:517)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-08-19 02:42:12 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[Executor task launch worker for task 59,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:260)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:50)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:48)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:517)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-08-19 02:42:12 WARN  TaskSetManager:66 - Lost task 0.0 in stage 149.0 (TID 59, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:260)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:50)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:48)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:517)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

2019-08-19 02:42:12 WARN  TaskSetManager:66 - Lost task 0.0 in stage 149.0 (TID 59, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1853)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:260)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:50)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:48)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:517)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

2019-08-19 02:42:12 ERROR TaskSetManager:70 - Task 0 in stage 149.0 failed 1 times; aborting job
2019-08-19 02:42:12 ERROR TaskSetManager:70 - Task 0 in stage 149.0 failed 1 times; aborting job
zjdx1998 commented 5 years ago

Actually, I set the spark config like this:

sc.getConf().getAll()
sc.stop()
_conf = sc.getConf().set('spark.driver.memory','32g')
_conf.set('spark.executor.memory','32g')
_conf.set('spark.driver.maxResultSize','32g')
_conf.getAll()
sc = init_nncontext(conf=_conf)
sc.getConf().getAll()

And got the new config:

[(u'spark.executorEnv.OMP_NUM_THREADS', u'1'),
 (u'spark.serializer', u'org.apache.spark.serializer.JavaSerializer'),
 (u'spark.driver.memory', u'32g'),
 (u'spark.driver.port', u'33559'),
 (u'spark.driver.maxResultSize', u'32g'),
 (u'spark.shuffle.reduceLocality.enabled', u'false'),
 (u'spark.executor.id', u'driver'),
 (u'spark.shuffle.blockTransferService', u'nio'),
 (u'spark.executorEnv.KMP_BLOCKTIME', u'0'),
 (u'spark.driver.extraClassPath',
  u'/usr/local/lib/python2.7/dist-packages/bigdl/share/lib/bigdl-0.8.0-jar-with-dependencies.jar:/usr/local/lib/python2.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.8.0-spark_2.4.3-0.5.1-jar-with-dependencies.jar'),
 (u'spark.executorEnv.KMP_AFFINITY', u'granularity=fine,compact,1,0'),
 (u'spark.app.name', u'WideAndDeep JobRecommendation'),
 (u'spark.executor.memory', u'32g'),
 (u'spark.app.id', u'local-1566201703722'),
 (u'spark.driver.host', u'11a156303bf1'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.speculation', u'false'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.master', u'local[*]'),
 (u'spark.scheduler.minRegisteredResourcesRatio', u'1.0'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.ui.showConsoleProgress', u'true'),
 (u'spark.executorEnv.KMP_SETTINGS', u'1')]

But it still didn't work in the last step, wide_n_deep.recommend_for_user(), which reported the errors in my last comment. I then checked the JVM heap size and got this:

Attaching to process ID 453, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.212-b03

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 1073741824 (1024.0MB)            ***watch out here***
   NewSize                  = 138412032 (132.0MB)
   MaxNewSize               = 357564416 (341.0MB)
   OldSize                  = 276824064 (264.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.tools.jmap.JMap.runTool(JMap.java:201)
    at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: java.lang.RuntimeException: unknown CollectedHeap type : class sun.jvm.hotspot.gc_interface.CollectedHeap
    at sun.jvm.hotspot.tools.HeapSummary.run(HeapSummary.java:144)
    at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
    at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
    at sun.jvm.hotspot.tools.HeapSummary.main(HeapSummary.java:49)
    ... 6 more
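
For reference, when jmap fails like this, the driver's actual maximum heap can also be read from the running PySpark session through py4j. A rough sketch follows; sc is assumed to be the live SparkContext, and _jvm is a private PySpark attribute:

# Query the driver JVM directly instead of attaching jmap to it.
max_heap_bytes = sc._jvm.java.lang.Runtime.getRuntime().maxMemory()
print("driver max heap: %.1f GB" % (max_heap_bytes / float(1024 ** 3)))
print("spark.driver.memory in the conf: %s" % sc.getConf().get("spark.driver.memory"))

A result of roughly 1 GB here, despite spark.driver.memory being 32g in the conf, would confirm that the setting was applied too late to affect the JVM that is actually running.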

It frustrated me.

hkvision commented 5 years ago

So you have successfully trained the model, and the error only happens when you call recommend_for_user?

zjdx1998 commented 5 years ago

> So you have successfully trained the model, and the error only happens when you call recommend_for_user?

Yes, but when I changed the values of spark.driver|executor.memory|maxResultSize, the training process failed again. I kept the config from my last comment and successfully trained the model, but it failed when I called recommend_for_user. Thanks for your reply!

hkvision commented 5 years ago

> So you have successfully trained the model, and the error only happens when you call recommend_for_user?
>
> Yes, but when I changed the values of spark.driver|executor.memory|maxResultSize, the training process failed again. I kept the config from my last comment and successfully trained the model, but it failed when I called recommend_for_user. Thanks for your reply!

By "changed the value of spark.driver.memory" you mean you expanded the memory but the training process still failed? I noticed from your error that you only have 200+ training records. What about the test data that is fed into recommend_for_user? (Maybe you can further reduce the data size to investigate what happens?)
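
On reducing the data size, here is a quick way to carve out a small debugging subset (a sketch; train_data is assumed to be the RDD of Samples passed to fit, as in the example notebook):

# Keep roughly 10% of the records for a quick debugging run.
small_train = train_data.sample(False, 0.1, seed=42)
print("debug subset size: %d" % small_train.count())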

Also, at the same time, can you run our notebook on WideAndDeep here: https://github.com/intel-analytics/analytics-zoo/tree/master/apps/recommendation-wide-n-deep ?

zjdx1998 commented 5 years ago

> So you have successfully trained the model, and the error only happens when you call recommend_for_user?
>
> Yes, but when I changed the values of spark.driver|executor.memory|maxResultSize, the training process failed again. I kept the config from my last comment and successfully trained the model, but it failed when I called recommend_for_user. Thanks for your reply!
>
> By "changed the value of spark.driver.memory" you mean you expanded the memory but the training process still failed? I noticed from your error that you only have 200+ training records. What about the test data that is fed into recommend_for_user? (Maybe you can further reduce the data size to investigate what happens?)
>
> Also, at the same time, can you run our notebook on WideAndDeep here: https://github.com/intel-analytics/analytics-zoo/tree/master/apps/recommendation-wide-n-deep ?

Thanks for the reply.

  1. Yes, I expanded the memory to 40g but training still failed.
  2. I used to have more than 100,000 records, but for testing I now keep only 200. Following your suggestion, I reduced the amount of data to 150 and ran twice, but both runs still failed.

One failure is java.lang.OutOfMemoryError: Java heap space, and the other is:

Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
  3. I can run the example notebook on my computer (with !export SPARK_DRIVER_MEMORY=32g), but I can't run it in Google Colab (with spark.driver.memory set to 32g).

hkvision commented 5 years ago

> So you have successfully trained the model, and the error only happens when you call recommend_for_user?
>
> Yes, but when I changed the values of spark.driver|executor.memory|maxResultSize, the training process failed again. I kept the config from my last comment and successfully trained the model, but it failed when I called recommend_for_user. Thanks for your reply!
>
> By "changed the value of spark.driver.memory" you mean you expanded the memory but the training process still failed? I noticed from your error that you only have 200+ training records. What about the test data that is fed into recommend_for_user? (Maybe you can further reduce the data size to investigate what happens?) Also, at the same time, can you run our notebook on WideAndDeep here: https://github.com/intel-analytics/analytics-zoo/tree/master/apps/recommendation-wide-n-deep ?
>
> Thanks for the reply.
>
>   1. Yes, I expanded the memory to 40g but training still failed.
>   2. I used to have more than 100,000 records, but for testing I now keep only 200. Following your suggestion, I reduced the amount of data to 150 and ran twice, but both runs still failed.
>
> One failure is java.lang.OutOfMemoryError: Java heap space, and the other is:
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 985, in send_command
>     response = connection.send_command(command)
>   File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1164, in send_command
>     "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> Py4JNetworkError: Error while receiving
>
>   3. I can run the example notebook on my computer (with !export SPARK_DRIVER_MEMORY=32g), but I can't run it in Google Colab (with spark.driver.memory set to 32g).

Then we suppose there is something wrong with the Google Colab configuration. Could you check how much memory it actually allocates for you? Or you can ask the Colab team for help.
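
For reference, one way to see how much RAM the Colab VM actually provides before choosing a spark.driver.memory value (a sketch; psutil is assumed to be available in Colab's default image):

import psutil

total_gb = psutil.virtual_memory().total / float(1024 ** 3)
print("Colab VM RAM: %.1f GB" % total_gb)
# If this is well below 32 GB, a 32g driver-memory request can never be satisfied
# on this VM, regardless of how the Spark conf is set.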

glorysdj commented 5 years ago

It seems you can check the memory here:

[image attachment]

hkvision commented 5 years ago

I will first close this issue. Feel free to reopen it if there are further problems. @zjdx1998