intel-machine-learning / DistML

DistML provides a supplement to MLlib to support model parallelism on Spark

ParamServerDriver job failure when task number set large #10

Open tanglizhe1105 opened 8 years ago

tanglizhe1105 commented 8 years ago

Vocabulary: 25970 Docs: 1000 Tokens: 106776 Topics: 1000

The cluster has 20 servers; each server has 8 CPU cores and 48 GB of memory. When --psCount is set to 20, LDA works well. With --psCount 40, LDA also works well, but when I try --psCount 60, some tasks of the parameter server job fail.

Log as follows:

java.lang.NegativeArraySizeException

Job aborted due to stage failure: Task 119 in stage 4.0 failed 4 times, most recent failure: Lost task 119.3 in stage 4.0 (TID 147, node-26): java.lang.NegativeArraySizeException
    at com.intel.distml.util.store.IntArrayStore.init(IntArrayStore.java:31)
    at com.intel.distml.util.DataStore.createStore(DataStore.java:56)
    at com.intel.distml.util.DataStore.createStores(DataStore.java:44)
    at com.intel.distml.platform.ParamServerDriver.paramServerTask(ParamServerDriver.scala:44)
    at com.intel.distml.platform.ParamServerDriver$$anonfun$run$3.apply(ParamServerDriver.scala:75)
    at com.intel.distml.platform.ParamServerDriver$$anonfun$run$3.apply(ParamServerDriver.scala:75)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
tanglizhe1105 commented 8 years ago

It seems that psCount should not be larger than the executor count, but without setting psCount it may not be possible to process a large-scale corpus, such as one with more than a million documents and more than a million words!

Thx

tanglizhe1105 commented 8 years ago

I have identified the cause of this bug: the model size (such as the array size of IntArrayWithIntKey, or the row count of IntMatrixWithIntKey) must not be less than --psCount. For details, please see here
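To illustrate why a model smaller than --psCount can trigger a NegativeArraySizeException, here is a hypothetical sketch (not the actual DistML code; the class and method names are made up) of a common contiguous-slice partitioning scheme. With a ceil-based slice width, servers whose start offset falls past the end of the model get a negative slice length, and allocating `new int[len]` with that length throws the exception seen in the stack trace above.

```java
public class PartitionSketch {
    // Hypothetical partitioning: each of psCount servers owns a
    // contiguous slice of [0, size), each slice ceil(size/psCount)
    // elements wide. When size < psCount, high-index servers start
    // beyond the end of the model and compute a negative length.
    static int sliceLength(long size, int psCount, int serverIndex) {
        long per = (size + psCount - 1) / psCount;  // ceil(size / psCount)
        long start = serverIndex * per;
        long end = Math.min(start + per, size);
        return (int) (end - start);                 // negative when start > size
    }

    public static void main(String[] args) {
        // A model of 10 elements split across 60 parameter servers:
        // servers past index 10 get start offsets beyond the array end.
        System.out.println(sliceLength(10, 60, 5));   // 1
        System.out.println(sliceLength(10, 60, 20));  // -10

        // Allocating with a negative length reproduces the crash:
        // new int[sliceLength(10, 60, 20)]  -> NegativeArraySizeException
    }
}
```

Under this assumed scheme, keeping every model dimension at least as large as --psCount (or clamping the slice length to zero) avoids the failure, which matches the observation above.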