intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.
Apache License 2.0

Error in training LSTM model #9097

Open gdg1212 opened 1 year ago

gdg1212 commented 1 year ago
val model = Sequential[Float]()
  .add(LSTM(inputSize = 3, hiddenSize = 50))
  .add(Linear(inputSize = 50, outputSize = 10))

// .add(LogSoftMax())

val optimizer = Optimizer(model = model,
  sampleRDD = data,
  criterion = MSECriterion[Float](),
  batchSize = 10)
optimizer
  .setOptimMethod(new Adam(0.01))
  .setEndWhen(Trigger.maxEpoch(10))
  .optimize()

Here data has the type RDD[Sample[Float]].

Training the model fails with the following error:

java.lang.ClassCastException: com.intel.analytics.bigdl.tensor.DenseTensor cannot be cast to com.intel.analytics.bigdl.utils.Table
    at com.intel.analytics.bigdl.nn.Cell.updateOutput(Cell.scala:48)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)

and

23/10/07 18:00:14 ERROR [Executor task launch worker for task 4.0 in stage 14.0 (TID 26)] Executor: Exception in task 4.0 in stage 14.0 (TID 26)
com.intel.analytics.bigdl.utils.LayerException: null
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:288) ~[bigdl-SPARK_3.1-0.13.0.jar:?]
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39) ~[bigdl-SPARK_3.1-0.13.0.jar:?]
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282) ~[bigdl-SPARK_3.1-0.13.0.jar:?]

gdg1212 commented 1 year ago

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 14.0 failed 1 times, most recent failure: Lost task 3.0 in stage 14.0 (TID 25) (master-1-1.c-52c86fc1cf6fe4b8.ap-southeast-5.emr.aliyuncs.com executor driver): Layer info: Sequential[929196ee]{
  [input -> (1) -> (2) -> output]
  (1): LSTM(3, 50, 0.0)
  (2): Linear[ed0e8842](50 -> 10)
}/LSTM(3, 50, 0.0)
java.lang.ClassCastException: com.intel.analytics.bigdl.tensor.DenseTensor cannot be cast to com.intel.analytics.bigdl.utils.Table
    at com.intel.analytics.bigdl.nn.Cell.updateOutput(Cell.scala:48)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:288)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2712)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2648)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2647)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2647)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1189)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2900)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2842)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2831)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:959)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2323)
    at org.apache.spark.rdd.RDD.$anonfun$reduce$1(RDD.scala:1111)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1093)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:353)
    at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:908)
    at LSTMDemo2$.main(LSTMDemo2.scala:112)
    at LSTMDemo2.main(LSTMDemo2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: Layer info: Sequential[929196ee]{
  [input -> (1) -> (2) -> output]
  (1): LSTM(3, 50, 0.0)
  (2): Linear[ed0e8842](50 -> 10)
}/LSTM(3, 50, 0.0)
java.lang.ClassCastException: com.intel.analytics.bigdl.tensor.DenseTensor cannot be cast to com.intel.analytics.bigdl.utils.Table
    at com.intel.analytics.bigdl.nn.Cell.updateOutput(Cell.scala:48)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:288)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
gdg1212 commented 1 year ago
optimizer
  .setOptimMethod(new Adam(0.01))
  .setEndWhen(Trigger.maxEpoch(10))
  .optimize()

The error is reported at the line setEndWhen(Trigger.maxEpoch(10)).

qiuxin2012 commented 1 year ago

Your model definition is wrong: an LSTM cell must be added to a Recurrent container rather than directly to the Sequential. For reference, see the model definition in this example: https://github.com/intel-analytics/BigDL/tree/main/scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/languagemodel
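
A minimal sketch of what wrapping the LSTM in a Recurrent container could look like, assuming the BigDL 0.13 package layout that appears in the stack trace (com.intel.analytics.bigdl.nn) and assuming each input is shaped (batch, time steps, 3). The object name LstmModelSketch, the helper buildModel, and the batch and sequence sizes are hypothetical and only serve the illustration; this is not the code from the linked example.

```scala
import com.intel.analytics.bigdl.nn.{Linear, LSTM, Recurrent, Sequential, TimeDistributed}
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.tensor.Tensor

// Hypothetical object name, used only for this sketch.
object LstmModelSketch {

  // Same layer sizes as in the question (3 -> 50 -> 10).
  def buildModel(): Sequential[Float] = {
    Sequential[Float]()
      // LSTM is a Cell: it goes inside Recurrent, which unrolls it over the time dimension.
      .add(Recurrent[Float]().add(LSTM[Float](inputSize = 3, hiddenSize = 50)))
      // Apply the Linear layer independently at every time step.
      .add(TimeDistributed[Float](Linear[Float](inputSize = 50, outputSize = 10)))
  }

  def main(args: Array[String]): Unit = {
    // Assumed input shape: batch of 4 sequences, 5 time steps, 3 features each.
    val input = Tensor[Float](4, 5, 3).rand()
    val output = buildModel().forward(input).toTensor[Float]
    println(output.size().mkString("x"))
  }
}
```

With this layout the model emits one 10-dimensional output per time step, so the labels passed to MSECriterion would need a matching (time steps, 10) shape in each Sample. If only the last time step is wanted, inserting a Select layer on the time dimension before the Linear layer is one possible alternative.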