intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Error in training LSTM model #9097

Open gdg1212 opened 9 months ago

gdg1212 commented 9 months ago
val model = Sequential[Float]()
  .add(LSTM(inputSize = 3, hiddenSize = 50))
  .add(Linear(inputSize = 50, outputSize = 10))

// .add(LogSoftMax())

val optimizer = Optimizer(model = model,
  sampleRDD = data,
  criterion = MSECriterion[Float](),
  batchSize = 10)
optimizer
  .setOptimMethod(new Adam(0.01))
  .setEndWhen(Trigger.maxEpoch(10))
  .optimize()

The format of data is data: RDD[Sample[Float]]

Training the model fails with the following error:

java.lang.ClassCastException: com.intel.analytics.bigdl.tensor.DenseTensor cannot be cast to com.intel.analytics.bigdl.utils.Table
    at com.intel.analytics.bigdl.nn.Cell.updateOutput(Cell.scala:48)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)

and

23/10/07 18:00:14 ERROR [Executor task launch worker for task 4.0 in stage 14.0 (TID 26)] Executor: Exception in task 4.0 in stage 14.0 (TID 26)
com.intel.analytics.bigdl.utils.LayerException: null
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:288) ~[bigdl-SPARK_3.1-0.13.0.jar:?]
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39) ~[bigdl-SPARK_3.1-0.13.0.jar:?]
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282) ~[bigdl-SPARK_3.1-0.13.0.jar:?]

gdg1212 commented 9 months ago

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 14.0 failed 1 times, most recent failure: Lost task 3.0 in stage 14.0 (TID 25) (master-1-1.c-52c86fc1cf6fe4b8.ap-southeast-5.emr.aliyuncs.com executor driver): Layer info: Sequential[929196ee]{
  [input -> (1) -> (2) -> output]
  (1): LSTM(3, 50, 0.0)
  (2): Linear[ed0e8842](50 -> 10)
}/LSTM(3, 50, 0.0)
java.lang.ClassCastException: com.intel.analytics.bigdl.tensor.DenseTensor cannot be cast to com.intel.analytics.bigdl.utils.Table
    at com.intel.analytics.bigdl.nn.Cell.updateOutput(Cell.scala:48)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:288)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2712)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2648)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2647)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2647)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1189)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2900)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2842)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2831)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:959)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2323)
    at org.apache.spark.rdd.RDD.$anonfun$reduce$1(RDD.scala:1111)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1093)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:353)
    at com.intel.analytics.bigdl.optim.DistriOptimizer.optimize(DistriOptimizer.scala:908)
    at LSTMDemo2$.main(LSTMDemo2.scala:112)
    at LSTMDemo2.main(LSTMDemo2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: Layer info: Sequential[929196ee]{
  [input -> (1) -> (2) -> output]
  (1): LSTM(3, 50, 0.0)
  (2): Linear[ed0e8842](50 -> 10)
}/LSTM(3, 50, 0.0)
java.lang.ClassCastException: com.intel.analytics.bigdl.tensor.DenseTensor cannot be cast to com.intel.analytics.bigdl.utils.Table
    at com.intel.analytics.bigdl.nn.Cell.updateOutput(Cell.scala:48)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:288)
    at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:39)
    at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:282)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.$anonfun$optimize$8(DistriOptimizer.scala:269)
    at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
    at com.intel.analytics.bigdl.utils.ThreadPool$$anon$4.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
gdg1212 commented 9 months ago
optimizer
  .setOptimMethod(new Adam(0.01))
  .setEndWhen(Trigger.maxEpoch(10))
  .optimize()

The error is reported at the line setEndWhen(Trigger.maxEpoch(10)).

qiuxin2012 commented 9 months ago

LSTM should be added to a Recurrent; your model definition is wrong. For reference, see the model definition in this example: https://github.com/intel-analytics/BigDL/tree/main/scala/dllib/src/main/scala/com/intel/analytics/bigdl/dllib/example/languagemodel
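
As a rough illustration of the suggestion above, here is a minimal sketch of how the model from the question might be restructured. It assumes the BigDL 0.13.0 package layout seen in the stack trace (com.intel.analytics.bigdl.nn; newer releases move these classes under com.intel.analytics.bigdl.dllib). LSTM is a cell that expects a Table (input plus hidden state), which is presumably why feeding it a plain tensor triggers the ClassCastException; Recurrent unrolls the cell over the time dimension. Two choices here are assumptions and not part of the original thread: Select(2, -1) keeps only the last time step so that one 10-value prediction is produced per sequence, and each sample's feature is taken to be a 2-D tensor of shape (timeSteps, 3). The names LstmModelSketch, buildModel and toSample are hypothetical.

```scala
import com.intel.analytics.bigdl.dataset.Sample
import com.intel.analytics.bigdl.nn.{Linear, LSTM, Recurrent, Select, Sequential}
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.tensor.Tensor

// Hypothetical helper object; the names are illustrative only.
object LstmModelSketch {

  // Wrap the LSTM cell in Recurrent so it is unrolled over time.
  // Batched input shape: (batch, timeSteps, inputSize).
  def buildModel(inputSize: Int = 3,
                 hiddenSize: Int = 50,
                 outputSize: Int = 10): Sequential[Float] =
    Sequential[Float]()
      .add(Recurrent[Float]().add(LSTM[Float](inputSize, hiddenSize)))
      // Assumption: predict from the last time step only.
      // Select(2, -1) maps (batch, timeSteps, hidden) to (batch, hidden).
      .add(Select[Float](2, -1))
      .add(Linear[Float](hiddenSize, outputSize))

  // Build one training sample: the feature is 2-D (timeSteps, inputSize),
  // and the label must match the model output size for MSECriterion.
  def toSample(steps: Array[Array[Float]], target: Array[Float]): Sample[Float] = {
    val feature = Tensor[Float](steps.flatten, Array(steps.length, steps.head.length))
    val label = Tensor[Float](target, Array(target.length))
    Sample[Float](feature, label)
  }
}
```

With this sketch, LstmModelSketch.buildModel() could be passed as the model argument of the Optimizer call from the question, provided the samples in data are built with 2-D features. If a prediction is needed at every time step instead, the language model example linked above wraps the Linear layer in TimeDistributed in place of the Select step.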