intel-analytics / ipex-llm


Low Accuracy on Sentiment Analysis LSTM - GloVe20News #2733

Open Bfzanchetta opened 5 years ago

Bfzanchetta commented 5 years ago

I'm attempting to train the provided Text Classifier with an LSTM instead of a CNN on an 8-worker BigDL cluster. However, training yields a very low accuracy. Here's the log output from the last attempt:

19/02/11 21:54:28 INFO DistriOptimizer$: [Epoch 13 15360/14923][Iteration 260][Wall Clock 597.135520654s] Epoch finished. Wall clock time is 604099.486244 ms
19/02/11 21:54:28 INFO DistriOptimizer$: [Epoch 13 15360/14923][Iteration 260][Wall Clock 597.135520654s] Validate model...
19/02/11 21:54:36 INFO DistriOptimizer$: [Epoch 13 15360/14923][Iteration 260][Wall Clock 597.135520654s] Top1Accuracy is Accuracy(correct: 226, count: 3905, accuracy: 0.05787451984635083)
19/02/11 21:54:38 INFO DistriOptimizer$: [Epoch 14 768/14923][Iteration 261][Wall Clock 605.903340897s] Trained 768 records in 1.803854653 seconds. Throughput is 425.7549 records/second. Loss is 2.9859028. 
19/02/11 21:54:40 INFO DistriOptimizer$: [Epoch 14 1536/14923][Iteration 262][Wall Clock 607.672976053s] Trained 768 records in 1.769635156 seconds. Throughput is 433.98776 records/second. Loss is 2.9682853. 
19/02/11 21:54:41 INFO DistriOptimizer$: [Epoch 14 2304/14923][Iteration 263][Wall Clock 609.393950892s] Trained 768 records in 1.720974839 seconds. Throughput is 446.2587 records/second. Loss is 2.9743903. 
19/02/11 21:54:43 INFO DistriOptimizer$: [Epoch 14 3072/14923][Iteration 264][Wall Clock 611.126353563s] Trained 768 records in 1.732402671 seconds. Throughput is 443.31494 records/second. Loss is 2.987369. 
19/02/11 21:54:45 INFO DistriOptimizer$: [Epoch 14 3840/14923][Iteration 265][Wall Clock 612.699872205s] Trained 768 records in 1.573518642 seconds. Throughput is 488.0781 records/second. Loss is 2.9755714. 
19/02/11 21:54:46 INFO DistriOptimizer$: [Epoch 14 4608/14923][Iteration 266][Wall Clock 614.428942279s] Trained 768 records in 1.729070074 seconds. Throughput is 444.1694 records/second. Loss is 2.9765773. 
19/02/11 21:54:48 INFO DistriOptimizer$: [Epoch 14 5376/14923][Iteration 267][Wall Clock 616.107890193s] Trained 768 records in 1.678947914 seconds. Throughput is 457.42932 records/second. Loss is 2.9611878. 
19/02/11 21:54:50 INFO DistriOptimizer$: [Epoch 14 6144/14923][Iteration 268][Wall Clock 617.76562979s] Trained 768 records in 1.657739597 seconds. Throughput is 463.28143 records/second. Loss is 2.9703681. 
19/02/11 21:54:51 INFO DistriOptimizer$: [Epoch 14 6912/14923][Iteration 269][Wall Clock 619.341088789s] Trained 768 records in 1.575458999 seconds. Throughput is 487.477 records/second. Loss is 2.974666. 
19/02/11 21:54:53 INFO DistriOptimizer$: [Epoch 14 7680/14923][Iteration 270][Wall Clock 621.086098227s] Trained 768 records in 1.745009438 seconds. Throughput is 440.11224 records/second. Loss is 2.976882. 
19/02/11 21:54:55 INFO DistriOptimizer$: [Epoch 14 8448/14923][Iteration 271][Wall Clock 622.767986926s] Trained 768 records in 1.681888699 seconds. Throughput is 456.62952 records/second. Loss is 2.9662917. 
19/02/11 21:54:56 INFO DistriOptimizer$: [Epoch 14 9216/14923][Iteration 272][Wall Clock 624.401519917s] Trained 768 records in 1.633532991 seconds. Throughput is 470.1466 records/second. Loss is 2.9706428. 
19/02/11 21:54:58 INFO DistriOptimizer$: [Epoch 14 9984/14923][Iteration 273][Wall Clock 626.105024646s] Trained 768 records in 1.703504729 seconds. Throughput is 450.83527 records/second. Loss is 2.9820547. 
19/02/11 21:55:00 INFO DistriOptimizer$: [Epoch 14 10752/14923][Iteration 274][Wall Clock 627.756389557s] Trained 768 records in 1.651364911 seconds. Throughput is 465.06982 records/second. Loss is 2.973231. 
19/02/11 21:55:01 INFO DistriOptimizer$: [Epoch 14 11520/14923][Iteration 275][Wall Clock 629.426095402s] Trained 768 records in 1.669705845 seconds. Throughput is 459.96124 records/second. Loss is 2.9795823. 
19/02/11 21:55:03 INFO DistriOptimizer$: [Epoch 14 12288/14923][Iteration 276][Wall Clock 631.182570675s] Trained 768 records in 1.756475273 seconds. Throughput is 437.2393 records/second. Loss is 2.9676907. 
19/02/11 21:55:05 INFO DistriOptimizer$: [Epoch 14 13056/14923][Iteration 277][Wall Clock 632.904944054s] Trained 768 records in 1.722373379 seconds. Throughput is 445.89636 records/second. Loss is 2.9707906. 
19/02/11 21:55:07 INFO DistriOptimizer$: [Epoch 14 13824/14923][Iteration 278][Wall Clock 634.620078722s] Trained 768 records in 1.715134668 seconds. Throughput is 447.77826 records/second. Loss is 2.980821. 
19/02/11 21:55:08 INFO DistriOptimizer$: [Epoch 14 14592/14923][Iteration 279][Wall Clock 636.389390068s] Trained 768 records in 1.769311346 seconds. Throughput is 434.0672 records/second. Loss is 2.9798176. 
19/02/11 21:55:10 INFO DistriOptimizer$: [Epoch 14 15360/14923][Iteration 280][Wall Clock 638.16818019s] Trained 768 records in 1.778790122 seconds. Throughput is 431.75415 records/second. Loss is 2.9808552. 
19/02/11 21:55:10 INFO DistriOptimizer$: [Epoch 14 15360/14923][Iteration 280][Wall Clock 638.16818019s] Epoch finished. Wall clock time is 646326.482571 ms
19/02/11 21:55:10 INFO DistriOptimizer$: [Epoch 14 15360/14923][Iteration 280][Wall Clock 638.16818019s] Validate model...
19/02/11 21:55:17 INFO DistriOptimizer$: [Epoch 14 15360/14923][Iteration 280][Wall Clock 638.16818019s] Top1Accuracy is Accuracy(correct: 224, count: 3905, accuracy: 0.05736235595390525)
19/02/11 21:55:19 INFO DistriOptimizer$: [Epoch 15 768/14923][Iteration 281][Wall Clock 648.14874026s] Trained 768 records in 1.822257689 seconds. Throughput is 421.45523 records/second. Loss is 2.984185. 
19/02/11 21:55:21 INFO DistriOptimizer$: [Epoch 15 1536/14923][Iteration 282][Wall Clock 649.942929731s] Trained 768 records in 1.794189471 seconds. Throughput is 428.04843 records/second. Loss is 2.9798527. 
19/02/11 21:55:22 INFO DistriOptimizer$: [Epoch 15 2304/14923][Iteration 283][Wall Clock 651.61417292s] Trained 768 records in 1.671243189 seconds. Throughput is 459.53815 records/second. Loss is 2.974022. 
19/02/11 21:55:24 INFO DistriOptimizer$: [Epoch 15 3072/14923][Iteration 284][Wall Clock 653.355413848s] Trained 768 records in 1.741240928 seconds. Throughput is 441.06473 records/second. Loss is 2.97656. 
19/02/11 21:55:26 INFO DistriOptimizer$: [Epoch 15 3840/14923][Iteration 285][Wall Clock 655.045011572s] Trained 768 records in 1.689597724 seconds. Throughput is 454.54608 records/second. Loss is 2.9848154. 
19/02/11 21:55:28 INFO DistriOptimizer$: [Epoch 15 4608/14923][Iteration 286][Wall Clock 656.788181856s] Trained 768 records in 1.743170284 seconds. Throughput is 440.5766 records/second. Loss is 2.9894907. 
19/02/11 21:55:29 INFO DistriOptimizer$: [Epoch 15 5376/14923][Iteration 287][Wall Clock 658.480493859s] Trained 768 records in 1.692312003 seconds. Throughput is 453.81702 records/second. Loss is 2.9772322. 
19/02/11 21:55:31 INFO DistriOptimizer$: [Epoch 15 6144/14923][Iteration 288][Wall Clock 660.166880089s] Trained 768 records in 1.68638623 seconds. Throughput is 455.41168 records/second. Loss is 2.984101. 
19/02/11 21:55:33 INFO DistriOptimizer$: [Epoch 15 6912/14923][Iteration 289][Wall Clock 661.85638721s] Trained 768 records in 1.689507121 seconds. Throughput is 454.57043 records/second. Loss is 2.982938. 
19/02/11 21:55:34 INFO DistriOptimizer$: [Epoch 15 7680/14923][Iteration 290][Wall Clock 663.620091493s] Trained 768 records in 1.763704283 seconds. Throughput is 435.44714 records/second. Loss is 2.97358. 
19/02/11 21:55:36 INFO DistriOptimizer$: [Epoch 15 8448/14923][Iteration 291][Wall Clock 665.260683388s] Trained 768 records in 1.640591895 seconds. Throughput is 468.12375 records/second. Loss is 2.966507. 
19/02/11 21:55:38 INFO DistriOptimizer$: [Epoch 15 9216/14923][Iteration 292][Wall Clock 666.960838219s] Trained 768 records in 1.700154831 seconds. Throughput is 451.72357 records/second. Loss is 2.9650447. 
19/02/11 21:55:39 INFO DistriOptimizer$: [Epoch 15 9984/14923][Iteration 293][Wall Clock 668.672705791s] Trained 768 records in 1.711867572 seconds. Throughput is 448.63284 records/second. Loss is 2.9792418. 
19/02/11 21:55:41 INFO DistriOptimizer$: [Epoch 15 10752/14923][Iteration 294][Wall Clock 670.321723432s] Trained 768 records in 1.649017641 seconds. Throughput is 465.7318 records/second. Loss is 2.9735532. 
19/02/11 21:55:43 INFO DistriOptimizer$: [Epoch 15 11520/14923][Iteration 295][Wall Clock 672.04684509s] Trained 768 records in 1.725121658 seconds. Throughput is 445.186 records/second. Loss is 2.9840238. 
19/02/11 21:55:45 INFO DistriOptimizer$: [Epoch 15 12288/14923][Iteration 296][Wall Clock 673.80154841s] Trained 768 records in 1.75470332 seconds. Throughput is 437.68085 records/second. Loss is 2.9751852. 
19/02/11 21:55:46 INFO DistriOptimizer$: [Epoch 15 13056/14923][Iteration 297][Wall Clock 675.43471919s] Trained 768 records in 1.63317078 seconds. Throughput is 470.2509 records/second. Loss is 2.9777195. 
19/02/11 21:55:48 INFO DistriOptimizer$: [Epoch 15 13824/14923][Iteration 298][Wall Clock 677.025826644s] Trained 768 records in 1.591107454 seconds. Throughput is 482.68268 records/second. Loss is 2.9705133. 
19/02/11 21:55:50 INFO DistriOptimizer$: [Epoch 15 14592/14923][Iteration 299][Wall Clock 678.736325676s] Trained 768 records in 1.710499032 seconds. Throughput is 448.99176 records/second. Loss is 2.9635608. 
19/02/11 21:55:51 INFO DistriOptimizer$: [Epoch 15 15360/14923][Iteration 300][Wall Clock 680.38851991s] Trained 768 records in 1.652194234 seconds. Throughput is 464.8364 records/second. Loss is 2.972312. 
19/02/11 21:55:51 INFO DistriOptimizer$: [Epoch 15 15360/14923][Iteration 300][Wall Clock 680.38851991s] Epoch finished. Wall clock time is 687604.905848 ms
19/02/11 21:55:51 INFO DistriOptimizer$: [Epoch 15 15360/14923][Iteration 300][Wall Clock 680.38851991s] Validate model...
19/02/11 21:55:58 INFO DistriOptimizer$: [Epoch 15 15360/14923][Iteration 300][Wall Clock 680.38851991s] Top1Accuracy is Accuracy(correct: 213, count: 3905, accuracy: 0.05454545454545454)
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.disk.diskSpaceUsed_MB, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.maxMem_MB, value=102145
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.maxOffHeapMem_MB, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.maxOnHeapMem_MB, value=102145
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.memUsed_MB, value=10153
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.offHeapMemUsed_MB, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.onHeapMemUsed_MB, value=10153
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.remainingMem_MB, value=91991
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.remainingOffHeapMem_MB, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.BlockManager.memory.remainingOnHeapMem_MB, value=91991
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.DAGScheduler.job.activeJobs, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.DAGScheduler.job.allJobs, value=640
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.DAGScheduler.stage.failedStages, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.DAGScheduler.stage.runningStages, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.DAGScheduler.stage.waitingStages, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.PS-MarkSweep.count, value=4
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.PS-MarkSweep.time, value=363
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.PS-Scavenge.count, value=14
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.PS-Scavenge.time, value=437
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.heap.committed, value=4083154944
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.heap.init, value=924844032
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.heap.max, value=28633464832
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.heap.usage, value=0.052959056715529244
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.heap.used, value=1516401288
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.non-heap.committed, value=131137536
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.non-heap.init, value=2555904
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.non-heap.max, value=-1
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.non-heap.usage, value=-1.29285344E8
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.non-heap.used, value=129285408
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Code-Cache.committed, value=38338560
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Code-Cache.init, value=2555904
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Code-Cache.max, value=251658240
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Code-Cache.usage, value=0.15116780598958332
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Code-Cache.used, value=38042624
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Compressed-Class-Space.committed, value=11010048
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Compressed-Class-Space.init, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Compressed-Class-Space.max, value=1073741824
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Compressed-Class-Space.usage, value=0.009983859956264496
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Compressed-Class-Space.used, value=10720088
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Metaspace.committed, value=81788928
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Metaspace.init, value=0
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Metaspace.max, value=-1
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Metaspace.usage, value=0.9845225994403546
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.Metaspace.used, value=80523048
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Eden-Space.committed, value=2864709632
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Eden-Space.init, value=231735296
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Eden-Space.max, value=10462691328
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Eden-Space.usage, value=0.10167260857186591
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Eden-Space.used, value=1063769120
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Old-Gen.committed, value=1132986368
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Old-Gen.init, value=616562688
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Old-Gen.max, value=21474836480
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Old-Gen.usage, value=0.019513317197561265
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Old-Gen.used, value=419045296
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Survivor-Space.committed, value=85458944
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Survivor-Space.init, value=38273024
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Survivor-Space.max, value=85458944
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Survivor-Space.usage, value=0.3930176342923217
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.pools.PS-Survivor-Space.used, value=33586872
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.total.committed, value=4214292480
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.total.init, value=927399936
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.total.max, value=28633464831
19/02/11 21:56:00 INFO metrics: type=GAUGE, name=application_1549761699712_0010.driver.jvm.total.used, value=1645870696
19/02/11 21:56:00 INFO metrics: type=COUNTER, name=application_1549761699712_0010.driver.HiveExternalCatalog.fileCacheHits, count=0
19/02/11 21:56:00 INFO metrics: type=COUNTER, name=application_1549761699712_0010.driver.HiveExternalCatalog.filesDiscovered, count=0
19/02/11 21:56:00 INFO metrics: type=COUNTER, name=application_1549761699712_0010.driver.HiveExternalCatalog.hiveClientCalls, count=0
19/02/11 21:56:00 INFO metrics: type=COUNTER, name=application_1549761699712_0010.driver.HiveExternalCatalog.parallelListingJobCount, count=0
19/02/11 21:56:00 INFO metrics: type=COUNTER, name=application_1549761699712_0010.driver.HiveExternalCatalog.partitionsFetched, count=0
19/02/11 21:56:00 INFO metrics: type=HISTOGRAM, name=application_1549761699712_0010.driver.CodeGenerator.compilationTime, count=0, min=0, max=0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0
19/02/11 21:56:00 INFO metrics: type=HISTOGRAM, name=application_1549761699712_0010.driver.CodeGenerator.generatedClassSize, count=0, min=0, max=0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0
19/02/11 21:56:00 INFO metrics: type=HISTOGRAM, name=application_1549761699712_0010.driver.CodeGenerator.generatedMethodSize, count=0, min=0, max=0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0
19/02/11 21:56:00 INFO metrics: type=HISTOGRAM, name=application_1549761699712_0010.driver.CodeGenerator.sourceCodeSize, count=0, min=0, max=0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0
19/02/11 21:56:00 INFO metrics: type=TIMER, name=application_1549761699712_0010.driver.DAGScheduler.messageProcessingTime, count=10760, min=9.999999999999999E-5, max=9.895289, mean=0.39610863933167473, stddev=1.1862531925728736, median=0.052801999999999995, p75=0.153805, p95=3.748509, p98=4.677436, p99=5.120849, p999=8.462446, mean_rate=9.381494965500876, m1=16.47311897813088, m5=14.75906615830696, m15=8.503909252857786, rate_unit=events/second, duration_unit=milliseconds

[EDIT] I titled this thread with the Text Classifier model, but I'm referring to the Sentiment Analysis example. Has anyone else seen accuracy this low with an LSTM in the Text Classifier application? Thanks!
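
One quick sanity check on the log above is to compare the reported numbers against random guessing. The sketch below assumes 20 target classes, as in the 20 Newsgroups dataset named in the title; the class count is not stated anywhere in this thread, so treat `NUM_CLASSES` as an assumption. The correct/count pairs are copied from the `Top1Accuracy` lines for epochs 13 to 15.

```python
import math

# (correct, count) pairs copied from the Top1Accuracy lines in the log above.
validation_results = {13: (226, 3905), 14: (224, 3905), 15: (213, 3905)}

# Assumption: 20 target classes (20 Newsgroups); not stated in the thread.
NUM_CLASSES = 20

# Expected accuracy and cross-entropy loss of a uniform random guess.
chance_accuracy = 1.0 / NUM_CLASSES
chance_loss = math.log(NUM_CLASSES)


def accuracy(correct, count):
    """Fraction of validation samples classified correctly."""
    return correct / count


for epoch, (correct, count) in sorted(validation_results.items()):
    print(f"epoch {epoch}: accuracy={accuracy(correct, count):.4f} "
          f"(chance={chance_accuracy:.4f})")

print(f"loss of a uniform guess = {chance_loss:.4f}")
```

Under that assumption, a uniform guess gives an accuracy of 0.05 and a loss of about 2.996. The logged accuracy (roughly 0.055 to 0.058) and loss (roughly 2.96 to 2.99) both sit very close to those values, which would suggest the model has barely moved away from chance level after 15 epochs.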

wzhongyuan commented 5 years ago

Which example were you running?

Bfzanchetta commented 5 years ago

I'm running the Sentiment Analysis example on BigDL. I'm using 2 head nodes and 8 workers. Each node has 16 vCPUs and 120GB of RAM. They are all set up as a Spark2 cluster with Hadoop and YARN.

An interesting point: I was having the same issue with the Text Classifier example using LSTM, where after 15 or 20 epochs it kept ending with <5% accuracy. However, I had missed a passage on BigDL's wiki saying that distributed LSTM training needs more epochs to reach the accuracy that a regular CNN model achieves by epoch 15.

I will test this in Sentiment Analysis.

jason-dai commented 5 years ago

For sentiment analysis, you may take a look at https://github.com/intel-analytics/analytics-zoo/tree/master/apps/sentiment-analysis

YongyiZhou commented 5 years ago

Hello @Bfzanchetta, I'm now trying to build a text classifier model as well. However, I ran into RPC lost errors during training, even though I only used a 9G training set on a 300G*4 cluster.

Could you show me your "build model" and "optimizer" code so I can figure out whether it's a Spark config problem or a problem in my app?

Thanks!