microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Optimal value of chunkSize #1761

Open acastelli1 opened 1 year ago

acastelli1 commented 1 year ago

SynapseML version

"com.microsoft.azure:synapseml_2.12:0.9.5-13-d1b51517-SNAPSHOT",

System information

Google Dataproc 2.0

Describe the problem

What is the optimal value of chunkSize? From the documentation it seems it should be set to the number of rows in the dataframe; is it really that simple? What if the dataframe has several million rows? Is there a maximum value?

The documentation is not very clear on this.

Code to reproduce issue

params_dict = dict(
    objective=args.objective,
    labelCol=args.target,
    earlyStoppingRound=args.early_stopping_rounds,
    numThreads=args.n_threads,
    featuresCol="features",
    categoricalSlotNames=cat_features_indexed,
    validationIndicatorCol="validation",
    verbosity=1,
    numIterations=1000,
    metric="mse",
    maxBin=args.max_bin,
    featuresShapCol="shap_vals",
    numLeaves=args.max_leaves,
    useBarrierExecutionMode=args.barrier_exec_mode,
    isProvideTrainingMetric=True,
    numBatches=args.n_batches,
    chunkSize=train_valid_test.count(),
)

Other info / logs

[2022-12-09 14:54:41.894]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : 11/org.reactivestreams_reactive-streams-1.0.3.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-handler-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-handler-proxy-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-buffer-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-codec-http-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-codec-http2-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-transport-native-unix-common-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-transport-native-kqueue-4.1.68.Final-osx-x86_64.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.projectreactor.netty_reactor-netty-http-1.0.11.jar --user-class-path 
file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-common-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-resolver-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-transport-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-codec-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-codec-socks-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-resolver-dns-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-resolver-dns-native-macos-4.1.68.Final-osx-x86_64.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.projectreactor.netty_reactor-netty-core-1.0.11.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/io.netty_netty-codec-dns-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/com.github.vowpalwabbit_vw-jni-8.9.1.jar --user-class-path 
file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000011/com.microsoft.ml.lightgbm_lightgbmlib-3.2.110.jar > /var/log/hadoop-yarn/userlogs/application_1670596110290_0005/container_1670596110290_0005_01_000011/stdout 2> /var/log/hadoop-yarn/userlogs/application_1670596110290_0005/container_1670596110290_0005_01_000011/stderr Last 4096 bytes of stderr : M task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:37 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:37 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:37 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:37 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:37 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send current task info to driver: 10.149.57.17:12489 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Using singleDatasetMode. 
Is main worker: false for task id: 39418 and main task id: 39405 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 6 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 6 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing DoneSignal to 5 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12488 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Using singleDatasetMode. Is main worker: false for task id: 39442 and main task id: 39405 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 7 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 7 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing DoneSignal to 6 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12488 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Using singleDatasetMode. 
Is main worker: false for task id: 39450 and main task id: 39405 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 8 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 8 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing DoneSignal to 7 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12488 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM worker got nodes for network init: 10.149.57.127:12473,10.149.57.17:12489,10.149.57.131:12512,10.149.57.124:12424,10.149.57.125:12498,10.149.57.25:12408,10.149.57.130:12432,10.149.57.129:12440,10.149.57.47:12448,10.149.57.21:12480,10.149.57.132:12505,10.149.57.128:12416,10.149.57.19:12464,10.149.57.126:12456 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task listening on: 12489 [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. terminate called after throwing an instance of 'std::runtime_error' what(): Memory exhausted! Cannot allocate new ChunkedArray chunk.

. 22/12/09 14:54:41 WARN org.apache.spark.deploy.yarn.YarnAllocator: Container from a bad node: container_1670596110290_0005_01_000004 on host: train-commission-frac-model-v0-15faby-w-8.c.b-ppc-bidding-dqs-1cmcp4c.internal. Exit status: 134. Diagnostics: main task id: 39380 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 8 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 8 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing DoneSignal to 7 22/12/09 14:54:41 WARN com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Could not bind to port 12432... 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12433 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM worker got nodes for network init: 10.149.57.127:12473,10.149.57.17:12489,10.149.57.131:12512,10.149.57.124:12424,10.149.57.125:12498,10.149.57.25:12408,10.149.57.130:12432,10.149.57.129:12440,10.149.57.47:12448,10.149.57.21:12480,10.149.57.132:12505,10.149.57.128:12416,10.149.57.19:12464,10.149.57.126:12456 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task listening on: 12432 [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. terminate called recursively terminate called after throwing an instance of 'std::runtime_error'

[2022-12-09 14:54:41.916]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : 04/org.reactivestreams_reactive-streams-1.0.3.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-handler-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-handler-proxy-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-buffer-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-codec-http-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-codec-http2-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-transport-native-unix-common-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-transport-native-kqueue-4.1.68.Final-osx-x86_64.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.projectreactor.netty_reactor-netty-http-1.0.11.jar --user-class-path 
file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-common-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-resolver-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-transport-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-codec-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-codec-socks-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-resolver-dns-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-resolver-dns-native-macos-4.1.68.Final-osx-x86_64.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.projectreactor.netty_reactor-netty-core-1.0.11.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/io.netty_netty-codec-dns-4.1.68.Final.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/com.github.vowpalwabbit_vw-jni-8.9.1.jar --user-class-path 
file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1670596110290_0005/container_1670596110290_0005_01_000004/com.microsoft.ml.lightgbm_lightgbmlib-3.2.110.jar > /var/log/hadoop-yarn/userlogs/application_1670596110290_0005/container_1670596110290_0005_01_000004/stdout 2> /var/log/hadoop-yarn/userlogs/application_1670596110290_0005/container_1670596110290_0005_01_000004/stderr Last 4096 bytes of stderr : ml.lightgbm.LightGBMRegressor: Could not bind to port 12432... 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send current task info to driver: 10.149.57.130:12432 22/12/09 14:54:38 WARN com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Could not bind to port 12433... 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12434 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:38 WARN com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Could not bind to port 12432... 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12433 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Using singleDatasetMode. 
Is main worker: false for task id: 39394 and main task id: 39380 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 7 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 7 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing DoneSignal to 6 22/12/09 14:54:38 WARN com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Could not bind to port 12432... 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12433 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Using singleDatasetMode. Is main worker: false for task id: 39447 and main task id: 39380 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 8 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing ArrayProcessedSignal to 8 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Task incrementing DoneSignal to 7 22/12/09 14:54:41 WARN com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Could not bind to port 12432... 
22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12433 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM worker got nodes for network init: 10.149.57.127:12473,10.149.57.17:12489,10.149.57.131:12512,10.149.57.124:12424,10.149.57.125:12498,10.149.57.25:12408,10.149.57.130:12432,10.149.57.129:12440,10.149.57.47:12448,10.149.57.21:12480,10.149.57.132:12505,10.149.57.128:12416,10.149.57.19:12464,10.149.57.126:12456 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task listening on: 12432 [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. terminate called recursively terminate called after throwing an instance of 'std::runtime_error'

. 22/12/09 14:54:41 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 8 for reason Container from a bad node: container_1670596110290_0005_01_000008 on host: train-commission-frac-model-v0-15faby-w-0.c.b-ppc-bidding-dqs-1cmcp4c.internal. Exit status: 134. Diagnostics: apse.ml.lightgbm.LightGBMRegressor: Successfully bound to port 12465 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task connecting to host: 10.149.57.45 and port: 46233 22/12/09 14:54:38 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: send empty status to driver 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM worker got nodes for network init: 10.149.57.127:12473,10.149.57.17:12489,10.149.57.131:12512,10.149.57.124:12424,10.149.57.125:12498,10.149.57.25:12408,10.149.57.130:12432,10.149.57.129:12440,10.149.57.47:12448,10.149.57.21:12480,10.149.57.132:12505,10.149.57.128:12416,10.149.57.19:12464,10.149.57.126:12456 22/12/09 14:54:41 INFO com.microsoft.azure.synapse.ml.lightgbm.LightGBMRegressor: LightGBM task listening on: 12464 [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. [LightGBM] [Fatal] Memory exhausted! Cannot allocate new ChunkedArray chunk. terminate called after throwing an instance of 'std::runtime_error' terminate called recursively terminate called recursively terminate called recursively terminate called recursively

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

mhamilton723 commented 1 year ago

@svotaw I think this might be another example that could benefit from the new streaming APIs just checked in

svotaw commented 1 year ago

I don't believe there is an "optimal" value for chunkSize. It depends on a lot of things and would have to be determined for your dataset. I'd suggest trying our newest streaming execution mode, which does not depend on chunkSize at all.

Can you try this? com.microsoft.azure:synapseml_2.12:0.10.2-21-7785cb5e-SNAPSHOT

You'll need to set executionMode="streaming"

This new mode avoids intermediate chunked memory entirely, so it is far more efficient. It is brand new, so it is not in a public minor release yet.
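As a minimal sketch of what that configuration might look like from PySpark (the objective and column names below are illustrative placeholders, not from this issue; the estimator call is commented out because it needs a live SparkSession with SynapseML on the classpath):

```python
# Hypothetical parameter set for the streaming execution mode described above.
# Note that chunkSize is deliberately absent: streaming mode does not use it.
streaming_params = dict(
    objective="regression",
    labelCol="label",
    featuresCol="features",
    executionMode="streaming",  # bypasses the chunked (bulk) buffers
)

# With SynapseML available, the dict would be unpacked into the estimator:
# from synapse.ml.lightgbm import LightGBMRegressor
# model = LightGBMRegressor(**streaming_params).fit(train_df)
```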

andrejfd commented 1 year ago

@svotaw those maven coordinates don't seem to work. I am running into similar memory issues when using larger-scale data (100M rows x 25 features)

svotaw commented 1 year ago

@andrejfd Can you try this more recent one? Where are you using the library from? If Scala, you might need an extra resolver. Maybe the previous snapshot also just expired. Hopefully we will soon cut a newer official version with this code.

com.microsoft.azure:synapseml_2.12:0.10.2-71-a7e20ce3-SNAPSHOT

resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
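Assuming a PySpark session, the same snapshot coordinate and snapshot repository above would typically be wired in with standard Spark options (a sketch, not a verified invocation):

```shell
# Pull the snapshot build and register the snapshot repository for the session.
pyspark \
  --packages com.microsoft.azure:synapseml_2.12:0.10.2-71-a7e20ce3-SNAPSHOT \
  --repositories https://oss.sonatype.org/content/repositories/snapshots
```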

andrejfd commented 1 year ago

@svotaw Great, will try. Any idea when the regular release might be? Currently I'm using sampling with LightGBM in Python, but I would like to scale to several billion records.

I was able to fit about 100MM records in a training session by increasing the memory for spark.driver.maxResultSize.

When I increase to 200MM records in the set, I get: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(47, 10) finished unsuccessfully.

Let's assume that compute is not an issue. I've tried various configurations and repartitions, is there a way to get around this error in bulk execution mode?
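For reference, the two driver-side knobs mentioned in this exchange can be sketched as session settings (the values here are illustrative assumptions, not recommendations):

```python
# Hypothetical session settings echoing the workaround described above.
# "8g" and "32g" are placeholder values; tune them for your cluster.
conf = {
    "spark.driver.maxResultSize": "8g",  # Spark's default is 1g
    "spark.driver.memory": "32g",
}

# A SparkSession builder would consume them like:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```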

svotaw commented 1 year ago

I'm hoping that within a few weeks we'll release a new minor version that includes the streaming mode updates. We'd have to know more about your specific scenario (#partitions, #nodes, #executors) to give advice. Unfortunately, with LightGBM there can be no failures, or the whole training stops; this is a limitation of the LightGBM algorithm, which requires all nodes to run at once. That is one benefit of streaming mode: you can squeeze more data into fewer nodes and reduce the chance of failure.

Are you using the Barrier Execution Mode? What happens if you don't?
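The toggle being asked about is a single estimator parameter, the same useBarrierExecutionMode flag that appears in the issue's original params_dict; a minimal sketch:

```python
# Flipping useBarrierExecutionMode switches between gang-scheduled (barrier)
# tasks, where all tasks must launch together, and regular task scheduling.
params_barrier_on = dict(useBarrierExecutionMode=True)    # all tasks start together
params_barrier_off = dict(useBarrierExecutionMode=False)  # regular scheduling
# Either dict would be unpacked into LightGBMRegressor(...) alongside the
# other training parameters.
```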

andrejfd commented 1 year ago

I was using barrier execution mode. Disabling it produces an OOM, so the only way I could fix this was by increasing the driver size and raising spark.driver.maxResultSize. It seems non-trivial to scale to very large datasets. Chunking the dataset seems to lead to significantly worse results, so I try to keep it in one piece. Definitely looking forward to the streaming execution mode.

svotaw commented 1 year ago

We have a new release, 11.2, with all streaming features implemented. We will soon release the formal 1.0 version.

(bulk mode will not be worked on anymore)