Open · jenniew opened this issue 1 year ago
There are reasons for the issue:
Related PR: https://github.com/intel-analytics/BigDL/pull/8152
For issue 2, change the rank number to the partition id, so that the rank number on each worker is distinct and contiguous and init_process_group can run successfully. With this change we also don't need to run the first job to get the cluster info.
Related PR: https://github.com/intel-analytics/BigDL/pull/8188
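For context, a minimal sketch of the idea behind this fix, assuming a plain PySpark + torch.distributed setup (the master address, port, and function names here are illustrative, not the actual BigDL code):

```python
# Illustrative sketch only, not the BigDL implementation: the Spark partition id
# is already distinct and contiguous across the workers, so it can be used
# directly as the torch.distributed rank, which avoids running an extra Spark
# job just to collect cluster info for rank assignment.
import torch.distributed as dist
from pyspark.sql import SparkSession

MASTER_ADDR = "10.0.0.1"  # assumed rendezvous host (e.g. the driver)
MASTER_PORT = 29500       # assumed rendezvous port
WORLD_SIZE = 4            # one worker per partition

def train_on_partition(partition_id, iterator):
    # The partition id is a unique, contiguous value in [0, WORLD_SIZE),
    # so it can serve directly as the process-group rank.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{MASTER_ADDR}:{MASTER_PORT}",
        rank=partition_id,
        world_size=WORLD_SIZE,
    )
    # ... build the model, wrap it in DistributedDataParallel, train on `iterator` ...
    dist.destroy_process_group()
    return [partition_id]

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.range(WORLD_SIZE, numSlices=WORLD_SIZE)
print(rdd.mapPartitionsWithIndex(train_on_partition).collect())
```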
I am having the same issue running PySpark Orca on a YARN cluster. The issue occurred with both the stable and nightly builds, and with both the Spark 2 and Spark 3 versions.
Some of my runs using a small number of nodes (<= 30) were successful; however, after scaling to 50 nodes or more, there has been no successful run yet.
This is my code to init the Orca context:
```python
init_orca_context(cluster_mode='yarn-client', num_nodes=100, cores=20, memory="16g",
                  driver_memory="20g", driver_cores=8,
                  extra_python_lib="model.py,config.zip,data.zip,loss.zip",
                  conf={"spark.task.cpus": "20",
                        "spark.dynamicAllocation.enabled": "false",
                        "spark.driver.maxResultSize": "4g"})
```
For a small number of nodes (<= 30): when runs were successful, the setup_distributed function in pytorch_pyspark_worker finished in 1 or 2 seconds. When the issue happened, the function was stuck until the timeout (default is 30 minutes).
@jenniew do you have other thoughts about what could be the reason? If any info or logs are needed, I can help investigate this issue.
@truongnx15 What is the data source for your Orca Estimator training? Is it an RDD/DataFrame or a callable that returns a PyTorch data loader? Can you provide your error logs? We'll try to reproduce your issue to see how to fix it.
We're using a DataFrame as the data source. My environment is Python 3.7 with the Orca nightly build for Spark 3 installed on 10 June 2023.
I ran with num_nodes=50 and 20 cores per node.
The error occurred 30 minutes after 50 log lines like the following appeared, one for each of the 50 partitions:
```
[partition = 1, ip = 10.197.84.206] [2023-06-13 21:45:39] INFO cluster is: [LIST OF THE IP:PORT FOR THE CLUSTER]
[partition = 1, ip = 10.197.84.206] [2023-06-13 21:45:39] INFO Connected log server on 10.197.74.131:34785
```
Below is the stack trace:
```
Traceback (most recent call last):
File "train_spark.py", line 178, in
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
@truongnx15, thank you for the information. Would you mind attaching the whole log file? What is the approximate size of your training data?
My DataFrame is about 5M rows and 5 columns (schema: 3 long, 1 float, 1 array[long] with a fixed length of 2). Below is the full log file (from a different run on a different cluster using Python 3.8): spark3_50nodes.log
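For anyone else trying to reproduce this, a rough sketch of generating data with a matching shape (the column names and value ranges below are made up):

```python
# Hypothetical reproduction data: ~5M rows with 3 long columns, 1 float column,
# and 1 array<long> column holding exactly 2 elements, mirroring the schema above.
import random

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, FloatType, LongType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("user", LongType()),
    StructField("item", LongType()),
    StructField("timestamp", LongType()),
    StructField("label", FloatType()),
    StructField("features", ArrayType(LongType())),
])

def make_row(i):
    return (i, random.randrange(10**6), random.randrange(10**9),
            random.random(), [random.randrange(100), random.randrange(100)])

rdd = spark.sparkContext.range(5_000_000, numSlices=50).map(make_row)
df = spark.createDataFrame(rdd, schema)
df.printSchema()
```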
@truongnx15 We created similar data and ran with 50-80 executors using your Spark configuration, but could not reproduce your issue.
Thank you for having a look at it. I am still having the same issue. It even fails for a small number of nodes sometimes, so I guess it's something related to the network or my YARN cluster setup, but I haven't figured it out yet.
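One way to narrow down whether the network is at fault is a quick connectivity probe from every partition to the address the workers need to reach (the host and port below are placeholders, not values taken from this cluster):

```python
# Rough diagnostic sketch: from each partition, try to open a TCP connection to
# the driver-side address that the workers will rendezvous on, and report
# failures immediately instead of waiting for the 30-minute timeout.
import socket

from pyspark.sql import SparkSession

TARGET_HOST = "10.0.0.1"  # placeholder: driver / TCP store host
TARGET_PORT = 29500       # placeholder: rendezvous port

def probe(partition_id, _iterator):
    try:
        with socket.create_connection((TARGET_HOST, TARGET_PORT), timeout=10):
            yield (partition_id, socket.gethostname(), "ok")
    except OSError as exc:
        yield (partition_id, socket.gethostname(), f"failed: {exc}")

spark = SparkSession.builder.getOrCreate()
results = (spark.sparkContext
           .range(50, numSlices=50)
           .mapPartitionsWithIndex(probe)
           .collect())
for partition_id, host, status in results:
    print(partition_id, host, status)
```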
Running PyTorch PySpark Estimator training on multiple nodes on Kubernetes with big models sometimes hits `RuntimeError: Socket Timeout` when the workers call init_process_group. The traceback is as below:

```
Exception: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 594, in process
    out_iter = func(split_index, iterator)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2863, in func
  File "/opt/bigdl-2.3.0-SNAPSHOT/python/bigdl-orca-spark_3.1.3-2.3.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/pytorch_pyspark_estimator.py", line 370, in
  File "/opt/bigdl-2.3.0-SNAPSHOT/python/bigdl-orca-spark_3.1.3-2.3.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/pytorch_pyspark_estimator.py", line 367, in transform_func
  File "./bigdl-orca-spark_3.1.3-2.3.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/pytorch_pyspark_worker.py", line 93, in init
    self.setup_distributed(self.mode, cluster_info, driver_ip, driver_tcp_store_port)
  File "./bigdl-orca-spark_3.1.3-2.3.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/pytorch_pyspark_worker.py", line 116, in setup_distributed
    self.setup_torch_distribute(tcp_store_host=driver_ip,
  File "./bigdl-orca-spark_3.1.3-2.3.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/core/lifecycle.py", line 29, in setup_torch_distribute
    self._init_torch_ddp(tcp_store_host, tcp_store_port, world_rank,
  File "./bigdl-orca-spark_3.1.3-2.3.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/core/lifecycle.py", line 72, in _init_torch_ddp
    dist.init_process_group(
  File "/opt/spark/work-dir/lora2/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/spark/work-dir/lora2/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout
```
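From the traceback, the timeout is hit while the gloo process group is being constructed, and that construction receives the `timeout` passed to `init_process_group` (30 minutes by default). If rendezvous on a big cluster is merely slow rather than broken, raising the timeout in plain PyTorch looks like the sketch below; whether this can be plumbed through the Orca Estimator is a separate question:

```python
# Sketch of raising the process-group timeout in plain torch.distributed;
# the address and ranks are placeholders.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.0.0.1:29500",  # placeholder master address/port
    rank=0,
    world_size=2,
    timeout=timedelta(hours=1),          # default is timedelta(minutes=30)
)
```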