trams opened this issue 5 years ago (status: Open)
I relaunched the job but could not reproduce this stack trace. The job still hangs, but without any error messages. I'll try relaunching with debug logs enabled and report back to this issue.
I also tried launching with the gpu_hist tree method, and it exhibits the same behavior: it hangs.
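For context, here is a minimal sketch of the kind of configuration being compared, using the XGBoost4J-Spark API of the 1.0.0-Beta era. The column names, parameter values, and the trainingDf DataFrame are placeholders for illustration, not taken from the actual job:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Placeholder parameters; switching "tree_method" between "hist" (CPU)
// and "gpu_hist" (GPU) is the only change between the two runs above.
val params = Map(
  "objective" -> "binary:logistic",
  "num_round" -> 100,
  "num_workers" -> 12,      // matches the 12 nodes in the tracker log below
  "tree_method" -> "hist"   // swap to "gpu_hist" for the GPU run
)

val classifier = new XGBoostClassifier(params)
  .setFeaturesCol("features")  // placeholder column names
  .setLabelCol("label")

// trainingDf is assumed to be an already-prepared DataFrame.
val model = classifier.fit(trainingDf)
```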
Hi @trams, could you provide:

1. Thread dumps from both the driver and one or more executors, taken from the Spark web UI while the job hangs (access from "Executors -> Thread Dump -> Runnable thread -> details").
2. Your Spark environment (version, cores, memory, etc.).

It is possible that Rabit hangs, but I'm not sure whether that is your case. Please provide the info above so we can track it down. Thanks!
I am sorry for the slow response.
I do not have a command line, because I do not launch the xgboost Spark training job from a command line, but rather through a sophisticated Spark YARN submitter we built inside the company. But I think I understand what you need, and I will try to provide it.
I lost the original log (I should have saved it). I will try to reproduce the problem and then provide the log.
Hello nice people,
I came across this article https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd (and that's why I created this issue). I am very excited to start using it. It took some time to learn which versions are available: 1.0.0-Beta and 1.0.0-Beta2. I picked the latter.
When I launched distributed training without GPUs (tree method hist) to make sure CPU-based training works, I noticed it hanged. There were 0 iterations done, and I saw:

2019-10-23 21:31:57 INFO RabitTracker$TrackerProcessLogger:58 - 2019-10-23 21:31:57,930 INFO @tracker All of 12 nodes getting started

Could you point me to your source code repo and tell me which version (git sha1) you used to build 1.0.0-Beta, so I can try to troubleshoot?
Also, any pointers on how to work around this are welcome. Can I enable the Scala-based tracker? Do you know how?
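XGBoost4J-Spark releases of that era expose a TrackerConf for selecting the tracker implementation. A minimal sketch, assuming 1.0.0-Beta still accepts the tracker_conf parameter; the timeout and other parameter values here are placeholders:

```scala
import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassifier}

// TrackerConf(workerConnectionTimeout, trackerImpl): passing "scala" selects
// the Scala-based Rabit tracker instead of the default Python one. A timeout
// of 0 means "wait indefinitely"; a finite value (in milliseconds) makes a
// hung worker connection fail fast instead of blocking forever.
val trackerConf = TrackerConf(60 * 1000L, "scala")

val classifier = new XGBoostClassifier(Map(
  "objective" -> "binary:logistic",
  "num_round" -> 100,
  "num_workers" -> 12,
  "tree_method" -> "hist",
  "tracker_conf" -> trackerConf
))
```

If the job then fails with a worker-connection timeout rather than hanging at "All of 12 nodes getting started", that would point at the tracker/Rabit handshake rather than the training itself.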