XGBoost with GPU hangs on Rabit initialization

trams commented 5 years ago

Hello nice people,

I came across this article https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd (which is why I am opening this issue), and I am very excited to start using it. It took some time to learn which versions are available: 1.0.0-Beta and 1.0.0-Beta2. I picked the latter.

When I launched distributed training without GPUs (tree method hist) to make sure CPU-based training works, I noticed that it hung. Zero iterations had been completed, and I saw this in the log:

2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Traceback (most recent call last):
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     self.run()
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/usr/lib64/python2.7/threading.py", line 765, in run
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     self.__target(*self.__args, **self.__kwargs)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 324, in run
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     self.accept_slaves(nslave)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 268, in accept_slaves
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     s = SlaveEntry(fd, s_addr)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -   File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 64, in __init__
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -     assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - AssertionError: invalid magic number=542393671 from 172.28.42.144
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO  ml.dmlc.xgboost4j.java.RabitTracker - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO  XGBoostSpark - Rabit returns with exit code 0

Could you point me to your source code repo and tell me which version (git sha1) you used to build 1.0.0-Beta, so I can try to troubleshoot?

Any pointers on how to work around this are also welcome. Can I enable the Scala-based tracker? Do you know how?
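
One possible way to read the `invalid magic number` assertion in the trace above: the tracker interprets the first four bytes of each new connection as an integer and compares it with its handshake constant. Decoding the reported value back into bytes can hint at what actually connected. The sketch below assumes a little-endian host and that the tracker's `kMagic` is `0xff99`; neither is confirmed in this thread.

```python
import struct

K_MAGIC = 0xFF99   # assumed value of kMagic in the Rabit tracker script
magic = 542393671  # value reported in the AssertionError above

# Reinterpret the 32-bit integer as the four raw bytes read from the socket.
# Little-endian byte order is an assumption (typical for x86 hosts).
raw = struct.pack("<i", magic)

print(hex(magic), raw)          # 0x20544547 b'GET '
print(magic == K_MAGIC)         # False
```

The bytes spell `GET `, which looks like the start of an HTTP request. So something other than an XGBoost worker (for example a port scanner or a health check) may have connected to the tracker port from 172.28.42.144. This is a guess, not a confirmed diagnosis, and it may be unrelated to the hang itself.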

trams commented 5 years ago

I relaunched the job and could not reproduce this stack trace. The job still hangs, but without any error messages. I'll try relaunching with debug logs enabled and report back in this issue.

I also tried launching with the gpu_hist method, and it exhibits the same behavior: it hangs.

wjxiz1992 commented 5 years ago

Hi @trams, could you provide:

1. Your full command-line parameters.
2. The driver log around the point where it hangs, like:

2019-10-23 21:31:57 INFO  RabitTracker$TrackerProcessLogger:58 - 2019-10-23 21:31:57,930 INFO @tracker All of 12 nodes getting started

3. The "ThreadDump" info for the driver and for one or more executors, taken from the Spark web UI while it hangs (access it from "Executors -> ThreadDump -> Runnable thread -> details").
4. Your Spark environment (version, cores, memory, etc.).

It is possible that Rabit hangs, but I'm not sure that is what is happening in your case. Please provide the info above so we can track it down. Thanks!
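
To pull the tracker-related lines out of a large driver log before attaching it, a small filter along these lines may help. It is only a sketch: the function name `tracker_lines` is hypothetical, and the patterns are based solely on the log samples quoted in this thread, so other relevant lines could be missed.

```python
import re

# Patterns taken only from the log samples quoted in this thread.
TRACKER_LINE = re.compile(r"RabitTracker|XGBoostSpark|@tracker")

def tracker_lines(log_text):
    """Return the lines of a driver log that mention the Rabit tracker."""
    return [line for line in log_text.splitlines() if TRACKER_LINE.search(line)]

# The first and last lines are copied from this thread; the middle one is a
# made-up unrelated line used only to show that it gets filtered out.
sample = """\
2019-10-23 21:31:57 INFO  RabitTracker$TrackerProcessLogger:58 - 2019-10-23 21:31:57,930 INFO @tracker All of 12 nodes getting started
2019-10-23 21:31:58 INFO  SomeOtherClass - unrelated line
2019-10-17 00:22:29 INFO  XGBoostSpark - Rabit returns with exit code 0
"""

for line in tracker_lines(sample):
    print(line)
```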

trams commented 5 years ago

I am sorry for the slow response.

I do not have a command line, because I do not launch the XGBoost Spark training job from a command line but through a rather sophisticated Spark YARN submitter we built inside the company. But I think I understand what you need, and I will try to provide it.
I lost the original log (I should have saved it). I will try to reproduce the problem and provide the log.

rapidsai / spark-examples

XGBoost with GPU hangs on Rabit initialization #61