Orted hanging(?) while importing TensorFlow.

camda03 commented 2 years ago

Thank you for taking the time to submit an issue!

Background information

I'm running Open MPI in support of TensorFlow Version = 2.7.0 Sometimes, for no readily apparent reason, importing TensorFlow "import tensorflow as tf" hangs.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

$ ompi_info Package: Debian OpenMPI Open MPI: 4.0.3 Open MPI repo revision: v4.0.3 Open MPI release date: Mar 03, 2020 Open RTE: 4.0.3 Open RTE repo revision: v4.0.3 Open RTE release date: Mar 03, 2020 OPAL: 4.0.3 OPAL repo revision: v4.0.3 OPAL release date: Mar 03, 2020 MPI API: 3.1.0 Ident string: 4.0.3 Prefix: /usr Configured architecture: x86_64-pc-linux-gnu Configure host: lcy01-amd64-020 Configured by: buildd Configured on: Wed Apr 15 13:14:35 UTC 2020 Configure host: lcy01-amd64-020

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

As part of Tensorflow 2.7.0 or Unbuntu 20.04.03 LTS.

If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`. n/a

Please describe the system on which you are running

Operating system/version: Ubuntu 20.04.03 LTS
Computer hardware: AMD® Ryzen threadripper pro 3995wx 64-cores × 128 , 4 x NVIDIA A6000, 1TB RAM.
Network type: Wired and Wireless. Wireless not currently in use.

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 192.168.23.188 netmask 255.255.255.0 broadcast 192.168.23.255 ether 3c:ec:ef:7f:e5:d6 txqueuelen 1000 (Ethernet) RX packets 108917 bytes 58854446 (58.8 MB) RX errors 4101693796888 dropped 13 overruns 0 frame 0 TX packets 106113 bytes 16633028 (16.6 MB) TX errors 4093103833088 dropped 0 overruns 0 carrier 0 collisions 0

enxb03af2b6059f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 ether b0:3a:f2:b6:05:9f txqueuelen 1000 (Ethernet) RX packets 17 bytes 1224 (1.2 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 97 bytes 16758 (16.7 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 ether 3c:ec:ef:7f:e4:3a txqueuelen 1000 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 device memory 0xf3d00000-f3d7ffff

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10 loop txqueuelen 1000 (Local Loopback) RX packets 3721281 bytes 4901452415 (4.9 GB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 3721281 bytes 4901452415 (4.9 GB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I would like to not have orted/OpenMPI issues when I'm trying to load TensorFlow.

I'm running OpenMPI in support of TensorFlow Version = 2.7.0 Sometimes, for no readily apparent reason, importing TensorFlow "import tensorflow as tf" hangs. That's the end of the road, so to speak. There is no recovery from this hanging.

When this happens, orted processes pile up. dead_orted.txt I've verified that the orted process count goes up by one with each TensorFlow hanging incident.

Here is an strace of the failure. They all fail in the same spot. Sometimes this works, however. I have a log of this (success) as well if you'd like. tf_strace_fail.txt

1643319577.136261 stat("/usr/bin/orted", {st_mode=S_IFREG|0755, st_size=14648, ...}) = 0 1643319577.136290 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f7108822a10) = 328055 1643319577.137636 close(8) = 0 1643319577.137656 close(9) = 0 1643319577.137676 read(7, strace: Process 327927 detached

The process was detached because I had to kill it with ctrl-c.

This happens "randomly". Any insights as to what causes this and how to prevent it would be much appreciated.

Thanks!

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -np 2 ./hello_world

rhc54 commented 2 years ago

I'm not that familiar with the current state of MPI integration with Tensorflow, but it doesn't look all that solid from what you have provided. They'd be a lot better off running it under PRRTE as opposed to what they appear to be doing (which is to start a whole bunch of singletons). OMPI singletons involve some significant overhead as we have to assume the procs will do a "comm_spawn", which means each singleton has to create its own RTE support. This is what PRRTE was created to resolve, and it works quite well for other modeling environments.

Can you perhaps point me to someone I can work with to improve this scenario?

As for solving the current situation: there isn't enough info to diagnose the cause. It isn't clear that the orted is even hanging - there is some interaction that occurs between the orted and the parent singleton during startup, and it is quite possible that the daemon is fine and there is something on the singleton end that is hung. I suspect you are seeing the orted count go up because there is something in Tensorflow that continues to kickoff more singletons, but I have no idea what the logic is inside that modeling code or why it is doing so.

camda03 commented 2 years ago

Thanks for your response and for putting up with my limited understanding of the inner workings of OpenMPI. Here's a ticket I created with the TensorFlow team on this issue. https://github.com/tensorflow/tensorflow/issues/54218One of their developers or support team members is on this ticket. I appreciate your explanation of what might be going wrong here.As you properly point out, orted may not be actually hung.All I can really say (with my limited knowledge) is that after a TensorFlow import failure, there is one more of those orted processes than there was before. Hopefully the ticket I've provided will help to setup some discussion between the two teams.That should benefit a lot of people. If I can provide other information please don't hesitate to reach out to me.

Thanks very much for your help! Dave On 2/17/2022 at 10:55 PM, "Ralph Castain" wrote: I'm not that familiar with the current state of MPI integration with Tensorflow, but it doesn't look all that solid from what you have provided. They'd be a lot better off running it under PRRTE as opposed to what they appear to be doing (which is to start a whole bunch of singletons). OMPI singletons involve some significant overhead as we have to assume the procs will do a "comm_spawn", which means each singleton has to create its own RTE support. This is what PRRTE was created to resolve, and it works quite well for other modeling environments.

Can you perhaps point me to someone I can work with to improve this

scenario?

As for solving the current situation: there isn't enough info to

diagnose the cause. It isn't clear that the orted is even hanging - there is some interaction that occurs between the orted and the parent singleton during startup, and it is quite possible that the daemon is fine and there is something on the singleton end that is hung. I suspect you are seeing the orted count go up because there is something in Tensorflow that continues to kickoff more singletons, but I have no idea what the logic is inside that modeling code or why it is doing so.

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

rhc54 commented 2 years ago

Please see https://github.com/tensorflow/tensorflow/issues/54450

camda03 commented 2 years ago

Excellent!!! Thanks very much!!! Dave

On 2/18/2022 at 10:04 AM, "Ralph Castain" wrote: Please see tensorflow/tensorflow#54450

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

camda03 commented 2 years ago

Are there any suggested workaround or things to try to get around this in the meantime?

This worked yesterday and it's hanging today.

Behavior is the same today as what's in the log files and the tickets.

I'm just trying to work with the GPUs on my local machine.

Yesterday I (thought I) got around the hanging issue by kill -9ing the orted processes but that isn't working today either.

Part of the problem is that I have two models running so I can't kill all of the orted processes. (I'm not sure if that was really what cleared the hanging problem yesterday.)

Any input and advice would be greatly appreciated!

Thanks!

Dave

rhc54 commented 2 years ago

I'm not entirely sure of exactly what you are doing, but have you tried using "mpirun" to start your application procs? Scanning thru our prior issues that mention Horovod, this seems to be a common solution.

camda03 commented 2 years ago

I'm not doing anything directly with OpenMPI. I'm just running TensorFlow though a Jupyter Notebook. The following command reproduces the behavior without using Jupyter i.e. it hangs when the Jupyter Notebook hangs and vice versa.

strace -ttt python -c 'import tensorflow as tf; print(tf.version)' | tee -a strace-python.txt

This ends in: 1645317911.314069 stat("/usr/bin/orted", {st_mode=S_IFREG|0755, st_size=14648, ...}) = 0 1645317911.314104 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7efc8241ea10) = 3704039 1645317911.315939 close(8) = 0 1645317911.315960 close(9) = 0 1645317911.315984 read(7, ^Cstrace: Process 3703908 detached when it hangs, which it is doing at the moment.

Thanks! Dave On 2/19/2022 at 4:39 PM, "Ralph Castain" wrote: I'm not entirely sure of exactly what you are doing, but have you tried using "mpirun" to start your application procs? Scanning thru our prior issues that mention Horovod, this seems to be a common solution.

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

rhc54 commented 2 years ago

Afraid I have no advice to offer on that use-case (I have no idea what Jupyter Notebook even is). Obviously, something inside the tensorflow package is calling MPI_Init, but I have no insight into where or why. My comment was just that if you are starting a python application, then if you have OMPI installed you could do "mpirun python myapp" and that might satisfy the temsorflow import problem.

Since the tensorflow package appears tied to OMPI, you must have OMPI installed somewhere it can "see" - so "mpirun" must be available. Might as well use it.

camda03 commented 2 years ago

I tried the following modified command and got the results below. The script still hung but it generated different output.

$ strace -ttt mpirun python -c 'import tensorflow as tf; print(tf.version)' | tee -a strace-python.txt 1645319891.011617 close(10) = 0 1645319891.011627 getsockname(9, {sa_family=AF_INET, sin_port=htons(54692), sin_addr=inet_addr("127.0.0.1")}, [124->16]) = 0 1645319891.011642 fcntl(9, F_GETFL) = 0x2 (flags O_RDWR) 1645319891.011652 fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0 1645319891.011662 fcntl(9, F_SETFD, FD_CLOEXEC) = 0 1645319891.011673 poll([{fd=9, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=9, revents=POLLOUT}]) 1645319891.011687 writev(9, [{iov_base="lv2220", iov_len=12}, {iov_base="", iov_len=0}, {iov_base="MIT-MAGIC-COOKIE-1", iov_len=18}, {iov_base="", iov_len=2}, {iov_base="c274U237242Ie300]421731:`", iov_len=16}, {iov_base="", iov_len=0}], 6) = 48 1645319891.011717 recvfrom(9, 0x55d93d020aa0, 8, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) 1645319891.011730 poll([{fd=9, events=POLLIN}], 1, -1 {script hung here as well} Thanks for looking at this! I'll follow up with mpirun and see what I can find. Thanks! Dave

On 2/19/2022 at 8:14 PM, "Ralph Castain" wrote: Afraid I have no advice to offer on that use-case (I have no idea what Jupyter Notebook even is). Obviously, something inside the tensorflow package is calling MPI_Init, but I have no insight into where or why. My comment was just that if you are starting a python application, then if you have OMPI installed you could do "mpirun python myapp" and that might satisfy the temsorflow import problem.

Since the tensorflow package appears tied to OMPI, you must have OMPI

installed somewhere it can "see" - so "mpirun" must be available. Might as well use it.

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

ggouaillardet commented 2 years ago

When you see a hang, can you get the stack trace of the (python) TensorFlow process?

orted might appear to be stuck ... because is has nothing to do.

camda03 commented 2 years ago

stack_trace.txt Will this work?

Thanks!

Dave

ggouaillardet commented 2 years ago

I am sorry I cannot make sense of this output.

why don't you simply run TensorFlow (nostrace whatsoever), and then pstack <pid of TensorFlow> when it hangs?

camda03 commented 2 years ago

Here's what I got. 0 S david 3922630 3183114 99 80 0 - 1470274 pipe_r 21:41 pts/4 00:00:13 python -c import tensorflow as tf2; print(tf2.version) 0 S david 3923166 3907974 0 80 0 - 2261 pipe_r 21:42 pts/5
00:00:00 grep --color=auto -i python @.***:~$ sudo pstack 3922630

3922630: python -c import tensorflow as tf2; print(tf2.version) (No symbols found) 0x7f3b2b7ac17c: ???? (0, 0, 0, 0, 0, 0) + fffffffffc73a730

Thanks! Dave

On 2/19/2022 at 9:32 PM, "Gilles Gouaillardet" wrote: I am sorry I cannot make sense of this output.

why don't you simply run TensorFlow (nostrace whatsoever), and then

pstack when it hangs?

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

ggouaillardet commented 2 years ago

unfortunately that did not help :-(

a previous message mentions MIT-MAGIC-COOKIE-1, which is related to the X11 server.

What is the value of the DISPLAY environment variable? if it is set, can you unset it and try again?

The root cause could be X11 and not MPI.

camda03 commented 2 years ago

Yes, that wasn't a very impressive output. :-)

DISPLAY was DISPLAY=:0 I unset it. env | grep -i DISPLAY {nothing} It still failed. Here's the strace tail FYI.

1645326628.760110 stat("/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_singleton.so", {st_mode=S_IFREG|0644, st_size=27704, ...}) = 0 1645326628.760138 openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_singleton.so", O_RDONLY|O_CLOEXEC) = 7 1645326628.760157 read(7, "177ELF2113>1340'"..., 832) = 832 1645326628.760176 fstat(7, {st_mode=S_IFREG|0644, st_size=27704, ...}) = 0 1645326628.760194 mmap(NULL, 29544, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 7, 0) = 0x7fcf19b53000 1645326628.760212 mmap(0x7fcf19b55000, 12288, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 7, 0x2000) = 0x7fcf19b55000 1645326628.760234 mmap(0x7fcf19b58000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 7, 0x5000) = 0x7fcf19b58000 1645326628.760251 mmap(0x7fcf19b59000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 7, 0x5000) = 0x7fcf19b59000 1645326628.760273 close(7) = 0 1645326628.760319 mprotect(0x7fcf19b59000, 4096, PROT_READ) = 0 1645326628.760404 pipe([7, 8]) = 0 1645326628.760424 pipe([9, 10]) = 0 1645326628.760443 stat("/usr/bin/orted", {st_mode=S_IFREG|0755, st_size=14648, ...}) = 0 1645326628.760470 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fcf881a6a10) = 3976364 1645326628.762885 close(8) = 0 1645326628.762900 close(9) = 0 1645326628.762919 read(7, ^C0x3bd4f10, 255) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) strace: Process 3975843 detached

Thanks! Dave

On 2/19/2022 at 10:03 PM, "Gilles Gouaillardet" wrote: unfortunately that did not help :-(

a previous message mentions MIT-MAGIC-COOKIE-1, which is related to

the X11 server.

What is the value of the DISPLAY environment variable?

if it is set, can you unset it and try again?

The root cause could be X11 and not MPI. 

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

ggouaillardet commented 2 years ago

well, that was worth trying anyway :-)

Let's do it the hard way then:

strace -f -s 8192 -o tf.strace --  python -c 'import tensorflow as tf2; print(tf2.__version__)'

then please do compress tf.strace and upload it so I can have a look

camda03 commented 2 years ago

Here it is. This is really the .Z file but I had to rename it to get it uploaded.

tf.strace.zip

BTW this is my latest ifconfig.

$ ifconfig eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 ether 3c:ec:ef:7f:e4:3a txqueuelen 1000 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 device memory 0xf3d00000-f3d7ffff

enxb03af2b6059f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 ether b0:3a:f2:b6:05:9f txqueuelen 1000 (Ethernet) RX packets 44 bytes 3008 (3.0 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 30 bytes 5626 (5.6 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 192.168.23.188 netmask 255.255.255.0 broadcast 192.168.23.255 ether 3c:ec:ef:7f:e5:d6 txqueuelen 1000 (Ethernet) RX packets 284116 bytes 237283185 (237.2 MB) RX errors 14023068408520 dropped 188 overruns 0 frame 0 TX packets 187870 bytes 23601903 (23.6 MB) TX errors 14031658156032 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10 loop txqueuelen 1000 (Local Loopback) RX packets 11720369 bytes 25679669972 (25.6 GB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 11720369 bytes 25679669972 (25.6 GB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Thanks!

Dave

ggouaillardet commented 2 years ago

orted is stuck communicating with 127.0.0.1:6006. Can you start TF (no strace here) and then ptrace -p <pid_of_orted> when it is stuck?

At this stage, my best bet is hwloc is stuck trying to communicate with the gpu driver.

try running xhost + and see if if helps (note you might not be able to do this...)

If you have root access on your box, I would suggest you try to move /usr/lib/x86_64-linux-gnu/hwloc/hwloc_gl.so and /usr/lib/x86_64-linux-gnu/hwloc/hwloc_opencl.so to an other place and give it an other try. (do not forget to restore them when you are done!)

@bgoglin can you please share some insights and a possible workaround?

camda03 commented 2 years ago

TensorBoard is running on 127.0.0.1:6006. tensorboard --logdir=./my_logs

This is used to show the results of training runs (among other things) in TensorFlow. Would that be what you're seeing? Thanks! Dave On 2/19/2022 at 11:38 PM, "Gilles Gouaillardet" wrote: orted is stuck communicating with 127.0.0.1:6006. Can you start TF (no strace here) and then ptrace -p when it is stuck?

At this stage, my best bet is hwloc is stuck trying to communicate

with the gpu driver.

try running xhost + and see if if helps (note you might not be able

to do this...)

If you have root access on your box, I would suggest you try to move

/usr/lib/x86_64-linux-gnu/hwloc/hwloc_gl.so and /usr/lib/x86_64-linux-gnu/hwloc/hwloc_opencl.so to an other place and give it an other try. (do not forget to restore them when you are done!)

@bgoglin can you please share some insights and a possible

workaround?

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

ggouaillardet commented 2 years ago

ports 6000 and following are generally used by X11 servers, so my guess hwloc try/scan all of them ... and something goes wrong because TensorBoard is not an X11 server.

so yeah, run TensorBoard on an other port and see how it goes!

camda03 commented 2 years ago

tensorboard --port 9999 --logdir=./my_logs

It worked!!!!!! 1645333587.361210 munmap(0x7f215416b000, 33554432) = 0 1645333587.361229 munmap(0x7f215196a000, 33554432) = 0 1645333587.361246 munmap(0x7f214f169000, 33554432) = 0 1645333587.361264 munmap(0x7f214c968000, 33554432) = 0 1645333587.361282 munmap(0x7f214a968000, 33554432) = 0 1645333587.361300 munmap(0x7f2148167000, 33554432) = 0 1645333587.361592 exit_group(0) = ? 1645333587.370186 +++ exited with 0 +++

Amazing!!! Thanks!! Dave

On 2/19/2022 at 11:54 PM, "Gilles Gouaillardet" wrote: ports 6000 and following are generally used by X11 servers, so my guess hwloc try/scan all of them ... and something goes wrong because TensorBoard is not an X11 server.

so yeah, run TensorBoard on an other port and see how it goes! 

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

camda03 commented 2 years ago

I just restarted a TensorFlow model and that worked too! It was hanging all day today.

Thanks!!!! Dave

On 2/19/2022 at 11:54 PM, "Gilles Gouaillardet" wrote: ports 6000 and following are generally used by X11 servers, so my guess hwloc try/scan all of them ... and something goes wrong because TensorBoard is not an X11 server.

so yeah, run TensorBoard on an other port and see how it goes! 

—

Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.Message ID:

markwdalton commented 2 years ago

Thank you! For this! At least it gets us down the right path. The network configuration on this machine is different than how we normally setup machines.

The X11 and orted is a common issue we have seen on HPC and clusters over the last decades.

However, I have not seen this in the last 5-10 years. Yes, commonly X11 setup of xauthority/xhosts or for orted it was normally network configuration issues.

So I will dig into the tensorboard port and hwloc_gl.so/hwloc_opencl.so. I still suspect the unique network configuration on this machine. (it dynamically relabels the interface - versus assigned to the MAC address and also is relabeled differently than other machines).

Mark

ggouaillardet commented 2 years ago

@markwdalton I think this is a much simpler issue.

If a daemon (that is not a X11 server) is running on a port generally assigned to a X11 server, hwloc might try to talk to it and then all bets are off: in this case it seems hwloc/OpenCL/OpenGL send something to TensorBoard (thinking it is an X11 server), and TensorBoard chooses not to reply anything (since it was not a valid TensorBoard protocol/request) causing orted and then the MPI application to hang forever.

rhc54 commented 2 years ago

This feels like a bug to me in whichever library is attempting to query the X11 server. Assuming that a socket in the 6000+ range must be an X11 server is a pretty big assumption - seems to me that the library should include a timeout feature to deal with exactly this scenario.

@bgoglin I suspect this must be happening in one of the secondary libraries (OpenCL or OpenGL) and not hwloc itself, yes? Can you perhaps pass this problem upstream to the right place so they can resolve it?

bgoglin commented 1 year ago

For some reason I never received the github notification about this bug. A colleague just got the same issue and we resolved it by disabling the hwloc GL plugin with "export HWLOC_COMPONENTS=-gl". I don't remember if we ever talked about OMPI/ORTE blacklisting some hwloc components. Something like this (between init() and load()) should be enough if using hwloc >= 2.1:

hwloc_topology_set_components(topology, HWLOC_TOPOLOGY_COMPONENTS_FLAG_BLACKLIST, "gl");

FYI, I am considering removing the GL backend in future releases because it's not clear anybody still uses it, it seems buggy on wayland, and has annoying issues like this one.

rhc54 commented 1 year ago

I can certainly add that to ORTE/PRRTE/PMIx. I gather there is nothing equivalent in hwloc < 2.1? We do still see users running at older versions (e.g., 1.11). Or is the gl support something that only started in 2.1 so we can ignore it for earlier versions?

bgoglin commented 1 year ago

By the way, the issue of the library not having a timeout is in core X11 libraries (in XOpenDisplay()). Those are veeery old and pretty much abandoned. It won't ever be fixed. Port 6000 is officially documented as Xserver port in /etc/services. Some distros also mark 6001-6007. Some don't. Opening 6006 in tensorboard would have been a bad idea 20 years ago, but nowadays running multiple X servers is rare. Anyway, I am talking to the devs who contributed the GL backend to see if all this still makes sense in 2023.

markwdalton commented 1 year ago

We would change the port. It would also conflict at times with tensorboard and another tensorflow job on the same node. Surprisingly I do not get the question to often, and it has been over a year for this question from our customers.

But moving to PyTorch also helped, as many have moved.

On Fri, Sep 22, 2023 at 4:32 PM Brice Goglin @.***> wrote:

By the way, the issue of the library not having a timeout is in core X11 libraries (in XOpenDisplay()). Those are veeery old and pretty much abandoned. It won't ever be fixed. Port 6000 is officially documented as Xserver port in /etc/services. Some distros also mark 6001-6007. Some don't. Opening 6006 in tensorboard would have been a bad idea 20 years ago, but nowadays running multiple X servers is rare. Anyway, I am talking to the devs who contributed the GL backend to see if all this still makes sense in 2023.

— Reply to this email directly, view it on GitHub https://github.com/open-mpi/ompi/issues/10025#issuecomment-1732054259, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGYAD77NHZGVWFWOACHGS3X3X7WJANCNFSM5OWP73HA . You are receiving this because you were mentioned.Message ID: @.***>

bjoe2k4 commented 2 months ago

I have just been bitten by this. A user was running something on TCP 127.0.0.1:600x .

OpenMPI v4.1.6 + PMIX v4.2.9 (+ HWLOC 2.7.0-2ubuntu1 from Ubuntu 22.04), and running interactively via mpirun -np x ... or srun via Slurm (pmix_v4). Setting the environment variable helped, however I was under the impression that this should not happen anymore (the fix already landed in pmix v4.2.7). OpenMPI v5.0.3 + pmix v5.0.3 did not suffer from this (fix landed in pmix v5.0.2).

rhc54 commented 2 months ago

I strongly doubt we went backwards on it, but I can offer no explanation unless you are accidentally pulling in an earlier version via your library path.

open-mpi / ompi