visit-dav / visit

VisIt - Visualization and Data Analysis for Mesh-based Scientific Data
https://visit.llnl.gov
BSD 3-Clause "New" or "Revised" License

Client Server setup works for 3.1.4 but hangs mysteriously for 3.2.1 #17289

Open DevinBayly opened 2 years ago

DevinBayly commented 2 years ago

Describe the bug

I'm creating a system for launching visit sessions on the University of Arizona supercomputer, and I managed to get everything working with visit 3.1.4. The discussion that got me this far is at https://github.com/visit-dav/visit/discussions/17120.

I get a couple of issues when I try to do the same thing with visit version 3.2.1. In some instances the program freezes after trying to connect to the remote machine before I get a chance to select data and launch my compute engine session. I can verify that the remote machine shows processes started by visit, and I see log files in the home directory, but the logs end with some indication that the connection back to the host didn't work. I usually have to end the visit program from my local machine with ctrl-c.

/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl -dir /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64 -idle-timeout 480 -threads 16 -noloopback -host 10.141.250.198 -port 5602 -key 406161654b1ba84b5de4 
VisIt component launcher started.
ParentProcess::Connect: Called with (numRead=1, numWrite=2, argc=14, argv={/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl, -dir, /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64, -idle-timeout, 480, -threads, 16, -noloopback, -host, 10.141.250.198, -port, 5602, -key, 406161654b1ba84b5de4})
ParentProcess::Connect: hostName = 10.141.250.198
ParentProcess::GetHostInfo: Calling gethostbyname("10.141.250.198")
ParentProcess::Connect: port = 5602
ParentProcess::Connect: securityKey = 406161654b1ba84b5de4
ParentProcess::Connect: Creating sockets
ParentProcess::Connect: Creating read sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Creating write sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
signalhandler_core: SIGSEGV!
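
The hint printed after each "Calling connect" line names a client-side firewall as the usual cause when nothing follows it. In this log "Connected socket" does follow, so the sockets were opened, but when that is in doubt a plain TCP check run from the remote machine can rule the firewall in or out. The following is a minimal sketch, not part of VisIt: `can_connect` is a hypothetical helper name, and the host and port are simply the values from the log above. The port only accepts connections while the local VisIt client is still listening on it.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only tests reachability and
    does not speak VisIt's protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Values taken from the vcl command line in the log above.
    print(can_connect("10.141.250.198", 5602))
```

A `False` result while the client is waiting points at the network path (firewall, routing) rather than at VisIt itself.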

The other problem that I seem to have is related to a version error. I don't fully understand how this is possible since both the client and the server versions of visit are 3.2.1.

(base) lil@vis:~/Documents/visit3_2_1.linux-x86_64/bin$ ./visit -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -debug 5
Running: gui3.2.1 -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -viewer -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host vis -port 5600 -key 9be97f62cdab2e8f87fd
Running: viewer3.2.1 -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host 127.0.0.1 -port 5600
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -mdserver -debug 5 -host vis -port 5601 -key 9be97f62cdab2e8f87fd
Running: mdserver3.2.1 -debug 5 -host 127.0.0.1 -port 5601
Version 3.2 of VisIt does not exist.

This happened once with 3.1.4, but then I started the visit client with ./visit -debug 5 -v 3.1.4 instead of ./visit -debug 5, and it worked.

In neither of these instances did visit crash; it either froze or presented the version error and stopped the remote connection attempt.
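
The vcl path in the first log (`.../visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl`) suggests that the install root holds one directory per full version number. If the launcher resolves `-v` against those directory names (an assumption; nothing in this thread confirms how `frontendlauncher.py` does it), then `-v 3.2` would find nothing when only `3.2.1` exists. A small sketch for listing what is actually installed, with hypothetical helper names:

```python
import re
from pathlib import Path

# Directory names that look like a version number, e.g. "3.2" or "3.2.1".
VERSION_DIR = re.compile(r"^\d+\.\d+(\.\d+)?$")


def installed_versions(install_dir):
    """List version-named subdirectories of a VisIt install root."""
    return sorted(
        p.name
        for p in Path(install_dir).iterdir()
        if p.is_dir() and VERSION_DIR.match(p.name)
    )


def matching_versions(install_dir, requested):
    """Installed versions equal to `requested` or starting with `requested.`."""
    return [
        v
        for v in installed_versions(install_dir)
        if v == requested or v.startswith(requested + ".")
    ]
```

Calling `installed_versions` on the install root on both the client and the remote machine shows which exact strings could be passed to `-v`.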

Desktop

I can try to provide any relevant tests and files that you need. Thanks so much!

markcmiller86 commented 2 years ago

Describe the bug

I'm creating a system for launching visit sessions with the U of A supercomputer, and managed to get everything working with visit 3.1.4. The discussion that I had to get this far is at this link #17120.

First, WOW! 🥇 That's really great work recalling where you started 😉

I get a couple of issues when I try to do the same thing with visit version 3.2.1.

Just wanna confirm you are trying to use a 3.2 engine (e.g. server) with a 3.2.1 client, correct? The client and server must agree in major and minor version numbers. I am pretty sure you've already guarded for that but wanted to make sure.
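
The rule stated above (client and server must agree in major and minor version numbers) can be written down as a quick check. This is an illustrative sketch only: the function names are hypothetical and this is not how VisIt itself performs the comparison.

```python
def major_minor(version):
    """Return (major, minor) from a string such as '3.2.1' or '3.2'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def versions_compatible(client, server):
    """True when client and server share major and minor numbers."""
    return major_minor(client) == major_minor(server)


print(versions_compatible("3.2.1", "3.2.2"))  # True: both are 3.2.x
print(versions_compatible("3.1.4", "3.2.1"))  # False: 3.1 vs 3.2
```

Under this rule the patch number (the third field) is ignored, so a 3.2.1 client and a 3.2.2 server would count as matching.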

In some instances the program freezes after trying to connect to the remote machine before I get a chance to select data and launch my compute engine session. I can verify that the remote machine shows processes started by visit, and I see log files in the home directory, but the logs end with some indication that the connection back to the host didn't work. I usually have to end the visit program from my local machine with ctrl-c.

/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl -dir /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64 -idle-timeout 480 -threads 16 -noloopback -host 10.141.250.198 -port 5602 -key 406161654b1ba84b5de4 
VisIt component launcher started.
ParentProcess::Connect: Called with (numRead=1, numWrite=2, argc=14, argv={/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl, -dir, /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64, -idle-timeout, 480, -threads, 16, -noloopback, -host, 10.141.250.198, -port, 5602, -key, 406161654b1ba84b5de4})
ParentProcess::Connect: hostName = 10.141.250.198
ParentProcess::GetHostInfo: Calling gethostbyname("10.141.250.198")
ParentProcess::Connect: port = 5602
ParentProcess::Connect: securityKey = 406161654b1ba84b5de4
ParentProcess::Connect: Creating sockets
ParentProcess::Connect: Creating read sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Creating write sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
signalhandler_core: SIGSEGV!

A SEGV down in the bowels of this exchange smells like either a timing or protocol (version mismatch) issue.

The other problem that I seem to have is related to a version error. I don't fully understand how this is possible since both the client and the server versions of visit are 3.2.1.

(base) lil@vis:~/Documents/visit3_2_1.linux-x86_64/bin$ ./visit -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -debug 5
Running: gui3.2.1 -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -viewer -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host vis -port 5600 -key 9be97f62cdab2e8f87fd
Running: viewer3.2.1 -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host 127.0.0.1 -port 5600
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -mdserver -debug 5 -host vis -port 5601 -key 9be97f62cdab2e8f87fd
Running: mdserver3.2.1 -debug 5 -host 127.0.0.1 -port 5601
Version 3.2 of VisIt does not exist.

The ./visit: XXX: [[: not found message looks familiar. I think it's a bashism being run by a non-bash shell. And I think we've been adjusting that code recently (look around line 135 in the diff).
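
For context, `[[` is a bash keyword that is not part of POSIX sh, and the `./visit: 120: [[: not found` wording is typical of a plain sh such as dash trying to run `[[` as a command. A rough way to spot such lines in a launcher script is sketched below. The helper name and the sample script are made up for illustration, and the scan is a simple heuristic rather than a real shell parser.

```python
import re

# "[[" used as a word of its own, e.g. in: if [[ -n "$X" ]]; then
DOUBLE_BRACKET = re.compile(r"(^|\s)\[\[\s")


def find_double_brackets(script_text):
    """Return 1-based line numbers using [[ in a script not run by bash."""
    lines = script_text.splitlines()
    shebang = lines[0] if lines and lines[0].startswith("#!") else ""
    if "bash" in shebang:
        return []  # bash understands [[, nothing to report
    return [
        n
        for n, line in enumerate(lines, 1)
        if not line.lstrip().startswith("#") and DOUBLE_BRACKET.search(line)
    ]


# Hypothetical script text, not the real ./visit launcher.
example = '#!/bin/sh\nif [[ -n "$SOME_VAR" ]]; then\n    echo "set"\nfi\n'
print(find_double_brackets(example))  # [2]
```

The usual fixes are either a `#!/bin/bash` shebang or replacing `[[ ... ]]` with the POSIX `[ ... ]` test.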

DevinBayly commented 2 years ago

Hi @markcmiller86, thanks for jumping on this so fast!

Just wanna confirm you are trying to use a 3.2 engine (e.g. server) with a 3.2.1 client, correct? The client and server must agree in major and minor version numbers. I am pretty sure you've already guarded for that but wanted to make sure.

I believe I have, but the fact that I didn't realize engine version numbers could differ might be part of the issue. I believe my engine is that version simply because I downloaded https://github.com/visit-dav/visit/releases/download/v3.2.2/visit3_2_2.linux-x86_64-rhel7.tar.gz from https://visit-dav.github.io/visit-website/releases-as-tables/#series-32, which I suppose, belatedly, means a 3.2 engine? Previously I had only thought about client versions and tried to make them match.

A SEGV down in the bowels of this exchange smells like either a timing or protocol (version mismatch) issue.

Ah, let me clear my logs and recreate the timeout log tomorrow. I'm afraid I may have posted a log from the wrong error.

The ./visit: XXX: [[: not found message looks familiar. I think it's a bashism being run by a non-bash shell. And I think we've been adjusting that code recently (look around line 135 in the diff).

Yeah, they come up regardless of whether the connection works or not, but they are a bit distracting. The diff you pointed to does look like a better way to go.

Are there any other files I can provide to show what's happening when I get the version mismatch error?

DevinBayly commented 2 years ago

Here's a log from the login node of my HPC system that was created when the connection hung.

cat A.vcl.5.vlog
/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl -dir /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64 -idle-timeout 480 -threads 16 -noloopback -host 10.141.250.198 -port 5603 -key d7ac50285216873ed48f 
VisIt component launcher started.
ParentProcess::Connect: Called with (numRead=1, numWrite=2, argc=14, argv={/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl, -dir, /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64, -idle-timeout, 480, -threads, 16, -noloopback, -host, 10.141.250.198, -port, 5603, -key, d7ac50285216873ed48f})
ParentProcess::Connect: hostName = 10.141.250.198
ParentProcess::GetHostInfo: Calling gethostbyname("10.141.250.198")
ParentProcess::Connect: port = 5603
ParentProcess::Connect: securityKey = d7ac50285216873ed48f
ParentProcess::Connect: Creating sockets
ParentProcess::Connect: Creating read sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Creating write sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine.  Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine.  You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
signalhandler_core: SIGSEGV!

This freezing behavior was reproduced with a Mac client using visit 3.2.1 and the same host file. Again, the entire process has to be forcibly terminated with the system monitor on the client side, which may account for the SIGSEGV at the end of the log here.

markcmiller86 commented 2 years ago

@DevinBayly apologies for the delay in responding after your follow-up. Any progress on this?

DevinBayly commented 2 years ago

Hello there,

Naw, no changes, but thanks for checking in.

I switched to recommending that users launch visit within an Open OnDemand environment when they need visit 3.2.1. This meant we also needed to grab the Mesa RHEL binary, but that wasn't a problem.

Any other ideas for how I can pinpoint where the failure that leads to the hang is happening?

On Tue, Feb 8, 2022, 4:44 PM markcmiller86 wrote:

@DevinBayly apologies for the delay in responding after your follow-up. Any progress on this?

markcmiller86 commented 2 years ago

Ok, your vcl log shows a SEGV (segmentation violation...which is a serious bug). From there, the vcl process likely either hangs or dies entirely and the connection back to your client either doesn't happen or is lost. Can you grep for SEGV on any of the other (if any) .vlog files on either your client or the remote resource?
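
The grep suggested above can also be scripted so that the same check runs on both the client and the remote machine. A minimal sketch, assuming the debug logs are the `*.vlog` files in a single directory (earlier in the thread they were found in the home directory); the function name is hypothetical.

```python
from pathlib import Path


def find_in_vlogs(log_dir, needle="SEGV"):
    """Map each .vlog file name to its (line number, line) pairs containing needle."""
    hits = {}
    for path in sorted(Path(log_dir).glob("*.vlog")):
        # errors="replace" keeps the scan going if a log holds odd bytes.
        with open(path, errors="replace") as handle:
            matches = [
                (n, line.rstrip("\n"))
                for n, line in enumerate(handle, 1)
                if needle in line
            ]
        if matches:
            hits[path.name] = matches
    return hits
```

For example, `find_in_vlogs(Path.home())` returns an empty dict when no log in the home directory mentions SEGV.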

DevinBayly commented 2 years ago

The funny thing about that SEGV is that I believe I observed the hang, looked at the log file and didn't see that line at the bottom, went back to the client and forcibly closed it using the task manager, and only after that had finished noticed the SEGV at the end of the remote log. Is that possible? That the hang actually occurred before the SEGV?

I'll grep the other logs the next time that I'm on that system, and make sure to get back to you.

Cheers, Devin

On Tue, Feb 8, 2022 at 5:02 PM markcmiller86 wrote:

Ok, your vcl log shows a SEGV (segmentation violation...which is a serious bug). From there, the vcl process likely either hangs or dies entirely and the connection back to your client either doesn't happen or is lost. Can you grep for SEGV on any of the other (if any) .vlog files on either your client or the remote resource?

markcmiller86 commented 2 years ago

The funny thing about that SEGV is I believe I observed the hang, looked at the log file and didn't see that line at the bottom, went back to the client and forcibly closed it using the task manager, and then when that had finished I noticed the SEGV at the end of the remote log.

Ah, ok. Well, that potentially explains the SEGV then. The vcl could have been forced into a funky state as a result of other activity and it isn't on the critical path. Hmmm....

There are 3 instances of this logging info...

ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect

Followed by one instance of this logging info...

ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
signalhandler_core: SIGSEGV!

I am thinking that, in normal behavior, the next line would have been something like...

Read 0 from writeConnections[1]

That said, I am finding this log info a tad confusing...

I obviously don't have sufficient experience with these connections to know.

markcmiller86 commented 2 years ago

Ok, so based on a brief read of the associated code, https://github.com/visit-dav/visit/blob/d1fbae6caeb3364954f3c2b54eb0a27ae69fd889/src/common/comm/ParentProcess.C#L450-L463

Since the logging info shows a read on writeConnections[0] but none for writeConnections[1], I believe vcl is seeing nWriteConnections=1 and nReadConnections>=2. I wonder if that is an odd state for the vcl, to have more read connections than write connections?
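
The counting done by eye above can be repeated on any vcl log by collecting the indices that appear in the "Read N from ...Connections[i]" lines. The sketch below only parses the log text; it makes no claim about what ParentProcess.C does internally, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as: Read 0 from writeConnections[0]
READ_LINE = re.compile(r"Read (\d+) from (readConnections|writeConnections)\[(\d+)\]")


def connection_indices(log_text):
    """Return the sorted connection indices seen for each connection array."""
    seen = defaultdict(set)
    for match in READ_LINE.finditer(log_text):
        seen[match.group(2)].add(int(match.group(3)))
    return {name: sorted(indices) for name, indices in seen.items()}


# Lines copied from the vcl log earlier in this thread.
sample = """ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
"""
print(connection_indices(sample))
# {'writeConnections': [0], 'readConnections': [0, 1]}
```

The result can then be compared against the `numRead=1, numWrite=2` values reported on the "Called with" line of the same log.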

markcmiller86 commented 2 years ago

In looking at the code section identified above, I wonder why we use int() in the debug5 message to display the contents of buf[3] but don't also use int() when buf[3] is used to index into connections[buf[3]] on lines 454 and 462? @brugger1 might you have any insights here?

DevinBayly commented 2 years ago

Sorry for the delay! Also, I think that I need to check whether the hang issue is actually #13246.

I guess this might mean that the main issue is back to being why the system thinks there's a mismatch between the 3.2.1 client and 3.2 server.

Just reposting that material from above:

The other problem that I seem to have is related to a version error. I don't fully understand how this is possible since both the client and the server versions of visit are 3.2.1.

(base) lil@vis:~/Documents/visit3_2_1.linux-x86_64/bin$ ./visit -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -debug 5
Running: gui3.2.1 -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -viewer -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host vis -port 5600 -key 9be97f62cdab2e8f87fd
Running: viewer3.2.1 -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host 127.0.0.1 -port 5600
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -mdserver -debug 5 -host vis -port 5601 -key 9be97f62cdab2e8f87fd
Running: mdserver3.2.1 -debug 5 -host 127.0.0.1 -port 5601
Version 3.2 of VisIt does not exist.

This happened once with 3.1.4, but then I started the visit client with ./visit -debug 5 -v 3.1.4 instead of ./visit -debug 5, and it worked.