Open DevinBayly opened 2 years ago
Describe the bug
I'm creating a system for launching visit sessions with the U of A super computer, and managed to get everything working with visit 3.1.4. The discussion that I had to get this far is at this link #17120.
First, WOW! 🥇 That's really great work recalling where you started 😉
I get a couple of issues when I try to do the same thing with visit version 3.2.1.
Just wanna confirm you are trying to use a 3.2 engine (e.g. server) with a 3.2.1 client, correct? The client and server must agree in major and minor version numbers. I am pretty sure you've already guarded for that but wanted to make sure.
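For anyone scripting this check, here is a minimal sketch of the rule stated above (client and server must agree in major and minor version numbers). The helper names `major_minor` and `versions_compatible` are hypothetical and are not part of VisIt. The sketch assumes plain dotted version strings such as 3.2.1.

```python
def major_minor(version: str) -> tuple:
    """Return (major, minor) from a dotted version string such as '3.2.1'."""
    parts = version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    return int(parts[0]), int(parts[1])


def versions_compatible(client: str, server: str) -> bool:
    """True when client and server share major and minor numbers (patch may differ)."""
    return major_minor(client) == major_minor(server)


if __name__ == "__main__":
    # Hypothetical pairs: same 3.2 series, then a 3.1 vs 3.2 mismatch.
    print(versions_compatible("3.2.1", "3.2.2"))
    print(versions_compatible("3.1.4", "3.2.1"))
```

This only compares the strings you give it. It does not ask a running engine which version it actually is.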
In some instances the program freezes after trying to connect to the remote machine before I get a chance to select data and launch my compute engine session. I can verify that the remote machine shows processes started by visit, and I see log files in the home directory, but the logs end with some indication that the connection back to the host didn't work. I usually have to end the visit program from my local machine with ctrl-c.
/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl -dir /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64 -idle-timeout 480 -threads 16 -noloopback -host 10.141.250.198 -port 5602 -key 406161654b1ba84b5de4
VisIt component launcher started.
ParentProcess::Connect: Called with (numRead=1, numWrite=2, argc=14, argv={/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl, -dir, /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64, -idle-timeout, 480, -threads, 16, -noloopback, -host, 10.141.250.198, -port, 5602, -key, 406161654b1ba84b5de4})
ParentProcess::Connect: hostName = 10.141.250.198
ParentProcess::GetHostInfo: Calling gethostbyname("10.141.250.198")
ParentProcess::Connect: port = 5602
ParentProcess::Connect: securityKey = 406161654b1ba84b5de4
ParentProcess::Connect: Creating sockets
ParentProcess::Connect: Creating read sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine. Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine. You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Creating write sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine. Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine. You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::GetClientSocketDescriptor: Set up using port 5602
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine. Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine. You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
signalhandler_core: SIGSEGV!
A SEGV down in the bowels of this exchange smells like either a timing or protocol (version mismatch) issue.
The other problem that I seem to have is related to a version error. I don't fully understand how this is possible since both the client and the server versions of visit are 3.2.1.
(base) lil@vis:~/Documents/visit3_2_1.linux-x86_64/bin$ ./visit -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -debug 5
Running: gui3.2.1 -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -viewer -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host vis -port 5600 -key 9be97f62cdab2e8f87fd
Running: viewer3.2.1 -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host 127.0.0.1 -port 5600
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -mdserver -debug 5 -host vis -port 5601 -key 9be97f62cdab2e8f87fd
Running: mdserver3.2.1 -debug 5 -host 127.0.0.1 -port 5601
Version 3.2 of VisIt does not exist.
The ./visit: XXX [[: not found looks familiar. I think it's a non-bashism. And, I think we've been adjusting that code (look around line 135 in the diff) recently.
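As a rough way to spot this kind of problem, here is a minimal sketch that flags `[[` tests in a script whose shebang does not name bash. It rests on two assumptions that the thread does not confirm: that the `visit` launcher starts with a `#!/bin/sh` style shebang, and that `/bin/sh` on the client is a shell without `[[` (dash, for example). The function name `find_bash_tests` is hypothetical, and the comment stripping is deliberately naive.

```python
import re

# Matches a `[[` token at the start of a line or after whitespace.
BASH_TEST = re.compile(r"(^|\s)\[\[(\s|$)")


def find_bash_tests(script_text: str) -> list:
    """Return (line_number, line) pairs using `[[` in a script not marked as bash."""
    lines = script_text.splitlines()
    if lines and lines[0].startswith("#!") and "bash" in lines[0]:
        return []  # bash understands [[ ... ]], so there is nothing to report
    hits = []
    for number, line in enumerate(lines, start=1):
        # Naive comment handling: skip the shebang, drop everything after '#'.
        code = "" if line.startswith("#!") else line.split("#", 1)[0]
        if BASH_TEST.search(code):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up sample script, not the actual VisIt launcher.
    sample = '#!/bin/sh\nif [[ -n "$1" ]]; then\n  echo yes\nfi\n'
    print(find_bash_tests(sample))
```

If the assumptions hold, the usual fixes are either a POSIX `[ ... ]` test or a bash shebang. Which one the launcher should use is for the maintainers to decide.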
Hi @markcmiller86, thanks for jumping on this so fast!
Just wanna confirm you are trying to use a 3.2 engine (e.g. server) with a 3.2.1 client, correct? The client and server must agree in major and minor version numbers. I am pretty sure you've already guarded for that but wanted to make sure.
I believe I have, but the fact I didn't think engine numbers were different could be part of the issue. I believe my engine might be that version just because I got the https://github.com/visit-dav/visit/releases/download/v3.2.2/visit3_2_2.linux-x86_64-rhel7.tar.gz from here https://visit-dav.github.io/visit-website/releases-as-tables/#series-32 which belatedly I suppose means a 3.2 engine? I previously had only thought about client versions and tried to make them match.
A SEGV down in the bowels of this exchange smells like either a timing or protocol (version mismatch) issue.
Ah let me clear my logs and recreate the time out log tomorrow. I'm afraid maybe I posted a log from the wrong error.
The ./visit: XXX [[: not found looks familiar. I think it's a non-bashism. And, I think we've been adjusting that code (look around line 135 in the diff) recently.
Yea, they come up regardless of whether the connection works or not, but they are a bit distracting. The diff you pointed to does look like a better way to go.
Do you have any other files I can help provide to indicate what's happening when I get the version mismatch error?
Here's a log from the login node of my HPC that was created when the system experiences a hang on connection.
cat A.vcl.5.vlog
/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl -dir /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64 -idle-timeout 480 -threads 16 -noloopback -host 10.141.250.198 -port 5603 -key d7ac50285216873ed48f
VisIt component launcher started.
ParentProcess::Connect: Called with (numRead=1, numWrite=2, argc=14, argv={/groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64/3.2.1/linux-x86_64/bin/vcl, -dir, /groups/chrisreidy/baylyd/consults/visit_remote_support/visit3_2_1.linux-x86_64, -idle-timeout, 480, -threads, 16, -noloopback, -host, 10.141.250.198, -port, 5603, -key, d7ac50285216873ed48f})
ParentProcess::Connect: hostName = 10.141.250.198
ParentProcess::GetHostInfo: Calling gethostbyname("10.141.250.198")
ParentProcess::Connect: port = 5603
ParentProcess::Connect: securityKey = d7ac50285216873ed48f
ParentProcess::Connect: Creating sockets
ParentProcess::Connect: Creating read sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine. Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine. You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Creating write sockets
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine. Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine. You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
(If you see no messages after this one, VisIt was not
able to connect to the client machine. Nine times out
of ten, this is a firewall issue on the client machine.
It could also mean that VisIt was unable to resolve the
IP address for the client machine. You may need to verify the contents of /etc/hosts.)
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
signalhandler_core: SIGSEGV!
This freezing behavior was created with a mac client using visit 3.2.1 and the same host file. Again the entire process has to be forcibly terminated with the system monitor on the client side which may account for the SIGSEGV at the end of the log here.
@DevinBayly apologies for delay in responding after your follow-up. Any progress on this?
Hello there,
Naw, no changes, but thanks for checking in.
I switched to recommending that users launch visit within an Open OnDemand environment when they need visit 3.2.1. This meant we also needed to grab the mesa rhel binary, but that wasn't a problem.
Any other ideas how I can pinpoint where the failure is happening that leads to the hang?
On Tue, Feb 8, 2022, 4:44 PM markcmiller86 @.***> wrote:
@DevinBayly apologies for delay in responding after your follow-up. Any progress on this?
Ok, your vcl log shows a SEGV (segmentation violation...which is a serious bug). From there, the vcl process likely either hangs or dies entirely and the connection back to your client either doesn't happen or is lost. Can you grep for SEGV in any of the other .vlog files (if any) on either your client or the remote resource?
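If grep is awkward on one of the machines, the same search can be sketched in Python. This is a minimal sketch, not a VisIt tool: the function name `find_segv` is hypothetical, and it assumes the debug logs sit in a single directory and end in `.vlog`, like the `A.vcl.5.vlog` file shown above.

```python
from pathlib import Path


def find_segv(log_dir: str, pattern: str = "*.vlog", needle: str = "SEGV") -> dict:
    """Map each matching log file name to the 1-based line numbers containing `needle`."""
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log holds stray bytes.
        with path.open(errors="replace") as handle:
            numbers = [n for n, line in enumerate(handle, start=1) if needle in line]
        if numbers:
            results[path.name] = numbers
    return results
```

Calling `find_segv(".")` in the directory holding the logs returns an empty dict when no file mentions SEGV. Since `SIGSEGV` contains `SEGV`, the `signalhandler_core: SIGSEGV!` line is matched as well.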
The funny thing about that SEGV is I believe I observed the hang, looked at the log file and didn't see that line at the bottom, went back to the client and forcibly closed it using the task manager, and then when that had finished I noticed the SEGV at the end of the remote log. Is that possible? That the hang actually occurred first, before the SEGV?
I'll grep the other logs the next time that I'm on that system, and make sure to get back to you.
Cheers, Devin
On Tue, Feb 8, 2022 at 5:02 PM markcmiller86 @.***> wrote:
Ok, your vcl log shows a SEGV (segmentation violation...which is a serious bug). From there, the vcl process likely either hangs or dies entirely and the connection back to your client either doesn't happen or is lost. Can you grep for SEGV on any of the other (if any) .vlog files on either your client or the remote resource?
The funny thing about that SEGV is I believe I observed the hang, looked at the log file and didn't see that line at the bottom, went back to the client and forcibly closed it using the task manager, and then when that had finished I noticed the SEGV at the end of the remote log.
Ah, ok. Well, that potentially explains the SEGV then. The vcl could have been forced into a funky state as a result of other activity and it isn't on the critical path. Hmmm....
There are 3 instances of this logging info...
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::GetClientSocketDescriptor: Set up using port 5603
ParentProcess::GetClientSocketDescriptor: Creating a socket
ParentProcess::GetClientSocketDescriptor: Setting socket options
ParentProcess::GetClientSocketDescriptor: Calling connect
Followed by one instance of this logging info...
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
ParentProcess::GetClientSocketDescriptor: Connected socket
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
signalhandler_core: SIGSEGV!
I am thinking that, in normal behavior, the next line would have been something like....
Read 0 from writeConnections[1]
That said, I am finding this log info a tad confusing. Why do the entries for [0] do a read on writeConnection followed by readConnection, but the first entry for [1] is a read on readConnection? And what is the 0? Is that a count of the number of bytes read, or is it reading the literal number 0 in each case? I obviously don't have sufficient experience with these connections to know.
Ok, so based on a brief read of the associated code, https://github.com/visit-dav/visit/blob/d1fbae6caeb3364954f3c2b54eb0a27ae69fd889/src/common/comm/ParentProcess.C#L450-L463
readConnection
and the writeConnection
objects.0
Since the logging info shows a read on writeConnection[0] but none for a writeConnection[1], I believe vcl is seeing nWriteConnections=1 and nReadConnections>=2. I wonder if that is an odd state for the vcl...to have more read connections than write connections?
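The counting above can be repeated mechanically on other logs. Here is a minimal sketch that tallies the distinct indices seen in the "Read N from ...Connections[i]" lines. The function name `count_ordered_connections` is hypothetical. It only reports what was logged before the process stopped, so it cannot show the real values of nReadConnections or nWriteConnections inside vcl.

```python
import re
from collections import defaultdict

# Matches lines such as "Read 0 from writeConnections[0]".
READ_LINE = re.compile(r"Read (\d+) from (readConnections|writeConnections)\[(\d+)\]")


def count_ordered_connections(log_text: str) -> dict:
    """Count distinct indices logged per connection array in 'Read N from ...' lines."""
    seen = defaultdict(set)
    for match in READ_LINE.finditer(log_text):
        _value, array, index = match.groups()
        seen[array].add(int(index))
    return {array: len(indices) for array, indices in seen.items()}


# Excerpt copied from the vcl log earlier in this thread.
excerpt = """\
ParentProcess::Connect: Ordering the connections.
Read 0 from writeConnections[0]
Read 0 from readConnections[0]
Read 0 from readConnections[1]
ParentProcess::Connect: Exchanging type representations.
"""

if __name__ == "__main__":
    print(count_ordered_connections(excerpt))
```

For the excerpt this reports one write connection and two read connections, matching the reading above. Note that the vcl command line in the same log says numRead=1 and numWrite=2.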
In looking at the code section identified above, I wonder why we use int() in the debug5 message to display contents of buf[3] but don't also use int() when buf[3] is used to index into connections[buf[3]] on lines 454 and 462? @brugger1 might you have any insights here?
Sorry for delay! Also, I think that I need to check if the hang issue is actually #13246
I guess this might mean that the main issue is back to being why the system thinks there's a mismatch between the 3.2.1 client and 3.2 server.
just reposting that material from above
The other problem that I seem to have is related to a version error. I don't fully understand how this is possible since both the client and the server versions of visit are 3.2.1.
(base) lil@vis:~/Documents/visit3_2_1.linux-x86_64/bin$ ./visit -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -debug 5
Running: gui3.2.1 -debug 5
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -viewer -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host vis -port 5600 -key 9be97f62cdab2e8f87fd
Running: viewer3.2.1 -debug 5 -geometry 744x687+550+54 -borders 26,4,4,4 -shift 0,0 -preshift 4,26 -defer -host 127.0.0.1 -port 5600
./visit: 120: [[: not found
System python ./frontendlauncher.py ./visit -v 3.2 -mdserver -debug 5 -host vis -port 5601 -key 9be97f62cdab2e8f87fd
Running: mdserver3.2.1 -debug 5 -host 127.0.0.1 -port 5601
Version 3.2 of VisIt does not exist.
This happened once with the 3.1.4 but then I started the visit client with ./visit -debug 5 -v 3.1.4 instead of ./visit -debug 5 and then it worked.
Describe the bug
I'm creating a system for launching visit sessions with the University of Arizona super computer, and managed to get everything working with visit 3.1.4. The discussion that I had to get this far is at this link https://github.com/visit-dav/visit/discussions/17120.
I get a couple of issues when I try to do the same thing with visit version 3.2.1. In some instances the program freezes after trying to connect to the remote machine before I get a chance to select data and launch my compute engine session. I can verify that the remote machine shows processes started by visit, and I see log files in the home directory, but the logs end with some indication that the connection back to the host didn't work. I usually have to end the visit program from my local machine with ctrl-c.
The other problem that I seem to have is related to a version error. I don't fully understand how this is possible since both the client and the server versions of visit are 3.2.1.
This happened once with the 3.1.4 but then I started the visit client with ./visit -debug 5 -v 3.1.4 instead of ./visit -debug 5 and then it worked.
In neither of these instances did visit crash; it either froze or presented the version error and stopped the remote connection attempt.
Desktop
I can try to provide any relevant tests and files that you need. Thanks so much!