Closed: akashban closed this issue 3 years ago.
Hey @akashban: I am glad you are asking, but that looks like a somewhat different issue:
Connection to c443-131 closed by remote host
This seems to be a connection between compute nodes, and it likely affected the task startup. I assume you are using xsede.stampede2_ssh as the resource label? I would like to ask you to switch to xsede.stampede2_mirun or xsede.stampede2_mirun, depending on whether your application likes mpirun or ibrun more (if you don't care, please use mpirun).
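For context, the resource label is typically selected through the EnTK AppManager's resource description. A minimal sketch, assuming the usual EnTK resource-description keys; the walltime, cpus, and project values below are illustrative placeholders, not taken from this thread:

```python
# Sketch of an EnTK resource description; values are placeholders.
# In a real script this dict would be assigned to
# radical.entk.AppManager.resource_desc before running the workflow.
resource_desc = {
    'resource': 'xsede.stampede2_mpirun',  # or 'xsede.stampede2_ibrun'
    'walltime': 60,                        # minutes (placeholder)
    'cpus'    : 68,                        # one Stampede2 KNL node
    'project' : 'TG-XXXXXXX',              # placeholder allocation ID
}
```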
Hi @andre-merzky , yes, I am using xsede.stampede2_ssh. Is there a typo in the above comment? Did you mean xsede.stampede2_mpirun or xsede.stampede2_ibrun? Is there a document with all the available resource labels?
Did you mean xsede.stampede2_mpirun or xsede.stampede2_ibrun?
Oh, apologies, I did.
Is there a document with all the available resource labels?
Documentation on those labels is vague, mostly because the labels and configurations we use for different workloads and projects vary widely and often. The current configurations for XSEDE (including stampede2) are listed here but, as said, they may change from release to release. We are working on stabilizing the configs, but we are not there yet...
Thanks for sharing the documentation on resource labels. I will check in with you before I use a new resource label in future EnTK versions. xsede.stampede2_ibrun is working fine on the development partition of Stampede2. I will keep you posted regarding the performance on the normal partition (it will take 2 days to finish).
Hi @andre-merzky, it took me some time to get results from Stampede2 (power outage in Texas). I am using the xsede.stampede2_ibrun resource tag, and I am still getting the same error after around 24 hours:
Connection to c429-051 closed by remote host.
[mpiexec@c407-064.stampede2.tacc.utexas.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:864): connection to proxy 3 at host c429-051 failed
[mpiexec@c407-064.stampede2.tacc.utexas.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c407-064.stampede2.tacc.utexas.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@c407-064.stampede2.tacc.utexas.edu] main (../../ui/mpich/mpiexec.c:1149): process manager error waiting for completion
I have attached the basic log files. debug_host.log debug_remote.log
Hi @andre-merzky, just wanted to follow up on the above. I am still getting the same error when I try to run on Stampede2. I wanted to check whether Stampede2 is terminating my connection, or whether there is something I am doing wrong.
@andre-merzky @iparask -- gentle ping to try to unstick this, as it's stalling science progress.
Sorry @akashban : this fell off the table.
But also, this is not good news: I am afraid that we are currently not able to avoid that connection error. The failure happens at a level below our own code, and we don't have the ability to change the behavior of that layer.
@shantenujha : This is the launcher (ibrun, and underneath mpiexec) failing. Any suggestion on how to proceed? Do we have someone at TACC we can ping for support?
@shantenujha, @akashban: any suggestion on how to proceed?
Hi @andre-merzky, I am currently using the normal partition on Stampede2 to launch my jobs. They typically sit in the queue for 15-20 hours and are kicked out after around 10 hours of simulation. I have developed a method to take the restart files (the simulation in its final state when it got kicked out) and launch it again using the EnTK script. I repeat the process until I get the desired simulation run time. Please let me know your thoughts on this brute-force process. Is there a better way of doing it? I am doing this to quickly get some data until the connection issue is fixed.
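The brute-force restart loop described above can be sketched roughly as follows. The function names and progress numbers are hypothetical stand-ins, not the actual EnTK API; the real launch step would submit the workflow with the latest restart file as input:

```python
# Sketch of a restart-driven driver loop (illustrative names only).

def latest_restart(outputs):
    """Return the most recent restart file, or None on a fresh start."""
    return outputs[-1] if outputs else None

def run_until(target_ns, ns_per_attempt):
    """Relaunch until the target simulated time (in ns) is reached."""
    outputs = []   # restart files produced by each partial run
    total = 0.0
    while total < target_ns:
        restart = latest_restart(outputs)
        # launch_entk_workflow(restart)  # <- would submit the real job here
        total += ns_per_attempt          # progress made before the kill
        outputs.append(f"state_{len(outputs)}.restart")
    return total, len(outputs)

# e.g. 10 ns target, ~2.5 ns survives each attempt -> 4 relaunches
total, attempts = run_until(10.0, 2.5)
```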
Thanks for the reply! I can't think of a better way right now, really. Is there a way to trade, say, x long-running simulations against x/2 simulations which run half as long?
Yes, I could do that. In my MD simulations, I could vary the total number of MD steps (reduce it by half) or the timestep (double it). Both strategies lead either to simulation failure or to not being able to visualize the scientific properties of interest. The reason my workflow needs to run over 2 days is that my systems are really big; hence, to observe interesting scientific phenomena, the simulations need to run for 2 days.
Both strategies lead either to simulation failure or to not being able to visualize the scientific properties of interest
Degradation of science results is obviously not an acceptable tradeoff :-P We'll discuss this in our group and will try to come up with a solution - but I am not sure if and how quickly we can resolve this...
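The tradeoff discussed above can be made concrete with some back-of-the-envelope arithmetic. The timestep and step count below are illustrative, not taken from this thread:

```python
# Illustrative MD numbers only (not from this thread).
dt_fs   = 2.0        # timestep in femtoseconds
n_steps = 500_000    # number of MD steps

# Simulated time in nanoseconds: steps * timestep.
sim_ns = dt_fs * n_steps / 1e6

# Halving the step count halves the simulated time, so interesting
# long-timescale phenomena may never appear:
half_steps_ns = dt_fs * (n_steps // 2) / 1e6

# Doubling the timestep recovers the full simulated time in half the
# steps, but risks integrator instability (i.e. simulation failure):
big_dt_ns = (2 * dt_fs) * (n_steps // 2) / 1e6
```

This is why neither shortening nor coarsening the runs preserves the science: one cuts the observed timescale, the other threatens stability.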
A couple of questions, if I may:
Thanks, Andre.
Please let me know if you have any more questions. I could share some more specifics if required.
Hi @akashban, apologies for the delay in getting back to you on this. Assuming this is still an issue for you, here is a summary and a couple of possible ways forward.
We tried and discussed a couple of ways to mitigate this problem but, unfortunately, we did not find a viable solution that can be implemented in a short amount of time. We have a medium-term solution planned but it will probably become available at the end of the year.
As mentioned, we may consider 2 workarounds:
Hi @mturilli, thank you so much for your detailed inputs. I have used the first approach (using checkpoints) to generate data for our publication. It seemed like the most straightforward workaround. I would need some help regarding launching EnTK scripts from a login node. Could you please point me to the documentation that talks about running EnTK from a login node?
@akashban: on Stampede2, running from the login node is not too different from running from your laptop, with one exception: you would need to run the script in a tmux session, so that the shell does not get closed when you disconnect. Are you familiar with tmux (or screen)?
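A minimal tmux workflow on the login node might look like the following. The session name `entk` and the script name `my_entk_script.py` are arbitrary placeholders:

```shell
# Start a detached, named tmux session on the login node that runs the
# EnTK script; it keeps running after you log out.
tmux new-session -d -s entk 'python my_entk_script.py'

# List running sessions to confirm it started:
tmux ls

# Later (even after reconnecting via ssh), reattach to check progress:
tmux attach -t entk
# Detach again without stopping the script with: Ctrl-b d
```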
@andre-merzky: got it, I can run EnTK scripts in a tmux session from the Stampede2 login node. Will keep you posted on this.
Thanks - let us know if you need further help on this.
Hi @andre-merzky @lee212
Just wanted to follow up with you regarding the network connection issue. I was running 2 EnTK scripts over the weekend, which aimed at launching simulation workflows on Stampede2. Both were terminated when they crossed 24 hours (I asked for 48 hours of walltime). I just wanted to show you the error message and make sure that the issue is caused by Stampede2 and not something on my end. (I have attached the basic log files from the host and remote ends; I am not able to attach the modified sandbox (after the radical-fetch command) due to its size.)
imb F 32% step 239100, will finish Mon Feb 15 13:03:59 2021
imb F 30% step 239200, will finish Mon Feb 15 13:03:59 2021
imb F 32% step 239300, will finish Mon Feb 15 13:03:59 2021
Connection to c443-131 closed by remote host.
[mpiexec@c438-133.stampede2.tacc.utexas.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:864): connection to proxy 2 at host c443-131 failed
[mpiexec@c438-133.stampede2.tacc.utexas.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c438-133.stampede2.tacc.utexas.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@c438-133.stampede2.tacc.utexas.edu] main (../../ui/mpich/mpiexec.c:1149): process manager error waiting for completion
Thank you.
Kind Regards, Akash debug_host.log debug_remote.log