radical-cybertools / radical.entk

The RADICAL Ensemble Toolkit
https://radical-cybertools.github.io/entk/index.html

Follow up on Stampede2 terminating connection after 24 hours #551

Closed: akashban closed this issue 3 years ago

akashban commented 3 years ago

Hi @andre-merzky @lee212

Just wanted to follow up with you regarding the network connection issue. Over the weekend I was running two EnTK scripts that launch simulation workflows on Stampede2. Both were terminated when they crossed 24 hours (I had asked for 48 hours of walltime). I just wanted to show you the error message and make sure that the issue is caused by Stampede2 and not by something on my end. (I have attached the basic log files from the host and remote ends; I am not able to attach the modified sandbox (after the radical-fetch command) because of its size.)

imb F 32% step 239100, will finish Mon Feb 15 13:03:59 2021
imb F 30% step 239200, will finish Mon Feb 15 13:03:59 2021
imb F 32% step 239300, will finish Mon Feb 15 13:03:59 2021
Connection to c443-131 closed by remote host.
[mpiexec@c438-133.stampede2.tacc.utexas.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:864): connection to proxy 2 at host c443-131 failed
[mpiexec@c438-133.stampede2.tacc.utexas.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c438-133.stampede2.tacc.utexas.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@c438-133.stampede2.tacc.utexas.edu] main (../../ui/mpich/mpiexec.c:1149): process manager error waiting for completion

Thank you.

Kind Regards,
Akash

Attachments: debug_host.log, debug_remote.log

andre-merzky commented 3 years ago

Hey @akashban : I am glad you are asking, but that looks like a somewhat different issue:

Connection to c443-131 closed by remote host

This seems to be a connection between compute nodes, which likely affected the task startup. I assume you are using xsede.stampede2_ssh as the resource label? I would like to ask you to switch to xsede.stampede2_mirun or xsede.stampede2_mirun, depending on whether your application prefers mpirun or ibrun (if you don't care, please use mpirun).

akashban commented 3 years ago

Hi @andre-merzky , yes, I am using xsede.stampede2_ssh. Is there a typo in the above comment? Did you mean xsede.stampede2_mpirun or xsede.stampede2_ibrun? Is there a document with all the available resource labels?

andre-merzky commented 3 years ago

Did you mean xsede.stampede2_mpirun or xsede.stampede2_ibrun?

Oh, apologies, I did.

Is there a document with all the available resource labels?

Documentation on those labels is vague, mostly because the labels and configurations we use for different workloads and projects vary widely and often. The current configurations for XSEDE (including Stampede2) are listed here, but, as mentioned, they may change from release to release. We are working on stabilizing the configs, but we are not there yet...
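
For reference, here is a minimal sketch of where such a resource label is selected in an EnTK script; the project, queue, walltime and core count below are placeholders rather than values from this thread:

    from radical.entk import AppManager

    appman = AppManager()                        # assumes a local RabbitMQ with default host/port

    # The 'resource' key is where the label discussed above is chosen; all
    # other values are illustrative placeholders.
    appman.resource_desc = {
        'resource': 'xsede.stampede2_ibrun',     # or 'xsede.stampede2_mpirun'
        'project' : 'TG-XXXXXXXXX',              # placeholder XSEDE allocation
        'queue'   : 'normal',                    # e.g. 'development' for short test runs
        'walltime': 48 * 60,                     # minutes
        'cpus'    : 272,                         # placeholder core count
    }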

akashban commented 3 years ago

Thanks for sharing the documentation on resource labels. I will check in with you before I use a new resource label in future EnTK versions. xsede.stampede2_ibrun is working fine on the development partition of Stampede2. I will keep you posted about the performance on the normal partition (it will take 2 days to finish).

akashban commented 3 years ago

Hi @andre-merzky, it took me some time to get results from Stampede2 (power outage in Texas). I am using the xsede.stampede2_ibrun resource label, and I am still getting the same error after around 24 hours:

Connection to c429-051 closed by remote host.
[mpiexec@c407-064.stampede2.tacc.utexas.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:864): connection to proxy 3 at host c429-051 failed
[mpiexec@c407-064.stampede2.tacc.utexas.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c407-064.stampede2.tacc.utexas.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@c407-064.stampede2.tacc.utexas.edu] main (../../ui/mpich/mpiexec.c:1149): process manager error waiting for completion

I have attached the basic log files: debug_host.log, debug_remote.log

akashban commented 3 years ago

Hi @andre-merzky, just wanted to follow up on the above. I am still getting the same error when I try to run on Stampede2. I just wanted to check whether Stampede2 is terminating my connection, or whether there is something I am doing wrong.

shantenujha commented 3 years ago

@andre-merzky @iparask -- gentle ping to try to get this unstuck, as it's stalling science progress.

andre-merzky commented 3 years ago

Sorry @akashban : this fell off the table.

But this is also not good news: I am afraid that we are currently not able to avoid that connection error. The failure happens at a level below our own code, and we don't have the ability to change the behavior of that layer.

@shantenujha : This is the launcher (ibrun, and underneath mpiexec) failing. Any suggestion on how to proceed? Do we have someone at TACC we can ping for support?

andre-merzky commented 3 years ago

@shantenujha, @akashban: any suggestion on how to proceed?

akashban commented 3 years ago

Hi @andre-merzky, I am currently using the normal partition on Stampede2 to launch my jobs. They are typically in the queue for 15-20 hours, and are kicked out after around 10 hours of simulation. I have developed a method to take the restart files (the simulation in its final state when it was kicked out) and launch them again using the EnTK script. I repeat the process until I reach the desired simulation run time. Please let me know your thoughts on this brute-force process. Is there a better way of doing it? I am doing this to quickly get some data until the connection issue is fixed.
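
For context, here is a rough sketch of this restart-chaining approach; the MD executable, input deck and restart file names are hypothetical placeholders, not the actual scripts used here:

    import sys
    from radical.entk import Pipeline, Stage, Task, AppManager

    # Each invocation of this script runs one simulation segment within the
    # walltime limit; the restart file written by the previous segment (if
    # any) is passed on the command line.
    restart_file = sys.argv[1] if len(sys.argv) > 1 else None

    t = Task()
    t.executable = 'namd2'                       # placeholder MD engine
    t.arguments  = ['input.conf']                # placeholder input deck
    if restart_file:
        t.copy_input_data = [restart_file]       # stage in the previous checkpoint
    t.download_output_data = ['restart.coor']    # hypothetical restart file to fetch back

    s = Stage()
    s.add_tasks(t)
    p = Pipeline()
    p.add_stages(s)

    appman = AppManager()                        # assumes a local RabbitMQ with defaults
    appman.resource_desc = {'resource': 'xsede.stampede2_ibrun',
                            'project' : 'TG-XXXXXXXXX',   # placeholder allocation
                            'queue'   : 'normal',
                            'walltime': 23 * 60,          # stay under the ~24 h cutoff
                            'cpus'    : 272}              # placeholder core count
    appman.workflow = [p]
    appman.run()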

andre-merzky commented 3 years ago

Thanks for the reply! I can't think of a better way right now, really. Is there a way to trade, say, x long-running simulations for x/2 simulations that run half as long?

akashban commented 3 years ago

Yes, I could do that. In my MD simulations, I could reduce the total number of MD steps to half, or double the timestep. Both strategies lead either to simulation failure or to not being able to visualize the scientific properties of interest. The reason my workflow needs to run for over 2 days is that my systems are really big; in order to observe interesting scientific phenomena, the simulations need to run for 2 days.

andre-merzky commented 3 years ago

Both strategies lead either to simulation failure or to not being able to visualize the scientific properties of interest

Degradation of science results is obviously not an acceptable tradeoff :-P We'll discuss this in our group and will try to come up with a solution, but I am not sure if and how quickly we can resolve this...

A couple of questions, if I may:

  1. From which machine and environment do you launch your EnTK script?
  2. Are you able to submit jobs from the Stampede2 login node?
  3. Does your workflow alternate long simulations with analysis steps?

Thanks, Andre.

akashban commented 3 years ago
  1. I am using my desktop in my lab (Ubuntu 16). I use a tmux terminal and launch the EnTK script from there.
  2. Yes, I can use the login node to submit jobs (by launching a Slurm script).
  3. Yes, my workflow does: long simulation -> analysis -> long simulation -> analysis -> and so on (a sketch of this pattern follows below).

Please let me know if you have any more questions. I could share some more specifics if required.
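
A minimal sketch of that simulation/analysis alternation as a single EnTK pipeline; the executables, analysis script and cycle count are placeholders, not the actual workflow:

    from radical.entk import Pipeline, Stage, Task, AppManager

    # Stages in a pipeline run in order, so each analysis stage starts only
    # after the preceding simulation stage has finished.
    p = Pipeline()
    for cycle in range(2):                       # placeholder number of cycles
        sim, t_sim = Stage(), Task()
        t_sim.executable = 'namd2'               # placeholder MD engine
        t_sim.arguments  = ['input.conf']
        sim.add_tasks(t_sim)

        ana, t_ana = Stage(), Task()
        t_ana.executable = 'python3'
        t_ana.arguments  = ['analyse.py']        # hypothetical analysis script
        ana.add_tasks(t_ana)

        p.add_stages([sim, ana])

    appman = AppManager()                        # assumes a local RabbitMQ with defaults
    appman.resource_desc = {'resource': 'xsede.stampede2_ibrun',
                            'project' : 'TG-XXXXXXXXX',   # placeholder allocation
                            'queue'   : 'normal',
                            'walltime': 48 * 60,
                            'cpus'    : 272}              # placeholder core count
    appman.workflow = [p]
    appman.run()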

mturilli commented 3 years ago

Hi @akashban, apologies for the delay in getting back to you on this. Assuming this is still an issue for you, here is a summary and a couple of possible ways forward.

We discussed and tried a couple of ways to mitigate this problem, but unfortunately we did not find a viable solution that can be implemented in a short amount of time. We have a medium-term solution planned, but it will probably only become available at the end of the year.

As mentioned, we may consider two workarounds:

  1. Using checkpoints. Run for 24 hours, checkpoint the simulations, and start from those checkpoints for another 24 hours. This adds queue time and complexity to your application. We would be available to help with your application if you choose this approach.
  2. Running from the login node. This might mitigate the issue, but disconnects might still happen and/or your application might be killed by the sysadmins' scripts that monitor login-node utilization. Probably worth a try.

akashban commented 3 years ago

Hi @mturilli, thank you so much for your detailed inputs. I have used the first approach (using checkpoints) to generate data for our publication; it seemed like the most straightforward workaround. I would need some help with launching EnTK scripts from a login node. Could you please point me to the documentation that describes running EnTK from a login node?

andre-merzky commented 3 years ago

@akashban: on Stampede2, running from the login node is not too different from running from your laptop, with one exception: you would need to run the script in a tmux session, so that the shell does not get closed when you disconnect. Are you familiar with tmux (or screen)?

akashban commented 3 years ago

@andre-merzky: Got it, I can run EnTK scripts in a tmux session from the Stampede2 login node. Will keep you posted on this.

andre-merzky commented 3 years ago

Thanks - let us know if you need further help on this.