radical-cybertools / ExTASY

Gromacs/LSDMap fail on STAMPEDE when num_iterations=5 #144

Closed ashkurti closed 9 years ago

ashkurti commented 9 years ago

The execution takes almost an hour and then ends in an error :(

ExTASY log at: https://gist.github.com/ashkurti/bc7e0492ffa67fb3abba

andre-merzky commented 9 years ago

This looks like another unicode error:

2015:02:09 10:39:03 radical.pilot.MainProcess: [ERROR   ] uworker Thread-3 stopped otransfer OutputFileTransferWorker-2
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/remote/python/2.7.8/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 302, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 202, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/users/ardita/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/extasy/bin/runme.py", line 52, in unit_state_change_cb
    print u"STDERR : {0}".format(unit.stderr)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 307: ordinal not in range(128)
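
For context, the same exception can be reproduced outside radical.pilot: encoding text that contains u'\xe4' with the ASCII codec fails. The sketch below is not the fix that went into ExTASY. `safe_for_stream` is a hypothetical helper, and the code is written for Python 3 although the traceback above comes from Python 2.7.

```python
def safe_for_stream(text, encoding="ascii"):
    """Return `text` with every character the target encoding cannot
    represent replaced by '?', so that printing it cannot raise.

    Hypothetical helper, not part of radical.pilot or ExTASY.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


# Sample text containing the character named in the traceback (u'\xe4').
stderr_text = "P\xe4r"

# Strict ASCII encoding fails in the same way as the callback above.
try:
    stderr_text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The sanitised text can be written to an ASCII-only stream.
print("STDERR : {0}".format(safe_for_stream(stderr_text)))
```

Replacing characters loses information, so this only suits diagnostic output; writing the stream as UTF-8 would be the lossless alternative.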

ashkurti commented 9 years ago

But with num_iterations=3 it did work fine ...

CharlieLaughton commented 9 years ago

I get a similar error with num_iterations=4. Looking on Stampede it seems the error is in the analysis step (lsdm.py). In the corresponding pilot-* subdirectory, the STDERR file contains:

...
 ->  frame   1190 time    0.000      
 ->  frame   1260 time    0.000        ->  frame   1200 time    0.000      
 ->  frame   1330 time    0.000        ->  frame   1300 time    0.000      
 ->  frame   1410 time    0.0 0.000    ->  frame   1400 time    0.000      
 ->  frame   1480 time    0.000      

gcq#253: "Fly to the Court of England and Unfold" (Macbeth, Act 3, Scene 6, William Shakespeare)

Inactive Modules:
  1) gromacs

The following have been reloaded with a version change:
  1) intel/13.0.2.146 => intel/14.0.1.106  2) mvapich2/1.9a2 => mvapich2/2.0b

The following have been reloaded with a version change:
  1) python/2.7.3-epd-7.3.2 => python/2.7.6

Traceback (most recent call last):
  File "lsdm.py", line 384, in <module>
    LSDMap().run()
  File "lsdm.py", line 320, in run
    weights_thread = np.array([self.weights[idx] for idx in self.idxs_thread])
IndexError: index out of bounds
[cli_15]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
[c404-701.stampede.tacc.utexas.edu:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12. MPI process died?
[c404-701.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[c404-701.stampede.tacc.utexas.edu:mpispawn_0][child_handler] MPI process (rank: 15, pid: 22670) exited with status 1
Traceback (most recent call last):
  File "lsdm.py", line 384, in <module>
Traceback (most recent call last):
  File "lsdm.py", line 384, in <module>
Traceback (most recent call last):
  File "lsdm.py", line 384, in <module>
    LSDMap().run()
    LSDMap().run()
  File "lsdm.py", line 349, in run
  File "lsdm.py", line 349, in run
        kernel = self.compute_kernel(comm, npoints_thread, distance_matrix_thread, weights_thread, epsilon_thread)
LSDMap().run()
  File "lsdm.py", line 181, in compute_kernel
  File "lsdm.py", line 349, in run
    kernel = self.compute_kernel(comm, npoints_thread, distance_matrix_thread, weights_thread, epsilon_thread)
  File "lsdm.py", line 181, in compute_kernel
    np.exp(-distance_matrix_thread**2/(2*epsilon_thread[:, np.newaxis].dot(self.epsilon[np.newaxis])))
ValueError: operands could not be broadcast together with shapes (92,1445) (92,1482)
kernel = self.compute_kernel(comm, npoints_thread, distance_matrix_thread, weights_thread, epsilon_thread)
  File "lsdm.py", line 181, in compute_kernel
    np.exp(-distance_matrix_thread**2/(2*epsilon_thread[:, np.newaxis].dot(self.epsilon[np.newaxis])))
ValueError: operands could not be broadcast together with shapes (93,1445) (93,1482)
...

I notice the numbers 1482 and 1445 correspond to the numbers of snapshots in the .gro file and lines in the weights.w file respectively - should these not be the same?
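
A quick way to compare the two counts before the analysis step runs might look like the sketch below. It assumes the standard multi-frame .gro layout (a title line, an atom-count line, one line per atom, then a box line) and that the weights file holds one weight per non-empty line. The function names are hypothetical and are not taken from lsdm.py.

```python
def count_gro_frames(path):
    """Count the frames in a (possibly multi-frame) .gro file.

    Each frame is: 1 title line, 1 atom-count line,
    natoms atom lines, and 1 box-vector line.
    """
    frames = 0
    with open(path) as fh:
        while True:
            title = fh.readline()
            if not title:                  # end of file
                break
            count_line = fh.readline()
            if not count_line.strip():     # trailing blank lines
                break
            natoms = int(count_line)
            for _ in range(natoms + 1):    # atom lines plus the box line
                fh.readline()
            frames += 1
    return frames


def count_weight_lines(path):
    """Count non-empty lines, assuming one weight per line."""
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())


def check_consistency(gro_path, weights_path):
    """Return (nframes, nweights, counts_match)."""
    nframes = count_gro_frames(gro_path)
    nweights = count_weight_lines(weights_path)
    return nframes, nweights, nframes == nweights
```

Calling `check_consistency` on the .gro file and the weights file of the failing iteration (the paths depend on the unit directory) should return False as its third element whenever the counts differ, as they do here with 1482 and 1445.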

jp43 commented 9 years ago

Thank you. Charlie, do you think you could allow access to your radical pilot folder and subdirectories (or copy them somewhere)? If yes, please let me know where they are located. There are a couple of things I need to check to be able to spot the issue. As I told Vivek, the UnicodeEncodeError seems to appear after the analysis step failed, so it might (somehow) just be a consequence of the ValueError you mentioned.

jp43 commented 9 years ago

Sorry, by radical pilot folder I mean the corresponding pilot folder in your sandbox.

CharlieLaughton commented 9 years ago

Hi Jordane,

OK, I have copied the pilot-* folder on Stampede to /scratch/01915/laughton and made it group-readable. Could you please check that you can see it OK?

Best wishes,

Charlie

This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.

Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham.

This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system, you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation.

marksantcroos commented 9 years ago

@vivek-bala: it's Pär Bjelkmar's fault ;-)

vivek-bala commented 9 years ago

The pilot walltime is set to 60 mins and the execution runs longer than that, hence the failure.

2015:02:09 10:25:00 radical.pilot.MainProcess: [INFO    ] ComputePilot '54d88aa2f8cdba2e5183143c' state changed from 'PendingActive' to 'Active'.
..
..
2015:02:09 11:25:07 radical.pilot.MainProcess: [ERROR   ] SAGA job state for ComputePilot 54d88aa2f8cdba2e5183143c is Canceled.
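
The gap between the two log lines above can be checked with a few lines of Python. This is only a sketch: it assumes every log line starts with a 19-character `YYYY:MM:DD HH:MM:SS` stamp, as the two lines quoted here do, and `log_timestamp` is a hypothetical helper.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def log_timestamp(line):
    """Parse the leading 'YYYY:MM:DD HH:MM:SS' stamp of a log line."""
    return datetime.strptime(line[:19], LOG_TIME_FORMAT)


active = ("2015:02:09 10:25:00 radical.pilot.MainProcess: [INFO    ] "
          "ComputePilot '54d88aa2f8cdba2e5183143c' state changed from "
          "'PendingActive' to 'Active'.")
canceled = ("2015:02:09 11:25:07 radical.pilot.MainProcess: [ERROR   ] "
            "SAGA job state for ComputePilot 54d88aa2f8cdba2e5183143c "
            "is Canceled.")

walltime_minutes = 60  # pilot walltime used in this run

elapsed = log_timestamp(canceled) - log_timestamp(active)
exceeded = elapsed.total_seconds() >= walltime_minutes * 60

print(elapsed)   # 1:00:07
print(exceeded)  # True
```

The pilot was cancelled 60 minutes and 7 seconds after becoming active, which matches the 60-minute walltime.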

ashkurti commented 9 years ago

I have tested this scenario again after increasing the pilot walltime to 90 mins. I still encounter problems, but in order to concentrate better on them I have raised another issue: #146

ashkurti commented 9 years ago

Sorry, the previous comment should refer to #137.

CharlieLaughton commented 9 years ago

Hi Jordane,

Have you made any progress in identifying the problem with the Gromacs/LSDMap runs?

Best wishes,

Charlie

jp43 commented 9 years ago

Hi Charlie,

I had a look at the folders that were generated in the pilot folder. I think you are right: the error occurs because, during the LSDMap step of the fifth cycle, the number of lines in the file weight.w and the number of snapshots in the .gro file do not match. I suspect that at some point, when we transfer the output files of one step to the next, we are copying the wrong files or something like that, and this is why it fails. Now that symbolic links and backups are used, I really have trouble understanding what the different unit folders correspond to, and which files are copied from one step to another and how. I talked with Vivek, who told me he was going to look at it. Vivek, do you have any update?

Thank you, Jordane

Jordane PRETO

Rice University, Anderson Biological Lab, room 319 6100 Main street Houston, Texas, 77005-1892

jp43 commented 9 years ago

Hi Charlie,

By the way, have you tried Vivek's recommendation of increasing the walltime?

Thank you, Jordane

vivek-bala commented 9 years ago

It indeed is the wrong weight file being staged into the analysis stage. Working on a fix.

vivek-bala commented 9 years ago

Solved in c3f682d750217bab7ffc35b160601d22cecf8706.

ashkurti commented 9 years ago

@vivek-bala Thank you so much for fixing this.

I cleaned and reinstalled everything from scratch, and ended up with the following extasy version:

[ExTASY-tools] ardita@poirot 278% python -c 'import radical.ensemblemd.extasy as extasy; print extasy.version'
0.1.3-beta-1-g2cdf59c

The workflow still has problems, though. extasy.log is at https://gist.github.com/ashkurti/cc69fb468e466c12625a and the related radical.pilot folder is publicly accessible at /work/02998/ardi/radical.pilot.sandbox/pilot-54ef10bff8cdba43bf3d9b07.

The problem seems to be that the required files are not found, as can be seen at https://gist.github.com/ashkurti/cc69fb468e466c12625a#file-grlsd_numits5-L2429

ashkurti commented 9 years ago

I am also having problems running a gromacs/lsdmap workflow using the default files. No backup folder is created.

extasy.log at https://gist.github.com/ashkurti/b0e86de527b3f749559a

and related radical.pilot folder publicly accessible at /work/02998/ardi/radical.pilot.sandbox/pilot-54ef2700f8cdba52fb1ccf5f

ashkurti commented 9 years ago

Well, in the radical.pilot folders I noticed that there are no folders that relate to computational units:

login3.stampede(40)$ ls pilot-54ef10bff8cdba43bf3d9b07
AGENT.STDERR  AGENT.STDOUT  default_bootstrapper.sh  radical-pilot-agent.py  staging_area  virtualenv-1.9.tar.gz
login3.stampede(41)$ ls pilot-54ef2700f8cdba52fb1ccf5f
default_bootstrapper.sh  radical-pilot-agent.py  staging_area

login3.stampede(42)$ ls pilot-54ef2700f8cdba52fb1ccf5f/staging_area/
config.ini  grompp.mdp  gro.py  post_analyze.py  pre_analyze.py  reweighting.py  run_analyzer.sh  run.py  select.py  spliter.py  topol.top
login3.stampede(43)$ ls pilot-54ef10bff8cdba43bf3d9b07/staging_area/
config.ini  grompp.mdp  gro.py  post_analyze.py  pre_analyze.py  reweighting.py  run_analyzer.sh  run.py  select.py  spliter.py  topol.top

ashkurti commented 9 years ago

This works for me now too! :+1: