radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

TSU, some replicas finished but recognized as "Failed" #43

Closed haoyuanchen closed 8 years ago

haoyuanchen commented 8 years ago

In the running log of repex on my local PC, something like this

[INFO ] ComputeUnit 'unit.000137' state changed to Failed.

showed up several times. However, when I checked the sandbox on Stampede, I found that those replicas had actually finished successfully.

marksantcroos commented 8 years ago

I assume "successfully" is here defined as "there are output files" ?

haoyuanchen commented 8 years ago

Also, repex didn't crash; it kept running and ignored those "failed" replicas in the exchange step, which is what we expected.

However, when I ran TUU on Gordon, some replicas failed, and repex just stalled and eventually crashed.

haoyuanchen commented 8 years ago

@marksantcroos I actually checked the output files and I found that the MD finished successfully.

marksantcroos commented 8 years ago

Is there anything in STDERR (or STDOUT) that might point to why RP thinks it failed? RP looks at the return value of the command line, and that's either 0 or not. There is not much that can go wrong there, especially not on your local PC. Is this an MPI task or a regular task?
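To illustrate that rule, here is a minimal sketch in plain Python (not RP code; the `classify` helper and the "Done" / "Failed" labels are only illustrative). It shows that a command can print output that looks like a successful run and still be classified as failed, because only the exit status is checked:

```python
import subprocess
import sys

def classify(cmd):
    """Return ('Done', 0) if the command exits with 0, else ('Failed', code)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    label = "Done" if result.returncode == 0 else "Failed"
    return label, result.returncode

# Both commands print "finished"; only the exit status differs.
ok = classify([sys.executable, "-c", "print('finished')"])
bad = classify([sys.executable, "-c",
                "import sys; print('finished'); sys.exit(3)"])

print(ok)
print(bad)
```

If the application behaves like the second command, the presence of output files says nothing about the state that gets recorded.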

haoyuanchen commented 8 years ago

The STDERR file for all replicas (whether finished or failed) says

Resetting modules to system default

Some finished replicas have empty STDOUT files while some other finished replicas have non-empty STDOUT files. One example is

Success copying history_name to staging_area!
Got history data for self!
Waiting for replica: 6
Success processing replica: 6
got history data for other replicas in current group!

All failed replicas have non-empty STDOUT files. One example is:

Got history data for self!
Success processing replica: 5
Success processing replica: 6

Each replica uses just 1 core; these are not MPI tasks.

marksantcroos commented 8 years ago

Can it be that the application actually returns non-zero even on success?

What is the return code that is recorded in the RP logs?

haoyuanchen commented 8 years ago

It seems so. Also, the STDOUT files are different among finished replicas and among failed replicas too.

I didn't see any "return codes" in the log. Maybe I need to turn on RADICAL_PILOT_VERBOSE and run again?

marksantcroos commented 8 years ago

It seems so. Also, the STDOUT files are different among finished replicas and among failed replicas too.

Who is the author of the application? Sorry for my ignorance.

I didn't see any "return codes" in the log. Maybe I need to turn on RADICAL_PILOT_VERBOSE and run again?

Yes, please.

antonst commented 8 years ago

showed up several times. However, when I checked the sandbox on Stampede, I found that those replicas had actually finished successfully.

The most likely reason is that output files which are not required for subsequent runs were not obtained / copied. This results in the CU's state being marked as Failed. Can you please include the version tag of the repex code you are using?
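One way to test this hypothesis is to compare the files present in a unit's sandbox with the files the unit was expected to stage out. A minimal sketch, assuming you know the expected file names (the names below are hypothetical placeholders, and a temporary directory stands in for the real sandbox):

```python
import tempfile
from pathlib import Path

def missing_outputs(sandbox, expected):
    """Return the expected file names that are not present in the sandbox."""
    root = Path(sandbox)
    return [name for name in expected if not (root / name).is_file()]

# Demo: a temporary directory plays the role of the unit sandbox.
with tempfile.TemporaryDirectory() as sandbox:
    # Hypothetical file name; only this one is "produced" by the run.
    (Path(sandbox) / "replica_5.rst").write_text("restart data\n")
    missing = missing_outputs(sandbox, ["replica_5.rst", "replica_5.mdinfo"])

print(missing)
```

A non-empty result for a unit whose MD output looks complete would be consistent with a failure during output staging rather than during execution.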

antonst commented 8 years ago

However, when I ran TUU on Gordon, some replica failed and repex just stalled and finally crashed.

At what scale is this behaviour observed on Gordon?

antonst commented 8 years ago

Is there anything in STDERR (or STDOUT) that might point to why RP thinks it fails

I don't think unsuccessful data movement tasks show up in STDERR / STDOUT, do they?

antonst commented 8 years ago

Is this an MPI task or a regular task?

by default TUU does not have any MPI tasks.

antonst commented 8 years ago

Some finished replicas have empty STDOUT files while some other finished replicas have non-empty STDOUT files

Yes, since this is 3d REMD, we have a mix of Amber tasks and exchange tasks.

marksantcroos commented 8 years ago

by default TUU does not have any MPI tasks.

If I only knew what all this TUU and TSU is about :P

marksantcroos commented 8 years ago

I don't think unsuccessful data movement tasks show up in STDERR / STDOUT, do they?

That is true. I actually didn't consider failing output data, but that is a good point. As it currently stands, it's not easy to determine from which state we failed. Something to think about.

andre-merzky commented 8 years ago

The unit's state history should give that info, I think. If that is not sufficiently detailed, we can add more fine-grained data in unit.log. Note though that both can be out of order if multiple update workers are used in the agent, so you need to order by timestamp...
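A minimal sketch of that ordering step, assuming the history is available as a list of records with a state name and a timestamp (the record layout, state names, and timestamps below are made up for illustration and are not taken from RP):

```python
from operator import itemgetter

# Hypothetical state history, deliberately out of order.
history = [
    {"state": "Failed",        "timestamp": 1460000030.2},
    {"state": "Executing",     "timestamp": 1460000010.5},
    {"state": "StagingOutput", "timestamp": 1460000025.0},
    {"state": "Scheduling",    "timestamp": 1460000001.0},
]

def last_state_before_failure(entries):
    """Sort entries by timestamp and return the state preceding 'Failed'."""
    ordered = sorted(entries, key=itemgetter("timestamp"))
    states = [entry["state"] for entry in ordered]
    if "Failed" not in states:
        return None
    idx = states.index("Failed")
    return states[idx - 1] if idx > 0 else None

previous = last_state_before_failure(history)
print(previous)
```

With the sample data, the state preceding the failure is the output staging one, which is the case discussed in this thread.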

marksantcroos commented 8 years ago

The unit's state history should give that info, I think. If that is not sufficiently detailed, we can add more fine-grained data in unit.log. Note though that both can be out of order if multiple update workers are used in the agent, so you need to order by timestamp...

The implicit keyword here was easy ...

marksantcroos commented 8 years ago

the units state history should give that info I think.

And even this is arguable, as we always go into the output staging state regardless of the result of the execution, IIRC.

antonst commented 8 years ago

Can it be that the application actually returns non-zero even on success?

repex - no. amber - I would say no as well, but my experience with amber is limited.

haoyuanchen commented 8 years ago

@AntonsT version is 0.2-feature-tuu-opt5-6616728- On Gordon I saw that happen even with 8 replicas, but not with the current version.

haoyuanchen commented 8 years ago

Yes, since this is 3d REMD, we have a mix of Amber tasks and exchange tasks.

So if the STDOUT is empty, then it's a finished Amber task or exchange task?

antonst commented 8 years ago

So if the STDOUT is empty, then it's a finished Amber task or exchange task?

For an Amber task, STDOUT should typically be empty. BTW, is this problem still popping up? If not can I close the ticket then?

haoyuanchen commented 8 years ago

is this problem still popping up? If not can I close the ticket then?

It seems to be resolved.