A better `--keep-going` behavior

gaow commented 5 years ago

Currently we introduced / replaced --ignore-errors with --keep-going to run, at module level, as much as it can be done for a benchmark with errors somewhere. But this is still limited because it does not go down to module instance level. For example, if we have this benchmark:

A*B*C, A*D*C

then if something is wrong with any instances in B, --keep-going will ensure that A*D*C completes, but not the other non-erroneous instances in B and the corresponding B*C down the road.

This is different from what one might expect for Make's --keep-going switch, and is result of the DSC "feature" that dependencies happen at module level. I do not find it overly restricting in practice, but we might want to improve it later. This will involve some substantial changes and should best be addressed / considered after release 1.0.

stephens999 commented 5 years ago

i'm not sure about bumping this. it seems like a fairly major issue?

On Mon, May 13, 2019 at 3:00 PM gaow notifications@github.com wrote:

Currently we introduced / replaced --ignore-errors with --keep-going to run, at module level, as much as it can be done for a benchmark with errors somewhere. But this is still limited because it does not go down to module instance level. For example, if we have this benchmark:

ABC, ADC

then if something is wrong with any instances in B, --keep-going will ensure that ADC completes, but not the other non-erroneous instances in B and the corresponding B*C down the road.

This is different from what one might expect for Make's --keep-going switch, and is result of the DSC "feature" that dependencies happen at module level. I do not find it overly restricting in practice, but we might want to improve it later. This will involve some substantial changes and should best be addressed / considered after release 1.0.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stephenslab/dsc/issues/194?email_source=notifications&email_token=AANXRRPSSUYMCVLQ23HKOU3PVHCFDA5CNFSM4HMS6YG2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4GTQNOJQ, or mute the thread https://github.com/notifications/unsubscribe-auth/AANXRRIB7D3XZPZJWFTIZATPVHCFDANCNFSM4HMS6YGQ .

pcarbo commented 5 years ago

@gaow What about the simpler pipeline

A*B*C

If one out of 10 instances of module B fails, for example, will dsc continue and run the C module instances?

gaow commented 5 years ago

@pcarbo not with the current implementation. And I see that can look a more serious limitation than I put it (agree with @stephens999 ). A change of behavior will pose challenge to how our DAG is currently designed. However, for the time being, as with Yuxin's application where we implemented a customized R utility function to handle internally return code of command tools and raise exceptions at module instance level, I think there might be a way to generalize this approach to kind of "muffle" errors thus tricking the task engine to think the runs are successful. I've to think more on how to implement it but I feel this might be a viable idea.

pcarbo commented 5 years ago

@gaow My suggestion is to keep the implementation as is, but change the description of the --keep-going flag. Saying "Try to finish as much as possible the entire benchmark" is not accurate. You could say instead something like this: "When an error is encountered in a single module execution, continue running all module instances within the module before stopping."

gaow commented 4 years ago

Finally, I removed --keep-going flag and added a new switch -e for finer control of error behavior:

  -e option             How DSC responds to errors. "abort": stop all modules
                        immediately after an error occurs. "ignore": ignore
                        errors and try to complete the benchmark. "default":
                        stop modules with errors or has errors in their
                        upstream modules. (default: default)

Here is a DSC script to test it:

bad_method: R(if(a>2) stop('error'))
  a: 1,2,3,4
  $out: a

good_method: R()
  a: 1,2,3,4
  $out: a

score: R()
  param: $out
  $res: 0

DSC:
  run: (good_method, bad_method) * score

In this DSC script, the bad_method has 4 module instances but 2 of them will fail. good_method has 4 module instances and they should all work.

`default`

dsc test.dsc -d all && dsc test.dsc

[####] Failed with 6 steps processed (13 jobs completed)
WARNING: Please examine stderr files below and/or run commands in green to reproducethe errors;
additional scripts upstream of the error can be found in test.scripts.html.
===========================================================================
ERROR: [be674fb7-56e9-46af-a643-4ca263544e65]: [bad_method]: [(id=f62ba0b310200d59, index=2)]: 
Failed to execute Rscript .sos/bad_method_2_f9ff1b39.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_3.stderr
---------------------------------------------------------------------------
[(id=f62ba0b310200d59, index=3)]: 
Failed to execute Rscript .sos/bad_method_3_3bf43bb6.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_4.stderr
---------------------------------------------------------------------------
[DSC]: Exits with 2 pending steps (output validation, score in pipeline #2)

the output:

[GW] tree test
test
├── bad_method
│   ├── bad_method_1.rds
│   ├── bad_method_2.rds
│   ├── bad_method_3.stderr
│   ├── bad_method_3.stdout
│   ├── bad_method_4.stderr
│   └── bad_method_4.stdout
├── good_method
│   ├── good_method_1.rds
│   ├── good_method_2.rds
│   ├── good_method_3.rds
│   └── good_method_4.rds
├── score
│   ├── good_method_1_score_1.rds
│   ├── good_method_2_score_1.rds
│   ├── good_method_3_score_1.rds
│   └── good_method_4_score_1.rds
├── test.conf.mpk
├── test.db
└── test.map.mpk

3 directories, 17 files

Here you see there are still some score generated for good_method module even though some bad_method module instances are problematic.

`abort`

dsc test.dsc -d all && dsc test.dsc -e abort

] Failed with 4 steps processed (5 jobs completed)
WARNING: Please examine stderr files below and/or run commands in green to reproducethe errors;
additional scripts upstream of the error can be found in test.scripts.html.
===========================================================================
ERROR: [DSC]: [bad_method]: [(id=f62ba0b310200d59, index=2)]: 
Failed to execute Rscript .sos/bad_method_2_b1f32513.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_3.stderr
---------------------------------------------------------------------------
[index=3]: 
Failed to execute Rscript .sos/bad_method_3_428ac652.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_4.stderr
---------------------------------------------------------------------------
[DSC]: Exits with 3 pending steps (output validation, score in pipeline #2, score in pipeline #1)

and output files:

[GW] tree test
test
├── bad_method
│   ├── bad_method_1.rds
│   ├── bad_method_2.rds
│   ├── bad_method_3.stderr
│   ├── bad_method_3.stdout
│   ├── bad_method_4.stderr
│   └── bad_method_4.stdout
├── good_method
│   ├── good_method_1.stderr
│   ├── good_method_1.stdout
│   ├── good_method_2.stderr
│   ├── good_method_2.stdout
│   ├── good_method_3.stderr
│   ├── good_method_3.stdout
│   └── good_method_4.rds
├── test.conf.mpk
├── test.db
└── test.map.mpk

2 directories, 16 files

You see here there are no score module at all because DSC aborted right after an error occurs.

`ignore`

dsc test.dsc -d all && dsc test.dsc -e ignore

[WARNING: Ignoring error from module bad_method (id=f62ba0b310200d59, index=2): 
Failed to execute Rscript .sos/bad_method_2_b9771ccf.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_3.stderr
---------------------------------------------------------------------------.
WARNING: Ignoring error from module bad_method (id=f62ba0b310200d59, index=3): 
Failed to execute Rscript .sos/bad_method_3_4b72e135.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_4.stderr
---------------------------------------------------------------------------.
#########] 9 steps processed (17 jobs completed)
INFO: DSC complete!
INFO: Elapsed time 17.389 seconds.

you see it is now trying its best to complete everything. The output:

test
├── bad_method
│   ├── bad_method_1.rds
│   ├── bad_method_2.rds
│   ├── bad_method_3.stderr
│   ├── bad_method_3.stdout
│   ├── bad_method_4.stderr
│   └── bad_method_4.stdout
├── good_method
│   ├── good_method_1.rds
│   ├── good_method_2.rds
│   ├── good_method_3.rds
│   └── good_method_4.rds
├── score
│   ├── bad_method_1_score_1.rds
│   ├── bad_method_2_score_1.rds
│   ├── good_method_1_score_1.rds
│   ├── good_method_2_score_1.rds
│   ├── good_method_3_score_1.rds
│   └── good_method_4_score_1.rds
├── test.conf.mpk
├── test.db
└── test.map.mpk

3 directories, 19 files

For score it has complete results for good_method but only 2 out of 4 results for bad_method.

pcarbo commented 4 years ago

  -e option             How DSC responds to errors. "abort": stop all modules
                        immediately after an error occurs. "ignore": ignore
                        errors and try to complete the benchmark. "default":
                        stop modules with errors or has errors in their
                        upstream modules. (default: default)

@gaow Should "modules" read "module instances" here? Since "bad_method" is a module, I would expect no bad_method outputs to be generated with -e default since an error was generated in one of the module instances.

It would be better if the default option were not simply "default". (Suppose we decided to change the default setting---it wouldn't be good if the option "default" was not the default option.) What about we change the settings to something like "stop-all", "ignore", "stop-errors-only".

gaow commented 4 years ago

Should "modules" read "module instances" here?

yes -- will change that

What about we change the settings to something like "stop-all", "ignore", "stop-errors-only".

Sure. But these candidates are not idea. Default behavior is to "keep going as much as possible while NOT skipping errors" ... What is your other suggestions? keep-going seems reasonable? Or safe-ignore? That means ignore the errors only for branches on DAG that don't have an error; this is more stringent than ignore which will ignore all errors

pcarbo commented 4 years ago

Does the behaviour align well with the GNU Make settings? If so, adopting their terms would be a good approach.

gaow commented 4 years ago

Our default is indeed GNU Make's keep-going so we can use that term. But GNU Make --ignore is not very useful. Our --ignore is GNU Make's -i -k options. that is why I avoided aligning with their terminology.

pcarbo commented 4 years ago

--ignore+keep-going?

gaow commented 4 years ago

So you do want to align with GNU Make? I think this is only going to create more confusion as people will have to understand GNU Make's terminology first. I think defining our own convention calling them ignore and safe-ignore is better.

pcarbo commented 4 years ago

Yes, that's what I'm saying—I think the analogy is useful enough here. And GNU Make is so widely used and well-documented that there are benefits to doing so.

But, yes, we must also make sure that we document these options well enough ourselves so that people don't have to review the GNU Make documentation as well (and you have already done this).

stephenslab / dsc