Closed gaow closed 4 years ago
I'm not sure about bumping this. It seems like a fairly major issue?
On Mon, May 13, 2019 at 3:00 PM gaow notifications@github.com wrote:
We recently introduced --keep-going (replacing --ignore-errors) to run, at the module level, as much of a benchmark as possible when it contains errors somewhere. But this is still limited, because it does not go down to the module-instance level. For example, if we have this benchmark:
A*B*C, A*D*C
then if something is wrong with any instance in B, --keep-going will ensure that A*D*C completes, but not the other non-erroneous instances in B or the corresponding B*C steps down the road.
This is different from what one might expect from Make's --keep-going switch, and is a result of the DSC "feature" that dependencies happen at the module level. I do not find it overly restrictive in practice, but we may want to improve it later. This will involve some substantial changes and is best addressed / considered after release 1.0.
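To make the module-level limitation concrete, here is a toy scheduler (my own sketch in Python, not DSC's actual implementation) showing why a single failing instance in B blocks all of B*C while A*D*C still completes:

```python
# Toy illustration of module-level --keep-going: a failure in ANY
# instance of module B marks the whole module as failed, so everything
# downstream of B is skipped, while the independent A*D*C chain runs.
def run_benchmark(pipelines, failing):
    """pipelines: list of module chains; failing: {module: set of failed instance indices}."""
    completed = []
    for chain in pipelines:
        blocked = False
        for module in chain:
            if blocked:
                continue  # everything downstream of a failed module is skipped
            if failing.get(module):
                blocked = True  # module-level dependency: the whole module "fails"
            else:
                completed.append(module)
    return completed

# A*B*C has an error in one instance of B; A*D*C is clean.
done = run_benchmark([["A", "B", "C"], ["A", "D", "C"]], {"B": {2}})
print(done)  # ['A', 'A', 'D', 'C']
```

An instance-level scheduler would instead skip only the downstream work of the failed instances, which is the improvement discussed in this thread.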
@gaow What about the simpler pipeline
A*B*C
If one out of 10 instances of module B fails, for example, will dsc continue and run the C module instances?
@pcarbo Not with the current implementation. And I see that it can look like a more serious limitation than I made it out to be (I agree with @stephens999). A change of behavior would pose a challenge to how our DAG is currently designed. However, for the time being, as with Yuxin's application, where we implemented a customized R utility function to internally handle the return codes of command-line tools and raise exceptions at the module-instance level, I think there might be a way to generalize this approach to "muffle" errors, thus tricking the task engine into thinking the runs were successful. I have to think more about how to implement it, but I feel this might be a viable idea.
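The "muffling" idea could look roughly like the following (a hypothetical Python sketch of the concept; the actual utility mentioned above was an R function, and the names here are made up for illustration):

```python
# Hypothetical sketch of "muffling" a module-instance error: run the
# instance body, and on failure return a sentinel result instead of
# raising, so the task engine records the run as successful.
def muffle(instance_body, *args):
    try:
        return {"ok": True, "value": instance_body(*args)}
    except Exception as err:  # turn the failure into data
        return {"ok": False, "error": str(err)}

def bad_method(a):
    # mirrors the R body `if (a > 2) stop('error')` from the test script below
    if a > 2:
        raise RuntimeError("error")
    return a

results = [muffle(bad_method, a) for a in [1, 2, 3, 4]]
print([r["ok"] for r in results])  # [True, True, False, False]
```

Downstream modules would then need a convention for recognizing the sentinel results and skipping (or specially scoring) them.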
@gaow My suggestion is to keep the implementation as is, but change the description of the --keep-going flag. Saying "Try to finish as much as possible the entire benchmark" is not accurate. You could instead say something like this: "When an error is encountered in a single module execution, continue running all module instances within the module before stopping."
Finally, I removed the --keep-going flag and added a new switch, -e, for finer control of error behavior:
-e option How DSC responds to errors. "abort": stop all modules
immediately after an error occurs. "ignore": ignore
errors and try to complete the benchmark. "default":
stop modules with errors or has errors in their
upstream modules. (default: default)
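Based on my reading of the three modes, their effect on which downstream "score" instances get run can be approximated with a toy model (my own Python sketch under stated assumptions, not DSC's code):

```python
# Toy model of the three -e modes, deciding which downstream "score"
# instances run for each upstream method module (a sketch, not DSC's code).
def plan(methods, mode):
    """methods: {name: [True/False per instance]} -> {name: scored instance indices}."""
    any_error = any(False in results for results in methods.values())
    planned = {}
    for name, results in methods.items():
        ok = [i for i, r in enumerate(results) if r]
        if mode == "abort" and any_error:
            planned[name] = []   # an error anywhere halts the whole benchmark
        elif mode == "default" and False in results:
            planned[name] = []   # a module with an error blocks its own downstream
        else:
            planned[name] = ok   # score the instances that produced output
    return planned

methods = {"good_method": [True] * 4, "bad_method": [True, True, False, False]}
print(plan(methods, "default"))  # good_method scored fully; bad_method blocked
print(plan(methods, "abort"))    # nothing scored at all
print(plan(methods, "ignore"))   # score whatever produced output
```

The test script and output trees below are the ground truth; this sketch just summarizes the pattern they show.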
Here is a DSC script to test it:
bad_method: R(if (a > 2) stop('error'))
  a: 1,2,3,4
  $out: a

good_method: R()
  a: 1,2,3,4
  $out: a

score: R()
  param: $out
  $res: 0

DSC:
  run: (good_method, bad_method) * score
In this DSC script, bad_method has 4 module instances, 2 of which will fail; good_method has 4 module instances that should all succeed.
default
dsc test.dsc -d all && dsc test.dsc
[####] Failed with 6 steps processed (13 jobs completed)
WARNING: Please examine stderr files below and/or run commands in green to reproduce the errors;
additional scripts upstream of the error can be found in test.scripts.html.
===========================================================================
ERROR: [be674fb7-56e9-46af-a643-4ca263544e65]: [bad_method]: [(id=f62ba0b310200d59, index=2)]:
Failed to execute Rscript .sos/bad_method_2_f9ff1b39.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_3.stderr
---------------------------------------------------------------------------
[(id=f62ba0b310200d59, index=3)]:
Failed to execute Rscript .sos/bad_method_3_3bf43bb6.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_4.stderr
---------------------------------------------------------------------------
[DSC]: Exits with 2 pending steps (output validation, score in pipeline #2)
The output:
[GW] tree test
test
├── bad_method
│ ├── bad_method_1.rds
│ ├── bad_method_2.rds
│ ├── bad_method_3.stderr
│ ├── bad_method_3.stdout
│ ├── bad_method_4.stderr
│ └── bad_method_4.stdout
├── good_method
│ ├── good_method_1.rds
│ ├── good_method_2.rds
│ ├── good_method_3.rds
│ └── good_method_4.rds
├── score
│ ├── good_method_1_score_1.rds
│ ├── good_method_2_score_1.rds
│ ├── good_method_3_score_1.rds
│ └── good_method_4_score_1.rds
├── test.conf.mpk
├── test.db
└── test.map.mpk
3 directories, 17 files
Here you see that score results are still generated for the good_method module, even though some bad_method module instances are problematic.
abort
dsc test.dsc -d all && dsc test.dsc -e abort
] Failed with 4 steps processed (5 jobs completed)
WARNING: Please examine stderr files below and/or run commands in green to reproduce the errors;
additional scripts upstream of the error can be found in test.scripts.html.
===========================================================================
ERROR: [DSC]: [bad_method]: [(id=f62ba0b310200d59, index=2)]:
Failed to execute Rscript .sos/bad_method_2_b1f32513.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_3.stderr
---------------------------------------------------------------------------
[index=3]:
Failed to execute Rscript .sos/bad_method_3_428ac652.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_4.stderr
---------------------------------------------------------------------------
[DSC]: Exits with 3 pending steps (output validation, score in pipeline #2, score in pipeline #1)
and output files:
[GW] tree test
test
├── bad_method
│ ├── bad_method_1.rds
│ ├── bad_method_2.rds
│ ├── bad_method_3.stderr
│ ├── bad_method_3.stdout
│ ├── bad_method_4.stderr
│ └── bad_method_4.stdout
├── good_method
│ ├── good_method_1.stderr
│ ├── good_method_1.stdout
│ ├── good_method_2.stderr
│ ├── good_method_2.stdout
│ ├── good_method_3.stderr
│ ├── good_method_3.stdout
│ └── good_method_4.rds
├── test.conf.mpk
├── test.db
└── test.map.mpk
2 directories, 16 files
You see here that there are no score outputs at all, because DSC aborted right after the error occurred.
ignore
dsc test.dsc -d all && dsc test.dsc -e ignore
[WARNING: Ignoring error from module bad_method (id=f62ba0b310200d59, index=2):
Failed to execute Rscript .sos/bad_method_2_b9771ccf.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_3.stderr
---------------------------------------------------------------------------.
WARNING: Ignoring error from module bad_method (id=f62ba0b310200d59, index=3):
Failed to execute Rscript .sos/bad_method_3_4b72e135.R
exitcode=1, workdir=/home/gaow/Documents/TempDir/03-Dec-2019, stderr=test/bad_method/bad_method_4.stderr
---------------------------------------------------------------------------.
#########] 9 steps processed (17 jobs completed)
INFO: DSC complete!
INFO: Elapsed time 17.389 seconds.
You see it now tries its best to complete everything. The output:
test
├── bad_method
│ ├── bad_method_1.rds
│ ├── bad_method_2.rds
│ ├── bad_method_3.stderr
│ ├── bad_method_3.stdout
│ ├── bad_method_4.stderr
│ └── bad_method_4.stdout
├── good_method
│ ├── good_method_1.rds
│ ├── good_method_2.rds
│ ├── good_method_3.rds
│ └── good_method_4.rds
├── score
│ ├── bad_method_1_score_1.rds
│ ├── bad_method_2_score_1.rds
│ ├── good_method_1_score_1.rds
│ ├── good_method_2_score_1.rds
│ ├── good_method_3_score_1.rds
│ └── good_method_4_score_1.rds
├── test.conf.mpk
├── test.db
└── test.map.mpk
3 directories, 19 files
For score, there are complete results for good_method but only 2 out of 4 results for bad_method.
-e option How DSC responds to errors. "abort": stop all modules
immediately after an error occurs. "ignore": ignore
errors and try to complete the benchmark. "default":
stop modules with errors or has errors in their
upstream modules. (default: default)
@gaow Should "modules" read "module instances" here? Since bad_method is a module, I would expect no bad_method outputs to be generated with -e default, given that an error occurred in one of its module instances.
It would be better if the default option were not simply called "default". (Suppose we decided to change the default setting; it wouldn't be good if the option "default" were not the default option.) What if we changed the settings to something like "stop-all", "ignore", and "stop-errors-only"?
Should "modules" read "module instances" here?
yes -- will change that
What about we change the settings to something like "stop-all", "ignore", "stop-errors-only".
Sure. But these candidates are not ideal. The default behavior is to "keep going as much as possible while NOT skipping errors" ... What are your other suggestions? keep-going seems reasonable? Or safe-ignore? That would mean ignoring errors only for branches of the DAG that don't have an error; this is more stringent than ignore, which ignores all errors.
Does the behaviour align well with the GNU Make settings? If so, adopting their terms would be a good approach.
Our default is indeed GNU Make's keep-going, so we can use that term. But GNU Make's --ignore-errors is not very useful. Our ignore is GNU Make's -i and -k options combined; that is why I avoided aligning with their terminology.
--ignore + keep-going?
So you do want to align with GNU Make? I think this is only going to create more confusion, as people will have to understand GNU Make's terminology first. I think defining our own convention, calling them ignore and safe-ignore, is better.
Yes, that's what I'm saying: I think the analogy is useful enough here. And GNU Make is so widely used and well documented that there are benefits to doing so.
But, yes, we must also make sure that we document these options well enough ourselves so that people don't have to review the GNU Make documentation as well (and you have already done this).