rddlsim blows its stack when running PROST examples standalone

I raised this in the PROST github but @thomaskeller79 suggested I post it here.

I find that certain things that run OK under PROST do not run for me when I do a single standalone rddlsim run. If you have rddlsim and prost checked out under a common parent directory, the following exceeds the maximum stack depth:

        ./run rddl.sim.Simulator ../prost/testbed/benchmarks/push-your-luck-2018 rddl.policy.RandomBoolPolicy push-your-luck_inst_mdp__01

Thomas suggested I check out ippc2018 in rddlsim to see if it changed the outcome, but as far as I could tell, it didn't.

The same was true for wildfire-2018. Both of these domains use a trick that turns interm-fluents into state-fluents, with the interm levels encoded as explicit enumerated values.

@thomaskeller79 thought perhaps there is a bad interaction with rddl.policy.RandomBoolPolicy?

Just a small comment: the rddl.competition.Server component can handle the instance, which is why I assume it's not a syntax issue in the RDDL file but has something to do with the RandomBoolPolicy.

So, I just used RandomBoolPolicy based on some other examples. What would be the appropriate command to run it the same way that prost uses it, but just a single execution? Or is the issue that if you aren't feeding in commands through the server instance, you need some other policy to tell it how to decide on actions to take?

I feel like there is a disconnect between the way rddlsim runs standalone and the way it runs via prost: some things that prost handles fine cause invariant failures in rddlsim alone, and vice versa. I think the problem is that the invariant checks consider only the current state. For instance, when picking a random action and a random value for that action, rddlsim would try the action and, if it fails, back up and try another. If it can pick any possible action/value over a long horizon, and every exception forces it to back up and try the next option, the result is a deep stack and potentially exponential runtime. With prost I have the opposite problem: when it decides that something leads down a bad path, it aborts instead of dropping that path and considering only the other options. You've said that if it goes down a bad path, the domain is badly formed. Is there documentation or anything else that better explains how to form such paths properly?
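To make the suspected stack problem concrete, here is a minimal Python sketch. It is not how rddlsim is actually implemented; `is_legal` is a hypothetical applicability check and the action names are made up. It contrasts retrying by recursion, where every rejected sample adds a stack frame (neither Python nor the JVM eliminates tail calls), with retrying in a loop, where the stack stays flat and a bounded number of tries turns a state with no applicable action into a reportable failure instead of a crash.

```python
import random
import sys


def sample_recursive(actions, is_legal, rng):
    """Retry by recursion: every rejected sample adds one stack frame."""
    action = rng.choice(actions)
    if is_legal(action):
        return action
    return sample_recursive(actions, is_legal, rng)  # no tail-call elimination


def sample_iterative(actions, is_legal, rng, max_tries=10_000):
    """Retry in a loop: constant stack depth and a bounded number of tries."""
    for _ in range(max_tries):
        action = rng.choice(actions)
        if is_legal(action):
            return action
    return None  # no legal action found: report it instead of crashing


if __name__ == "__main__":
    rng = random.Random(0)
    actions = ["roll", "stop", "proceed-interm-level"]  # hypothetical action names
    never_legal = lambda action: False  # a state where no action is applicable
    print(sample_iterative(actions, never_legal, rng))  # None
    try:
        sample_recursive(actions, never_legal, rng)
    except RecursionError:
        print("recursive retry hit the recursion limit of", sys.getrecursionlimit())
```

The sketch only illustrates the stack depth; backtracking over whole action sequences across a long horizon would add the exponential runtime on top.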

I've seen examples in a couple of domains that added the intermediate fluent manually, with things like this:

        // enforce proceed-interm-level at all levels but @level0
        (current-level == @level1) => proceed-interm-level;

        // forbid proceed-interm-level at all other levels but @level0
        proceed-interm-level => (current-level == @level1);

But when I've tried things like this for parameterized actions, it hasn't gone well at all. Prost complains about errors at parse time that are actually cases where every action violates at least one precondition. I added the change to get a human-readable version of the preconditions, plus my own changes to print "action N failed precondition M", so I can see which preconditions are tripping it up, but not which state it is in when it decides to fail. That is, rddlsim will say "invariant failed, and the current state is x := true and y := false", but prost simply says "nope, not gonna happen".
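The kind of report described here can be prototyped outside prost. This is a minimal Python sketch, assuming the preconditions are available as named boolean predicates over a state dictionary; `explain_state`, the actions `a1`/`a2` and the precondition names are all hypothetical and not part of prost. For one state, it prints the state and, per action, which preconditions that action violates.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
NamedCheck = Tuple[str, Callable[[State], bool]]  # (precondition name, predicate)


def explain_state(state: State, preconditions: Dict[str, List[NamedCheck]]) -> List[str]:
    """Report the state, then which named preconditions each action violates in it."""
    shown = ", ".join(f"{var} := {str(val).lower()}" for var, val in sorted(state.items()))
    report = [f"state: {shown}"]
    for action, checks in preconditions.items():
        failed = [name for name, holds in checks if not holds(state)]
        if failed:
            report.append(f"{action} violates: {', '.join(failed)}")
        else:
            report.append(f"{action} is applicable")
    return report


if __name__ == "__main__":
    # Hypothetical domain: two actions over boolean state-fluents x and y.
    preconditions = {
        "a1": [("needs-y", lambda s: s["y"])],
        "a2": [("needs-not-x", lambda s: not s["x"]), ("needs-y", lambda s: s["y"])],
    }
    for line in explain_state({"x": True, "y": False}, preconditions):
        print(line)
```

Run as a script, this prints the state line followed by `a1 violates: needs-y` and `a2 violates: needs-not-x, needs-y`, i.e. a dead-end state together with the reason each action is ruled out.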

You mentioned -log VERBOSE as an option, by the way, and maybe that would help here, but I couldn't figure out where to put that option so that I saw a difference in the output. Where does the option go?

Thanks much.

And yeah, I see I made this comment in the rddlsim issue, when it would be more appropriate on the prost GitHub or by email... whoops.

> So, I just used RandomBoolPolicy based on some other examples. What would be the appropriate command to run it the same way that prost uses it, but just a single execution? Or is the issue that if you aren't feeding in commands through the server instance, you need some other policy to tell it how to decide on actions to take?

I assume that the RandomBoolPolicy samples a random action, ignoring action applicability. I don't know anything about the other policies implemented in rddlsim, @ssanner can you say something about that?

> ... You've said if it goes down a bad path, it is a badly formed domain. Is there documentation or anything that can better explain how one forms such paths properly?

No, there is no such documentation. Prost expects that there is at least one applicable action in every reachable state. This assumption is made because it is unclear what should happen in a state without an applicable action (there are different possible semantics for this case, and no one ever defined one because it wasn't necessary). As a general rule, you can avoid the problem by adding a dummy action to your domain that is applicable whenever no other action is applicable. Of course, this is easier said than done: it requires that you know the set of states where no action is applicable, and that this set can be described compactly with a logical formula.
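One way to find the states that would need such a dummy action is to search the reachable state space. The following is a minimal Python sketch on a hypothetical toy domain (a counter with actions `inc` and `reset`), not how prost works internally: it assumes deterministic successors and a state space small enough to enumerate, which real RDDL domains rarely are (for stochastic domains you would follow every successor with nonzero probability). The dummy action's precondition is the negation of the disjunction of all other preconditions, which is exactly the formula that may be hard to write compactly in RDDL.

```python
from collections import deque


def dead_ends(init, actions):
    """Breadth-first search over reachable states; return states with no applicable action.

    `actions` maps an action name to a (precondition, successor) pair of functions.
    Successors are deterministic here for simplicity.
    """
    seen, queue, stuck = {init}, deque([init]), []
    while queue:
        state = queue.popleft()
        successors = [succ(state) for pre, succ in actions.values() if pre(state)]
        if not successors:
            stuck.append(state)
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return stuck


def with_dummy(actions):
    """Add a no-op whose precondition is: no other action is applicable."""
    others = [pre for pre, _ in actions.values()]
    dummy = (lambda s: not any(pre(s) for pre in others), lambda s: s)
    return {**actions, "dummy": dummy}


if __name__ == "__main__":
    # Hypothetical toy domain: a counter that can only be reset from 1.
    toy = {
        "inc": (lambda n: n < 3, lambda n: n + 1),
        "reset": (lambda n: n == 1, lambda n: 0),
    }
    print(dead_ends(0, toy))              # [3]
    print(dead_ends(0, with_dummy(toy)))  # []
```

State 3 is reachable but has no applicable action; once the dummy no-op is added, no reachable state is a dead end.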

> I've seen examples in the couple of domains that added the intermediate fluent manually, with things like this:
>
>         // enforce proceed-interm-level at all levels but @level0
>         (current-level == @level1) => proceed-interm-level;
>
>         // forbid proceed-interm-level at all other levels but @level0
>         proceed-interm-level => (current-level == @level1);

These are necessary because of the interm-fluent compilation. If you have more than 1 level because you have interm-fluents at higher levels, it is probably best to replace these with:

        (current-level ~= @level0) => proceed-interm-level;
        proceed-interm-level => (current-level ~= @level0);

With these, you basically say that in every state where current-level is different from @level0, you have to apply the artificial proceed-interm-level action (which exists only to allow the evaluation of interm-fluents as state-fluents). If all your other constraints only talk about states where (current-level == @level0), and you make sure that there is an applicable action for every possible assignment of all other variables, you should be fine.
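A quick way to sanity-check constraints like these is to enumerate every joint boolean action at every level and confirm that at least one survives. This is a minimal brute-force sketch in Python, not RDDL and not prost's own applicability test; it assumes three levels, one hypothetical real action `move` that is only allowed at @level0, and that the all-false (no-op) joint action counts as a candidate.

```python
from itertools import product

LEVELS = [0, 1, 2]                          # @level0, @level1, @level2 (hypothetical)
ACTIONS = ["proceed-interm-level", "move"]  # "move" stands in for a real action


def implies(a, b):
    return (not a) or b


def legal(level, act):
    """The two proceed-interm-level constraints plus one hypothetical domain constraint."""
    proceed, move = act["proceed-interm-level"], act["move"]
    return (implies(level != 0, proceed)       # (current-level ~= @level0) => proceed-interm-level
            and implies(proceed, level != 0)   # proceed-interm-level => (current-level ~= @level0)
            and implies(move, level == 0))     # hypothetical: real actions only at @level0


def legal_joint_actions(level):
    """Enumerate all joint boolean actions (including all-false) and keep the legal ones."""
    joint = (dict(zip(ACTIONS, bits)) for bits in product([False, True], repeat=len(ACTIONS)))
    return [act for act in joint if legal(level, act)]


if __name__ == "__main__":
    for level in LEVELS:
        print(level, legal_joint_actions(level))
```

With these constraints, every level other than @level0 leaves exactly one joint action (proceed-interm-level alone), while @level0 leaves the no-op and `move`. If another constraint at @level0 ruled out both, states at that level would have no applicable action.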

> ... That is, rddlsim will say "invariant failed, and the current state is x := true and y := false" but prost simply says "nope, not gonna happen".

I agree this can be useful. Feel free to open an issue for prost that covers this (don't make it too general; describe exactly the case where prost crashes because there is no applicable action, and then there is a chance that I will find the time to actually implement it).

> You mentioned -log VERBOSE as an option, by the way, and maybe that would help here, but I didn't figure out a place to put that option such that I saw a difference in the output. Where does the option go?

E.g.:

        ./prost.py elevators_inst_mdp__1 [PROST -log VERBOSE -s 1 -se [IPC2014]]

Note that this only affects logging in the search component; we want to add it to the parser in issue 114.



> I assume that the RandomBoolPolicy samples a random action, ignoring action applicability. I don't know anything about the other policies implemented in rddlsim, @ssanner can you say something about that?

Yeah, that is the part that confuses me. It seems like in the vanilla rddlsim, it will try a boolean generator like that, and if it fails, it will back out. With Prost, it feels like you assume that if you go down a path where you then find yourself unable to generate an action that satisfies the preconditions, you abort.

> ... As a general rule, you can avoid this by adding a dummy action to your domain that is applicable whenever no other action is applicable. ...

I actually tried such a dummy action, but it didn't help. Again, more likely operator error than anything else.

> ... These are necessary because of the interm-fluent compilation. If you have more than 1 level because you have interm-fluents at higher levels, it is probably best to replace these with:
>
>         (current-level ~= @level0) => proceed-interm-level;
>         proceed-interm-level => (current-level ~= @level0);

Yeah, I basically did that. Is the notion that proceed-interm-level is a default action, such that it's not good enough to say "don't do real actions unless @level0, but do proceed-interm at other levels so you have something to do"?

> I agree this can be useful. Feel free to open an issue for prost that handles this ...

OK, thanks.

...

> E.g., ./prost.py elevators_inst_mdp__1 [PROST -log VERBOSE -s 1 -se [IPC2014]]

Thanks.

ssanner/rddlsim issue #11