Help with reinforcement learning and rule templates

leolellisr commented 2 months ago

Hello everyone, everything good?

Can you help me?

I'm trying to implement reinforcement learning and rule templates with jsoar.

However, I am unable to get past initialization.

Could you check my code to see if I'm doing something wrong?

Is there any example of using rule templates?

Thank you for your attention!

Code:

rl --set learning on # enable RL
indifferent-selection -g # use epsilon-greedy decision making
indifferent-selection --epsilon 0.1 # 10% deviation from greedy

# init
sp {marta*propose*init-rl
    (state <s> ^superstate nil
              -^name)
-->
    (<s> ^operator <o> +)
    (<o> ^name init-rl)
}

sp {marta*apply*init-rl
    (state <s> ^operator.name init-rl)
-->
    (<s> ^action A0
         ^reward-link <rl>
         ^value 1)
    (<rl> ^reward <rw>)
    (<rw> ^value 1)
}

# rule template
sp {marta*rule*template 
    :template 
    (state <s> ^operator <o> + 
               ^reward-link <rl>)
    (<rl> ^reward <rw>)
    (<rw> ^value <v>)
    --> 
    (<s> ^operator <o> = <v>) 
}

marinier commented 2 months ago

Your operator proposal rule marta*propose*init-rl tests that there is no name (meaning, it will unmatch when name is added to the state), but your apply rule marta*apply*init-rl does not create a name on the state. Thus your proposal rule never unmatches and you get stuck in an operator no-change impasse.

Additionally, you should not be creating a reward-link on the state in your apply rule. The reward-link already exists and is created by Soar.

You could try changing your apply rule like this:

sp {marta*apply*init-rl
    (state <s> ^operator.name init-rl
               ^reward-link <rl>)
-->
    (<s> ^action A0
         ^value 1
         ^name myName)
    (<rl> ^reward <rw>)
    (<rw> ^value 1)
}

When I run your program now, it applies the rule once and then state no-changes, which is expected since you have no other operators. It also gives the warning "Ignoring rl*marta*rule*template*2 because it is a duplicate of rl*marta*rule*template*1". This is normal, because the template rule tries to generate the same rule twice.

However, I think these rules are still not what you want. Because the value the template tests doesn't exist until after the apply rule fires, marta*rule*template doesn't fire until after the operator is selected. But it is a preference rule, and it is supposed to influence whether the operator is selected in the first place.
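As a rough sketch of what that could look like (not a tested agent): the template below only tests things that already exist while the operator is still a candidate, so the RL rules it generates can affect selection. The operator name act, the rule names, and the use of ^action and ^value on the state as "available action" and "state feature" are assumptions made up for illustration, and the apply rule for act is left out.

```soar
# Hypothetical proposal: one candidate operator per ^action value on the state.
# It cannot match before init-rl has been applied, because ^action does not exist yet.
sp {marta*propose*act
    (state <s> ^action <a>)
-->
    (<s> ^operator <o> +)
    (<o> ^name act
         ^action <a>)
}

# Template: tests only the state feature and the proposed operator,
# both of which are present before the decision is made.
# Each new combination of <v> and <a> generates one RL rule,
# starting with a numeric-indifferent value of 0.
sp {marta*rl*act
    :template
    (state <s> ^value <v>
               ^operator <o> +)
    (<o> ^name act
         ^action <a>)
-->
    (<s> ^operator <o> = 0)
}
```

Note that the template does not test the reward-link at all. Reward is written to the reward-link separately, and Soar uses it to update the numeric values of the generated RL rules.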

Note you will be able to get a lot more help from people who have used Soar's RL much more recently than me if you email the Soar help mailing list. See how here: https://soar.eecs.umich.edu/SoarSupport/MailingLists

leolellisr commented 2 months ago

@marinier thank you for the reply.

I think I understand why it was stuck before: since apply*init-rl never put a name on the state, it kept going back to the propose rule, right?

But I think I still don't understand very well how rule templates work.

Giving a little more context, I have an agent with 20 actions. I would like, for each action, to create a rule template that checks the current state data and then proposes the RL operator. Could you shed some light on this?

I have requested to join the mailing list. When I get approved, I will also send an email to the suggested list. Thank you very much!

My current code:

rl --set learning on # enable RL
indifferent-selection -g # use epsilon-greedy decision making
indifferent-selection --epsilon 0.1 # 10% deviation from greedy

# init
sp {marta*propose*init-rl
    (state <s> ^superstate nil
              -^name)
-->
    (<s> ^operator <o> +)
    (<o> ^name init-rl)
}

sp {marta*apply*init-rl
    (state <s> ^operator.name init-rl
               ^reward-link <rl>)
-->
    (<s> ^action A0
         ^value 1
         ^name action0)
    (<rl> ^reward <rw>)
    (<rw> ^value 1)
}

# rule template
sp {marta*rule*template 
    :template 
    (state <s> ^operator <o> + 
               ^reward-link <rl>)
    (<rl> ^reward <rw>)
    (<rw> ^value <v>)
    --> 
    (<s> ^operator <o> = <v>) 
}

soartech / jsoar

Help with reinforcement learning and rule templates #142