minerllabs / minerl

MineRL Competition for Sample Efficient Reinforcement Learning - Python Package
http://minerl.io/docs/

How to customize the reward function with reward handler? #742

Closed huangdi95 closed 5 months ago

huangdi95 commented 6 months ago

Hi, I want to customize the reward function. I modified the create_rewardables method in HumanSurvival like this:

from typing import List

from minerl.herobraine.env_specs.human_controls import HumanControlEnvSpec
from minerl.herobraine.hero import handlers
from minerl.herobraine.hero.handler import Handler


class HumanSurvival(HumanControlEnvSpec):
    def __init__(self, *args, load_filename=None, **kwargs):
        if "name" not in kwargs:
            kwargs["name"] = "MineRLHumanSurvival-v0"

    # ......

    def create_rewardables(self) -> List[Handler]:
        return [
            handlers.RewardForCollectingItems([
                dict(type="log", amount=1, reward=1.0),
            ])
        ]

This is supposed to give a reward of 1.0 for each "log" collected.

I run the agent with the original code of VPT's run_agent.py. The model is 2x.model and the weights are rl-from-foundation-2x.weights, both downloaded from the VPT repository.

However, the reward is always 0 even though the agent collects tons of logs.

I notice that in _multiagent.py, where the environment is stepped, the reward comes straight from the Malmo socket reply and is always 0, and there is a TODO saying the reward handlers are not used:

comms.send_message(instance.client_socket, step_message.encode())

# Receive the observation.
obs = comms.recv_message(instance.client_socket)

# Receive reward done and sent.
reply = comms.recv_message(instance.client_socket)
reward, done, sent = struct.unpack("!dbb", reply)

# TODO: REFACTOR TO USE REWARD HANDLERS INSTEAD OF MALMO REWARD.
done = (done == 1)

So are the reward handlers not supported yet? If so, how should I customize the reward function?

Could someone help me with this?

Miffyli commented 5 months ago

Heya. This is unfortunately a bit of a messy/confusing part of the code. Those reward handlers are indeed not implemented in the newest version of MineRL (1.0.x); they are remnants left over from forking the earlier versions, which do fully support them. Your best bet is not to modify the MineRL environment itself, but to add environment wrappers on top of it / compute your own reward signals from the observations.
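For example, a minimal sketch of such a wrapper (the env id and the "inventory"/"log" observation keys below are illustrative assumptions, so adjust them to whatever your environment actually exposes) could look like this:

import gym

class LogCollectionReward(gym.Wrapper):
    """Give +1.0 reward for each newly collected log, computed from observations."""

    def __init__(self, env, item="log", reward_per_item=1.0):
        super().__init__(env)
        self.item = item
        self.reward_per_item = reward_per_item
        self._last_count = 0

    def _count(self, obs):
        # Fall back to 0 if the env does not report this item.
        return int(obs.get("inventory", {}).get(self.item, 0))

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self._last_count = self._count(obs)
        return obs

    def step(self, action):
        obs, _env_reward, done, info = self.env.step(action)
        count = self._count(obs)
        # Reward only increases in the item count, so crafting/consuming items is ignored.
        reward = self.reward_per_item * max(0, count - self._last_count)
        self._last_count = count
        return obs, reward, done, info

You would then wrap the env created in run_agent.py, e.g. env = LogCollectionReward(gym.make("MineRLHumanSurvival-v0")), and use the wrapper's reward instead of the one returned by MineRL.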

huangdi95 commented 5 months ago

Thank you for your reply!

I see. I found that (1) the agent start handler works, so I can customize the agent's inventory, while the reward and quit handlers fail; (2) the reward handler is passed to Malmo in the mission XML like this:

......
<AgentStart>
    <LowLevelInputs>true</LowLevelInputs>
    <GuiScale>1.0</GuiScale>
    <GammaSetting>2.0</GammaSetting>
    <FOVSetting>70.0</FOVSetting>
    <FakeCursorSize>16</FakeCursorSize>
    <Inventory>
        <InventoryObject slot="0" type="dirt" quantity="1"/>
        <InventoryObject slot="1" type="diamond_axe" quantity="1"/>
        <InventoryObject slot="2" type="crafting_table" quantity="1"/>
        <InventoryObject slot="3" type="stick" quantity="1"/>
        <InventoryObject slot="4" type="oak_log" quantity="1"/>
    </Inventory>
</AgentStart>

<AgentHandlers>
    ......

    <!-- Rewards -->
    <RewardForCraftingItem Sparse="true">
        <Item reward="20" type="wooden_pickaxe" amount="1"/>
    </RewardForCraftingItem>

    <!-- Additional Agent Handlers like quitting -->
    <AgentQuitFromPossessingItem>
        <Item type="stone_pickaxe" amount="1"/>
    </AgentQuitFromPossessingItem>
</AgentHandlers>
......

but nothing happens and I can't get the expected reward. I'm confused about why AgentStart works while the Reward and Quit handlers fail (I have tried different handlers such as RewardForPossessingItem and RewardForCollectingItem).

Basically, I want to finetune the VPT base model with RL. Can you give me some suggestions on this? Can I solve this problem by downgrading minerl to v0.4.4 (and can VPT run on v0.4.4)? Or should I just wrap my own reward functions around the environment?

Miffyli commented 5 months ago

The start inventory handler is probably still supported in the code, which is why it works, but the reward handlers are not (i.e., none of the reward handler specs defined by Malmö work, as v1.x does not use the Malmö code at all).

I'd implement reward signals outside the MineRL code by observing how the obs or info entries change (they contain lots of information about distance travelled and items obtained). If you want more fine-grained control, however, check out MineDojo, which is based on the MineRL v0.4.x version but gives you more control over things.
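As a rough sketch of that approach, the XML handlers above could be approximated outside MineRL like this (the info["stats"] and obs["inventory"] keys are assumptions, so check what your MineRL version actually reports and adjust accordingly):

import gym

class CraftRewardAndQuit(gym.Wrapper):
    """Approximate RewardForCraftingItem and AgentQuitFromPossessingItem outside MineRL."""

    def __init__(self, env):
        super().__init__(env)
        self._crafted = 0

    def reset(self, **kwargs):
        self._crafted = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, _env_reward, done, info = self.env.step(action)

        # +20 for every wooden pickaxe crafted since the last step (assumed stats key).
        crafted = int(info.get("stats", {}).get("craft_item", {}).get("wooden_pickaxe", 0))
        reward = 20.0 * max(0, crafted - self._crafted)
        self._crafted = crafted

        # End the episode once a stone pickaxe is in the inventory (assumed obs key).
        if int(obs.get("inventory", {}).get("stone_pickaxe", 0)) >= 1:
            done = True

        return obs, reward, done, info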

huangdi95 commented 5 months ago

Got it. I'll give it a try. Thank you!