Hi @yj-Tang,
thanks for pointing out this issue.
Could you provide me with an example script?
First of all, you can turn off the limit checking by setting the constructor argument observation_checking to False:
https://github.com/maxspahn/gym_envs_urdf/blob/4b80c3146624cafc56e01ed1158c5f5ab167bd54/urdfenvs/urdf_common/urdf_env.py#L64C1-L65C1
The velocity you pass as the action is a reference velocity that the physics engine's PID controller tries to track, with torques as the underlying input, so the measured velocity will never exactly match the commanded action.
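To illustrate both points, here is a minimal sketch. It assumes the UrdfEnv constructor accepts observation_checking as in the linked line, that set_spaces() can be called without adding goals or sensors, and that the joint state exposes a 'velocity' entry next to the 'position' entry:

```python
import numpy as np
from urdfenvs.urdf_common.urdf_env import UrdfEnv
from urdfenvs.robots.generic_urdf import GenericUrdfReacher

# Same point-robot setup as in the examples, but with observation checking
# disabled so a velocity that slightly overshoots the limits does not abort.
robots = [GenericUrdfReacher(urdf="pointRobot.urdf", mode="vel")]
env = UrdfEnv(dt=0.01, robots=robots, render=False, observation_checking=False)
env.set_spaces()

ob, info = env.reset(pos=np.array([0.0, 0.1, 0.0]))
action = env.action_space.high  # command the maximum admissible velocity
ob, reward, terminated, truncated, info = env.step(action)

# The reported joint velocity is whatever the PID controller achieved via
# torques, so it will generally not equal the commanded reference exactly.
# (Assumes a 'velocity' entry under joint_state; adjust to your observation layout.)
print("commanded velocity:", action)
print("measured velocity :", ob['robot_0']['joint_state']['velocity'])
env.close()
```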
Best, Max.
Thanks for your reply. Here is the script, revised from https://github.com/maxspahn/gym_envs_urdf/blob/develop/examples/point_robot_reward.py:
import numpy as np
from urdfenvs.urdf_common.urdf_env import UrdfEnv
from urdfenvs.robots.generic_urdf import GenericUrdfReacher
from urdfenvs.sensors.full_sensor import FullSensor
from urdfenvs.scene_examples.goal import goal1
from urdfenvs.urdf_common.reward import Reward

N_STEPS = 1000


class InverseDistanceDenseReward(Reward):
    def calculate_reward(self, observation: dict) -> float:
        goal = observation['robot_0']['FullSensor']['goals'][1]['position']
        position = observation['robot_0']['joint_state']['position']
        return 1.0 / np.linalg.norm(goal - position)


robots = [
    GenericUrdfReacher(urdf="pointRobot.urdf", mode="vel"),
]
env = UrdfEnv(
    dt=0.01,
    robots=robots,
    render=True,
)
env.add_goal(goal1)

# %%
sensor = FullSensor(['position'], ['position', 'size'], variance=0.0)
env.add_sensor(sensor, [0])
env.set_spaces()
env.set_reward_calculator(InverseDistanceDenseReward())
defaultAction = np.array([4.5, -4.0, 4.0])
pos0 = np.array([0.0, 0.1, 0.0])
vel0 = np.array([1.0, 0.0, 0.0])

# %%
ob, info = env.reset(pos=pos0, vel=vel0)
env.shuffle_goals()
print(f"Initial observation : {ob}")
gain_action = 1
history = []
for i in range(N_STEPS):
    # Simple p-controller for goal reaching
    # goal = ob['robot_0']['FullSensor']['goals'][1]['position']
    # robot_position = ob['robot_0']['joint_state']['position']
    # action = gain_action * (goal - robot_position)
    clipped_action = np.clip(defaultAction, env.action_space.low, env.action_space.high)
    ob, reward, terminated, truncated, info = env.step(clipped_action)
    print(f"Reward : {reward}")
    # In observations, information about obstacles is stored in ob['obstacleSensor']
    history.append(ob)
    print(f"Step {i} : {ob}")
    if terminated or truncated:
        break
env.close()
Thanks for pointing out that the "physics engine's PID controller will try to match [the reference velocity] using torques as the underlying input". My solution now is to clip the planned action to a slightly smaller range than the action limits.
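Concretely, something like the following sketch; the margin value is my own choice, and the limits here are stand-ins for env.action_space.low / env.action_space.high from the script above:

```python
import numpy as np

margin = 0.05  # shrink the admissible range a bit so the tracked velocity stays inside the limits
planned_action = np.array([4.5, -4.0, 4.0])

# Stand-in limits; in the script these come from env.action_space.low / env.action_space.high.
action_low = np.array([-10.0, -10.0, -10.0])
action_high = np.array([10.0, 10.0, 10.0])

clipped_action = np.clip(planned_action, action_low + margin, action_high - margin)
print(clipped_action)
```

The margin only needs to cover the tracking overshoot of the controller, so a small value relative to the limits should be enough.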
That seems like a reasonable approach. Otherwise, you can also rely on not checking the observation limits, or at least on not breaking when a violation is encountered.
Can you close the issue if you are fine with my response?
Problem is solved, thank you!
Hi Max, I'm currently working on training a reinforcement learning policy with urdfenvs and have run into unexpected behavior. Here's what I've applied and observed:
I would expect the observed velocity to match the clipped input velocity. Is this the intended behavior? Or do you have any suggestions on how to keep the next observation within the observation limits? Looking forward to your reply. Thanks in advance.
Best regards, Yujie