ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] ARS not respecting gym Box bounds in training or testing #29259

Open andrew-thought opened 2 years ago

andrew-thought commented 2 years ago

What happened + What you expected to happen

I am working with FinRL-Meta (link in the reproduction project below). I wanted to try RLlib's ARS implementation with the same codebase, but the ARS model produces actions outside my environment's defined action_space in both training and testing. (I use compute_single_action() for testing; I am not sure which method ARS uses to produce continuous actions during training: compute_actions(), or perhaps a sampler function.)

In this case, the environment defines action_space:

self.action_space = spaces.Box(low=-3, high=3, shape=(len(self.assets),)) # len(self.assets) always equals 1 currently

We sometimes see actions as far out-of-bounds as ± 60.

Additionally, ARS starts with actions around 0.0, but the actions grow toward the bounds and then exceed them, until roughly 99% of all actions are out of bounds as training progresses.

We expect ARS actions to respect the Box action space range.

What I have tried:

- normalize_actions = True and False: no change.
- Different exploration functions, and disabling exploration entirely: no change.
- Passing clip_actions = True and False to compute_actions(): no change.
- Upgrading Ray to 2.0.0: no change.
- PPO with the same environment: PPO respects the Box bounds.
- Adding unsquash code to ARS's compute_single_action(): since ARS is probably not normalizing the action space, this didn't appear to work (a clipping workaround sketch follows this list).
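For reference, a minimal sketch of clipping actions defensively inside the environment itself, assuming a standard gym.Env wrapped with gym.Wrapper (the wrapper name is illustrative); this is a workaround sketch, not a fix for ARS's behavior:

    import gym
    import numpy as np

    class ClipActionsWrapper(gym.Wrapper):  # hypothetical wrapper, for illustration only
        """Clip incoming actions to the wrapped environment's Box bounds."""

        def step(self, action):
            # Clip whatever the algorithm sends to the declared Box range
            # before the environment uses it.
            clipped = np.clip(action, self.action_space.low, self.action_space.high)
            return self.env.step(clipped)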

We see that ARS overrides compute_single_action() and probably does not contain the normalize, unsquash, or clip code that acts on those configuration flags. Since ARS does not override compute_actions(), which does contain the normalize, unsquash, and clip code, ARS is probably not using it for training either. We tried to confirm this during training, but we were not able to debug into the workers.
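For context, here is a minimal sketch of what the unsquash-then-clip post-processing for a Box space conceptually does (an illustration of the idea, not RLlib's actual code); this is the step that the ARS path appears to skip:

    import numpy as np

    def unsquash_and_clip(action, low, high):
        # Treat `action` as lying in the normalized range [-1, 1], rescale it
        # linearly into [low, high], then clip to stay inside the bounds.
        squashed = np.clip(action, -1.0, 1.0)
        unsquashed = low + (squashed + 1.0) * (high - low) / 2.0
        return np.clip(unsquashed, low, high)

    # With the Box(low=-3, high=3) space above: unsquash_and_clip(0.5, -3.0, 3.0) -> 1.5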

Sample of training output:

    (Worker pid=176481) NEW STEP>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    (Worker pid=176481) x: 9.205153
    (Worker pid=176481) action: 2
    (Worker pid=176481) OOB
    (Worker pid=176481) NEW STEP>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    (Worker pid=176481) x: 6.871564
    (Worker pid=176481) action: 2
    (Worker pid=176481) OOB
    (Worker pid=176481) NEW STEP>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    (Worker pid=176481) x: 7.683341
    (Worker pid=176481) action: 2
    (Worker pid=176481) OOB
    (Worker pid=176481) NEW STEP>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    (Worker pid=176481) x: 8.150882
    (Worker pid=176481) action: 2
    (Worker pid=176481) OOB
    (Worker pid=176481) NEW STEP>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    (Worker pid=176481) x: 7.7060976
    (Worker pid=176481) action: 2
    (Worker pid=176481) OOB
    (Worker pid=176481) NEW STEP>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    (Worker pid=176481) x: 4.4323587
    (Worker pid=176481) action: 2
    (Worker pid=176481) OOB

Versions / Dependencies

- ray = 1.12.0
- gym = 0.21.0
- python = 3.7

Reproduction script

I have a reproduction project (https://github.com/imnotpete/ARS-OOB-Reproduction), in a notebook. It is based on my latest pull of FinRL-Meta, without any further changes.

Issue Severity

High: It blocks me from completing my task.

ArturNiederfahrenhorst commented 2 years ago

Possibly related

imnotpete commented 2 years ago

Yes, these are related.

InTheta commented 1 year ago

Are there any updates on this? I am using PPO and I am losing my mind over this.

    self.action_space = spaces.Dict({
        'type': spaces.Discrete(self.discrete_features_count),  # specific action type
        'value1': spaces.Box(low=0.01, high=1.0, shape=(1,), dtype=np.float32),
        'value2': spaces.Discrete(300)
    })

    def step(self, action):
        try:
            reward = 0

            # Clip the action to be within the valid range
            action['type'] = np.clip(action['type'], 0, self.action_space['type'].n - 1).astype(int)
            action['value1'] = np.clip(action['value1'], 0.01, 1.0)

            # Extract the parts of the action
            normal_action = action['type']
            self.value1 = action['value1'][0]  
            self.value2 = action['value2']

I always get this error at the end:

        raise e.with_traceback(filtered_tb) from None
      File "/home/will/miniconda3/envs/ray-tensor/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7262, in raise_from_not_ok_status
        raise core._status_to_exception(e) from None  # pylint: disable=protected-access
    tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__SparseSoftmaxCrossEntropyWithLogits_device_/job:localhost/replica:0/task:0/device:CPU:0}} Received a label value of 300 which is outside the valid range of [0, 300). Label values: 300 [Op:SparseSoftmaxCrossEntropyWithLogits]

It seems to be mixing the index with the discrete value? Please help!?!
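For what it's worth, Discrete(300) only accepts the values 0 through 299, so a label of 300 is out of range by exactly one. A quick sanity check using only the gym API (nothing RLlib-specific is assumed here):

    import gym  # or `import gymnasium as gym`, depending on the installed version

    sp = gym.spaces.Discrete(300)
    print(sp.contains(299))  # True:  valid values are 0 .. 299
    print(sp.contains(300))  # False: 300 is one past the end, matching the error above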

andreipauliuc-ads commented 10 months ago

Hello, are there any updates on this issue? I get the same error with PPO and cannot figure it out. In my case the action space is gym.spaces.Discrete(8), and training always stops due to this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of 8 which is outside the valid range of [0, 8). Label values: 8 [[{{node default_policy_wk1/SparseSoftmaxCrossEntropyWithLogits_18/SparseSoftmaxCrossEntropyWithLogits}}]]
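One way to narrow down where the out-of-range value enters the pipeline (purely a debugging sketch; the wrapper name is made up and this is not a known fix) is to fail fast whenever the environment receives an action outside its declared space:

    import gym

    class ActionRangeCheck(gym.Wrapper):  # hypothetical debugging wrapper
        """Raise immediately if the policy sends an action outside the declared space."""

        def step(self, action):
            assert self.action_space.contains(action), f"Out-of-range action: {action!r}"
            return self.env.step(action)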