Ask GPT-4 for revision:
Constant:
• task_obs_code_string: the complete environment code.
• task_description: The drone must complete a three-stage movement. It starts at point (0, 0, 0), where a red ball is placed for easier visualization. In stage 1, the drone heads straight towards a blue ball at (2, 2, 2). In stage 2, the drone performs a two-flip rotation to turn its heading 180 degrees.
Initialize:
You are a reward engineer trying to write reward functions that solve reinforcement learning tasks as effectively as possible.
Your goal is to write a reward function for the environment that will help the agent learn the task described in the text. If necessary, please also provide modifications to other parts of the code.
Your reward function should use useful variables from the environment as inputs. As an example,
the reward function signature can be:
def calculate_metrics(self) -> None:
Since the reward function will be decorated with @torch.jit.script,
please make sure that the code is compatible with TorchScript (e.g., use torch tensors instead of NumPy arrays).
Make sure any new tensor or variable you introduce is on the same device as the input tensors.
For quaternion conversion, you should use the official functions:
{quat_functions}
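The concrete {quat_functions} are injected into the prompt elsewhere, so any specific call to them here would be invented. As a stand-in, below is a minimal pure-torch sketch of quaternion-based heading shaping for the 180-degree flip stage; the (w, x, y, z) component ordering, the helper name, and the temperature value are all assumptions, not part of the original prompt.

```python
import torch

def heading_alignment(quats: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: rotate the body-frame x-axis into the world
    # frame using the drone's orientation quaternion ((w, x, y, z) order
    # assumed) and compare it with the desired post-flip heading (-x).
    w, x, y, z = quats.unbind(-1)
    # First column of the rotation matrix = world-frame body x-axis.
    forward = torch.stack([1 - 2 * (y * y + z * z),
                           2 * (x * y + w * z),
                           2 * (x * z - w * y)], dim=-1)
    desired = torch.tensor([-1.0, 0.0, 0.0], device=quats.device)
    alignment = (forward * desired).sum(-1)  # cosine similarity in [-1, 1]
    heading_temp = 0.5  # temperature for the exponential transform
    return torch.exp((alignment - 1.0) / heading_temp)  # in (0, 1]
```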
The Python environment is {task_obs_code_string}. Write a reward function for the following task: {task_description}. The current reward function is: {cur_reward_func}
The final lines of the reward function should consist of two items:
(1) the computation of the total reward, and
(2) a dictionary of each individual reward component and any raw information, which are appended to tracking lists.
The code output should be formatted as a python code string: "```python ... ```".
Some helpful tips for writing the reward function code:
(1) You may find it helpful to normalize the reward to a fixed range by applying transformations like torch.exp to the overall reward or its components.
(2) If you choose to transform a reward component, then you must also introduce a temperature parameter inside the transformation function; this parameter must be a named variable in the reward function, not an input variable. Each transformed reward component should have its own temperature variable, for tuning purposes.
(3) Make sure the type of each input variable is correctly specified; a float input variable should not be annotated as torch.Tensor.
(4) Most importantly, the reward code's input variables must contain only attributes of the provided environment class definition (namely, variables prefixed with self.). Under no circumstance can you introduce new input variables. A sketch illustrating these tips is shown below.
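As a concrete illustration of tips (1)–(4), here is a minimal sketch of a compliant reward function for stage 1 of the drone task. The input names root_positions and root_linvels, the component weights, and the temperature values are hypothetical; the real function must read only self.-prefixed attributes defined in {task_obs_code_string}. A scripted free function is shown because @torch.jit.script applies most naturally to free functions; inside calculate_metrics it would be called with the environment's buffers.

```python
import torch
from typing import Dict, Tuple

@torch.jit.script
def compute_reward_sketch(root_positions: torch.Tensor,
                          root_linvels: torch.Tensor
                          ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]:
    # Stage-1 target: the blue ball at (2, 2, 2), created on the same
    # device as the input tensors, per the device requirement above.
    target_pos = torch.tensor([2.0, 2.0, 2.0], device=root_positions.device)

    # Distance component, squashed into (0, 1] with torch.exp (tip 1),
    # with its own named temperature parameter (tip 2).
    distance = torch.norm(root_positions - target_pos, p=2, dim=-1)
    distance_temp = 0.5
    distance_reward = torch.exp(-distance / distance_temp)

    # Mild speed shaping to discourage overshooting the target.
    speed = torch.norm(root_linvels, p=2, dim=-1)
    speed_temp = 2.0
    speed_reward = torch.exp(-speed / speed_temp)

    # (1) the total reward
    total_reward = distance_reward + 0.1 * speed_reward
    # (2) the per-component dictionary to be appended to tracking lists
    components = {
        "distance_reward": distance_reward,
        "speed_reward": speed_reward,
        "raw_distance": distance,
    }
    return total_reward, components
```

Note how each transformed component carries its own named temperature variable, and how the final lines produce exactly the two required items: the total reward and the dictionary of individual components and raw values.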
Run into an error:
Executing the reward function code above has the following error: {traceback_msg}. Please fix the bug and provide a new, improved reward function!
For improvement:
We trained an RL policy using the provided reward function code and tracked the values of the individual reward components, as well as global policy metrics such as success rates and episode lengths, after every {epoch_freq} epochs, together with the maximum, mean, and minimum values encountered:
Please carefully analyze the policy feedback and provide a new, improved reward function that can better solve the task. Some helpful tips for analyzing the policy feedback:
(1) If the success rates are always near zero, then you must rewrite the entire reward function
(2) If the values for a certain reward component are near identical throughout, then this means RL is not able to optimize this component as it is written. You may consider
(a) Changing its scale or the value of its temperature parameter
(b) Re-writing the reward component
(c) Discarding the reward component
(3) If some reward component's magnitude is significantly larger than the others', then you must re-scale its value to a proper range (see the sketch after this list)
Please analyze each existing reward component in the suggested manner above first, and then write the reward function code.
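As an illustration of tip (3) above, one way to rescale a dominant component is to replace its raw value with a temperature-controlled exponential. The names and values here are hypothetical, not taken from the original prompt:

```python
import torch

def rescale_distance(distance: torch.Tensor) -> torch.Tensor:
    # Before: summing the raw distance (potentially many metres) dwarfs
    # the other components, so the policy optimizes this term alone:
    #   reward = -distance + orientation_reward + ...
    # After: squash it into (0, 1] and expose a temperature to tune.
    distance_temp = 2.0  # raise to spread the gradient over a wider range
    return torch.exp(-distance / distance_temp)
```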