ur-whitelab / LLMs-in-science


New paper: Beyond Human Preferences: Exploring Reinforcement Learning Trajectory #7

Open maykcaldas opened 3 months ago

maykcaldas commented 3 months ago

Paper: Beyond Human Preferences: Exploring Reinforcement Learning Trajectory

Authors: Zichao Shen, Tianchen Zhu, Qingyun Sun, Shiqi Gao and Jianxin Li

Abstract: Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks due to the difficulty in designing comprehensive and precise reward functions. This inherent difficulty curtails the broader application of RL within game environments characterized by diverse constraints. Preference-based reinforcement learning (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals, thereby circumventing the need for meticulous reward engineering. However, obtaining preference data from human experts is costly and inefficient, especially under conditions marked by complex constraints. To tackle this challenge, we propose an LLM-enabled automatic preference generation framework named LLM4PG, which harnesses the capabilities of large language models (LLMs) to abstract trajectories, rank preferences, and reconstruct reward functions to optimize conditioned policies. Experiments on tasks with complex language constraints demonstrated the effectiveness of our LLM-enabled reward functions, accelerating RL convergence and overcoming stagnation caused by slow or absent progress under original reward structures. This approach mitigates the reliance on specialized human knowledge and demonstrates the potential of LLMs to enhance RL's effectiveness in complex environments in the wild.

Link: https://arxiv.org/abs/2406.19644
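
To make the pipeline described in the abstract more concrete, here is a minimal, hypothetical sketch of an LLM4PG-style loop: abstract trajectories into summaries, have an LLM judge rank trajectory pairs against a language constraint, and fit a Bradley-Terry reward model from the resulting preferences. None of the function names, features, or the stubbed `llm_rank_pair` judge come from the paper; a real system would prompt an actual LLM and then train the policy with the learned reward.

```python
# Hypothetical sketch of LLM-driven preference-based reward learning,
# in the spirit of the LLM4PG pipeline summarized in the abstract.
import numpy as np

rng = np.random.default_rng(0)

def summarize_trajectory(traj):
    """Abstract a raw trajectory (list of (state, action) pairs) into a short
    textual summary an LLM could judge. Here: a trivial feature string."""
    visited = {s for s, _ in traj}
    return f"visited {len(visited)} distinct states in {len(traj)} steps"

def llm_rank_pair(summary_a, summary_b, constraint):
    """Hypothetical stand-in for an LLM judge: returns 1 if trajectory A better
    satisfies the language constraint, else 0. This stub simply prefers the
    summary reporting more distinct states."""
    count = lambda s: int(s.split()[1])
    return 1 if count(summary_a) >= count(summary_b) else 0

def features(traj):
    """Simple trajectory feature vector for a linear reward model."""
    visited = {s for s, _ in traj}
    return np.array([len(visited), len(traj)], dtype=float)

def fit_reward(pairs, prefs, lr=0.1, epochs=200):
    """Fit a linear Bradley-Terry reward r(traj) = w . phi(traj) from pairwise
    preferences via logistic regression on feature differences."""
    w = np.zeros(2)
    for _ in range(epochs):
        for (ta, tb), y in zip(pairs, prefs):
            d = features(ta) - features(tb)   # phi(A) - phi(B)
            p = 1.0 / (1.0 + np.exp(-w @ d))  # P(A preferred over B)
            w += lr * (y - p) * d             # gradient ascent step
    return w

# Toy data: random trajectories over 5 states and 3 actions.
def random_traj(length):
    return [(int(rng.integers(5)), int(rng.integers(3))) for _ in range(length)]

constraint = "explore as many distinct states as possible"
pairs = [(random_traj(8), random_traj(8)) for _ in range(50)]
prefs = [llm_rank_pair(summarize_trajectory(a), summarize_trajectory(b), constraint)
         for a, b in pairs]

w = fit_reward(pairs, prefs)
print("learned reward weights:", w)  # this reward would then guide standard RL training
```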

Reasoning: Let's think step by step in order to produce the is_lm_paper. We start by examining the title and abstract. The title mentions "Reinforcement Learning Trajectory," which suggests a focus on reinforcement learning (RL). The abstract discusses challenges in RL, specifically in designing reward functions and using human preferences. It introduces a framework named LLM4PG that uses large language models (LLMs) to generate preferences and optimize policies. The key point here is the use of LLMs to enhance RL, indicating that the paper involves language models in a significant way.