ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Feature] Add support for GAIL/AIRL #19394

Open wpwpxpxp opened 2 years ago

wpwpxpxp commented 2 years ago

Search before asking

Description

Create an agent for Generative Adversarial Imitation Learning (GAIL) and/or Adversarial Inverse Reinforcement Learning (AIRL), which would integrate imitation learning and reinforcement learning into one trainer.

Use case

Learn decision and control behaviors from demonstrations by training a discriminator with standard SGD and a generator with RL algorithms.
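
To make the use case concrete, here is a minimal PyTorch sketch of the adversarial loop such a trainer would wrap: a discriminator updated with plain SGD/BCE on expert vs. policy transitions, and a reward-relabeling step whose output drives the generator's RL update. All names here (Discriminator, discriminator_step, relabel_rewards) are illustrative, not existing RLlib APIs.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores (state, action) pairs; trained toward 1 for expert data, 0 for policy data."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def logits(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def discriminator_step(disc, optim, expert_obs, expert_act, policy_obs, policy_act):
    """One supervised (BCE) update: expert transitions labeled 1, policy transitions labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc.logits(expert_obs, expert_act)
    policy_logits = disc.logits(policy_obs, policy_act)
    loss = bce(expert_logits, torch.ones_like(expert_logits)) \
         + bce(policy_logits, torch.zeros_like(policy_logits))
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


@torch.no_grad()
def relabel_rewards(disc, obs, act, style="gail"):
    """Replace environment rewards in the generator's sample batch with learned rewards.

    GAIL: r = -log(1 - D)        ("fool the discriminator")
    AIRL: r = log D - log(1 - D) (exactly the discriminator logit)
    """
    logits = disc.logits(obs, act)
    if style == "airl":
        return logits
    d = torch.sigmoid(logits).clamp(max=1.0 - 1e-6)
    return -torch.log1p(-d)
```

After each round of discriminator updates, the relabeled batch would be handed to any RL algorithm (e.g. PPO) to update the generator, which is the split between "any supervised approach for the discriminator" and "any RL algorithm for the generator" discussed below.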

Related issues

No response

Are you willing to submit a PR?

heng2j commented 2 years ago

Hi @wpwpxpxp, thank you for adding this issue and for being willing to submit a PR. How's the progress on this new feature?

wpwpxpxp commented 2 years ago

Hi,

Thanks for getting back to me!!

I'm working on it, and I'm trying to make the training of the two models (discriminator and generator) in the AIRL algorithm separate and parallel, so that in the future we have the flexibility to use any RL algorithm to train the generator and any supervised learning approach to train the discriminator.

However, I've been stuck with an issue in Ray for a couple of days.

The policy in Ray is built under tf graphs and sessions. I cannot figure out how the sessions and graphs are used/called when the policy is being trained. This leads to my issue.

Basically, I have a class, RewardRecalculte(), which is a copy of the discriminator model; its weights are updated periodically with the weights of the original discriminator model. RewardRecalculte() is used to recalculate the reward (and thus the advantage) in the sample batch used to train the generator/policy.

The problem is that when I add RewardRecalculte() as a mixin in build_tf_policy(), RewardRecalculte() and the policy are built under the same graph, but the error "tensorflow.python.framework.errors_impl.InvalidArgumentError: Tensor default_policy/action_logp_1:0, specified in either feed_devices or fetch_devices was not found in the Graph" is raised when the policy is trained.

Could you please advise how I should handle the session and graph during training if I want to add a mixin class (e.g. a neural network model) to the policy (e.g. PPOTFPolicy)? Or could you suggest whom I should ask about such questions? Thanks a lot!

-- Best regards, Pin Wang
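
One way to sidestep the cross-graph error described above is to keep the reward-recalculation model in its own tf.Graph with its own tf.compat.v1.Session, sync its weights as plain numpy arrays, and call it from the policy's postprocess_fn before advantages are computed. The sketch below assumes the TF1-era build_tf_policy API and is not the fix adopted in this thread; RewardRelabeler and recalc_rewards_postprocess are illustrative names, while "obs", "actions", and "rewards" are the standard SampleBatch column names.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_v2_behavior()  # TF1-style graph/session semantics, as used by RLlib's TF policies of that era

class RewardRelabeler:
    """Discriminator copy kept in its *own* graph and session, so its ops never
    collide with the graph that build_tf_policy() constructs for the policy."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        self._graph = tf1.Graph()
        with self._graph.as_default():
            self._obs = tf1.placeholder(tf.float32, [None, obs_dim])
            self._act = tf1.placeholder(tf.float32, [None, act_dim])
            x = tf1.layers.dense(tf.concat([self._obs, self._act], -1), hidden, tf.nn.tanh)
            x = tf1.layers.dense(x, hidden, tf.nn.tanh)
            self._logit = tf1.layers.dense(x, 1)[:, 0]
            self._vars = tf1.trainable_variables()
            # Pre-built assign ops so syncing weights doesn't grow the graph.
            self._phs = [tf1.placeholder(v.dtype.base_dtype, v.shape) for v in self._vars]
            self._assigns = [v.assign(ph) for v, ph in zip(self._vars, self._phs)]
            self._sess = tf1.Session(graph=self._graph)
            self._sess.run(tf1.global_variables_initializer())

    def set_weights(self, weights):
        """Periodically sync numpy weights copied out of the trained discriminator."""
        self._sess.run(self._assigns, dict(zip(self._phs, weights)))

    def reward(self, obs, act):
        """AIRL-style learned reward: the discriminator logit, log D - log(1 - D)."""
        return self._sess.run(self._logit, {self._obs: obs, self._act: act})


def recalc_rewards_postprocess(relabeler, sample_batch):
    """Overwrite the 'rewards' column of a SampleBatch; intended to run inside the
    policy's postprocess_fn *before* advantages/GAE are computed."""
    sample_batch["rewards"] = relabeler.reward(
        np.asarray(sample_batch["obs"], np.float32),
        np.asarray(sample_batch["actions"], np.float32),
    )
    return sample_batch
```

Keeping the model out of the policy's graph avoids feeding or fetching tensors from one graph through another graph's session, which is the kind of mix-up the InvalidArgumentError above points at.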

zbzhu99 commented 2 years ago

Hi @wpwpxpxp, I am also interested in implementing adversarial imitation learning algorithms in rllib. I wonder if you are still working on it and what the current status is. If you want, I would be glad to participate in the development. You can reach me via zbzhu99@icloud.com. Thank you!

wpwpxpxp commented 2 years ago

Hi @zbzhu99,

Thanks for reaching out to me. Yes, I'm still working on it, and have finished the first version. It can run under some gym environments. It's great that you're interested in it, and I'm happy to collaborate on it.

acxz commented 2 years ago

This may be slightly off topic, but I have also become interested in inverse/imitation RL techniques and found the following Python package: imitation, which is based on stable-baselines3.

From their readme:

Currently, we have implementations of Behavioral Cloning, DAgger (with synthetic examples), density-based reward modeling, Maximum Causal Entropy Inverse Reinforcement Learning, Adversarial Inverse Reinforcement Learning, Generative Adversarial Imitation Learning and Deep RL from Human Preferences.

It seems like BC already exists in rllib and @wpwpxpxp is working on AIRL. This leaves DAgger, density-based reward modeling, Maximum Causal Entropy Inverse RL, and GAIL for anyone to tackle.

@zbzhu99 maybe you want to tackle GAIL, since you mentioned interest in adversarial algorithms? I'll be tackling Maximum Causal Entropy IRL. @wpwpxpxp, if you could share your current progress, that would be wonderful.

Edit: Other less popular codebases: imitiation-learning and ilpyt which have other implementations.
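
For anyone who needs GAIL/AIRL outside of RLlib in the meantime, the imitation package mentioned above exposes them as trainers wrapped around a stable-baselines3 generator. The following is a rough sketch based on that library's quickstart, so exact module paths and keyword arguments should be checked against the installed version; expert_transitions is a placeholder for demonstrations you collect yourself (e.g. with imitation.data.rollout).

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from imitation.algorithms.adversarial.gail import GAIL  # AIRL lives alongside it
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm

venv = make_vec_env("CartPole-v1", n_envs=8)
learner = PPO("MlpPolicy", venv)                         # the "generator"
reward_net = BasicRewardNet(                             # the "discriminator"
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)

expert_transitions = ...  # placeholder: supply expert demonstrations here

trainer = GAIL(
    demonstrations=expert_transitions,
    demo_batch_size=1024,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)
trainer.train(200_000)  # total environment timesteps for the generator
```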

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

acxz commented 1 year ago

not stale

romanlee6 commented 1 year ago

@acxz @wpwpxpxp Hi! I just saw this issue and am interested in getting some of the imitation learning algorithms running in rllib as well. I was just wondering if you are still working on this project and what the progress is now. If possible, could you please point me to a repo of your current implementation so I can start taking a look and contributing? Thanks!

acxz commented 1 year ago

I'm trying to tackle MCEIRL right now; hopefully in a couple of weeks I'll have something worthwhile to share.

wpwpxpxp commented 1 year ago

Hi,

The algorithm that I worked on is AIRL (Adversarial Inverse Reinforcement Learning). It is based on ray/rllib and OpenAI Gym environments. It works on my local machine and system environment, but it may need some changes to the environment settings to make it platform-independent.

Could you tell me some of your background and how you plan to collaborate? Thanks.

-- Best regards, Pin Wang

romanlee6 commented 1 year ago

@wpwpxpxp I am a Ph.D. student in CS. I tried to apply AIRL to recover reward functions from expert trajectories in a multi-agent cooperative game. I built my environment and trained forward-RL agents based on rllib, but realized that most AIRL implementations out there are not compatible with rllib. So I was wondering if you would like to make your implementation public by any chance? I could contribute by 1) extending the code to support multi-agent settings based on the MA-AIRL paper and 2) supporting custom discriminators and generators. Please let me know what you think. You can reach me at huaoromanli@gmail.com. Thanks!

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

acxz commented 1 year ago

just pinging, as this is still a feature that is desirable in rllib

wpwpxpxp commented 1 year ago

Thanks. I developed the AIRL algorithm and was thinking of reorganizing my code to open-source it.

-- Best regards, Pin Wang

stale[bot] commented 10 months ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

wpwpxpxp commented 10 months ago

I still want to open-source it; I just got quite busy with work and didn't spend much time on it. But I'll reach out to the people who are interested in it and try to find a way to open-source it.

verobianca commented 4 months ago

Hi @wpwpxpxp, I'm currently working on a problem using rllib and AIRL. I just wanted to know whether you open-sourced your code in the end.

Thanks