opendilab / DI-engine

OpenDILab Decision AI Engine. The Most Comprehensive Reinforcement Learning Framework.
https://di-engine-docs.readthedocs.io
Apache License 2.0
2.95k stars 361 forks

`GAIL`: the algorithm performs much worse than in its original paper. #692

Closed shenpengfii closed 1 year ago

shenpengfii commented 1 year ago

Before opening this issue, I searched the documentation, the existing issues, and search engines.

My task is to reproduce the advantage of GAIL over BC on CartPole, where GAIL should perform at least at the same level as BC. Both GAIL and BC obtain their expert data from the pre-installed off-policy PPO template (the original paper uses TRPO).

However, GAIL performs much worse than BC, which also conflicts with the paper's conclusion. On my PC, GAIL takes at least 462 episodes to converge, while BC takes only 80. I have set the hyper-parameters as close to the paper's as I can, so this is very disappointing.

I then looked into the entry function serial_pipeline_gail(), only to find that it does not work the way the comment from cartpole_dqn_gail_config.py describes:

If collect_data is True, we will use this expert_model_path to collect expert data first, rather than we will load data directly from user-defined data_path.
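For reference, the reward_model fields involved look roughly like this (a minimal sketch; only the field names are taken from the pipeline code below, and all paths and values are hypothetical placeholders):

```python
# Rough sketch of the reward_model section of the GAIL config.
# Only the field names come from the pipeline snippet below;
# all paths and values here are hypothetical placeholders.
reward_model = dict(
    expert_model_path='./cartpole_ppo_offpolicy_seed0/ckpt/ckpt_best.pth.tar',  # hypothetical path
    data_path='./cartpole_ppo_gail_seed0',  # directory where expert_data.pkl is written (hypothetical)
    collect_count=10000,  # number of expert transitions to collect (hypothetical)
)
```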

But I find that, whether or not collect_data is True, the collected expert data is never used in training the GAIL model:

# Load expert data
if collect_data:
    if expert_cfg.policy.get('other', None) is not None and expert_cfg.policy.other.get('eps', None) is not None:
        expert_cfg.policy.other.eps.collect = -1
    if expert_cfg.policy.get('load_path', None) is None:
        expert_cfg.policy.load_path = cfg.reward_model.expert_model_path
    collect_demo_data(
        (expert_cfg, expert_create_cfg),
        seed,
        state_dict_path=expert_cfg.policy.load_path,
        expert_data_path=cfg.reward_model.data_path + '/expert_data.pkl',
        collect_count=cfg.reward_model.collect_count
    )

After collect_demo_data(), the expert data does not seem to be used by any subsequent part of the pipeline, which may explain the low performance of GAIL.
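To double-check that the data is at least produced, it can be loaded directly (a minimal sketch, assuming expert_data.pkl is a pickled list of transition dicts, which I have not verified, and a hypothetical data_path):

```python
import pickle

# Sanity check: load the collected expert data and inspect it.
# Assumes expert_data.pkl is a pickled list of transition dicts; the path
# below is a hypothetical example matching cfg.reward_model.data_path.
with open('./cartpole_ppo_gail_seed0/expert_data.pkl', 'rb') as f:
    expert_data = pickle.load(f)

print(type(expert_data), len(expert_data))
print(expert_data[0])
```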

I have only been using DI-engine for a few days, and I hope to figure out whether there is a bug in the GAIL implementation. It really matters since my deadline is coming soon, so I would appreciate timely help!

shenpengfii commented 1 year ago

I also looked into the implementation of BC, whose entry function serial_pipeline_bc() does use the generated expert data directly in its code. So I wonder whether the GAIL entry pipeline could be quickly reworked in the same way.
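For comparison, the core of BC is plain supervised learning on the expert (obs, action) pairs; a minimal standalone sketch of that idea (illustrative only, not DI-engine's serial_pipeline_bc()):

```python
import torch
import torch.nn as nn

# Minimal behavioral-cloning sketch for CartPole (4-dim obs, 2 discrete actions).
# Illustrative only; DI-engine's serial_pipeline_bc() wraps this idea in its own
# policy/learner abstractions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(obs_batch: torch.Tensor, act_batch: torch.Tensor) -> float:
    # obs_batch: (B, 4) float tensor; act_batch: (B,) long tensor of expert actions.
    logits = policy(obs_batch)
    loss = nn.functional.cross_entropy(logits, act_batch)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```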

shenpengfii commented 1 year ago

Another key issue is that GailRewardModel, the reward-model class of GAIL, does not even offer a parameter to tune the number of network layers. In the paper, the author describes the network as having two hidden layers of 100 units each, with tanh nonlinearities in between, but I could not find any way to set the layer count of GailRewardModel. I need this to be configurable as well.
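For reference, the network described in the paper is easy to express directly; a sketch of that architecture in PyTorch (this is the paper's architecture, not GailRewardModel's current code):

```python
import torch.nn as nn

# Discriminator as described in the GAIL paper: two hidden layers of 100 units
# each with tanh nonlinearities, mapping a concatenated (state, action) input to
# a single logit. A sketch of the paper's architecture, not DI-engine's GailRewardModel.
def build_discriminator(obs_dim: int, act_dim: int, hidden: int = 100) -> nn.Module:
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden),
        nn.Tanh(),
        nn.Linear(hidden, hidden),
        nn.Tanh(),
        nn.Linear(hidden, 1),
    )
```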

shenpengfii commented 1 year ago

To reproduce the issue:

You can simply scan through the results of my project:


Or you can go follow the steps below:

Step 1

Download and save these 4 files into a folder, e.g. PPO-GAIL:

Download and save these 2 files into another folder, e.g. PPO-BC:

Step 2

In PPO-GAIL:

  1. run cartpole_ppo_offpolicy_main.py
  2. run cartpole_ppo_gail_main.py. Then you will find the trained GAIL model in PPO-GAIL/cartpole_ppo_gail_seed0

In PPO-BC: directly run cartpole_bc_main.py; then you will find the trained BC model in PPO-BC/cartpole_bc_seed0

Final Step

Compare the two runs and check the performance problem.

PaParaZz1 commented 1 year ago

CartPole is a very simple environment, so its performance can fluctuate significantly due to randomness and hyper-parameter choices. I suggest you conduct experiments on a MuJoCo environment such as Hopper. You can refer to this doc for more results from previous experiments.

shenpengfii commented 1 year ago

@PaParaZz1 I don't have convenient access to a Linux machine to test with Hopper. But I have compared the performance of GAIL and BC on CartPole, based on both off-policy PPO and DQN.

Unlike BC imitating the off-policy PPO expert, whose reward starts at a clearly high score, GAIL imitating the same base model starts at a very low score, commonly below 1.0 out of the target 195.0, and that low score persists for a long stretch of training.

I don't think GAIL is actually imitating the off-policy PPO expert, because the training process suggests that GAIL is training itself without the expert data.

I also could not find any module or function in serial_pipeline_gail() that uses the generated expert data: the training loop just repeats obtaining new_data from a collector (which does not touch the expert data), training the reward_model on new_data instead of the expert data, and finally passing train_data_augmented to the learner, which is again unrelated to the expert data.
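For contrast, in the paper the discriminator update is supposed to consume a batch of expert transitions and a batch of policy transitions every iteration, roughly like this (a sketch of the algorithm with the common labelling convention, not the serial_pipeline_gail() code; sign conventions vary between implementations):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_update(disc: nn.Module, disc_optim: torch.optim.Optimizer,
                         expert_batch: torch.Tensor, policy_batch: torch.Tensor) -> float:
    # One GAIL discriminator step: expert (state, action) pairs are labelled 1,
    # pairs generated by the current policy are labelled 0. This is the place
    # where the expert data has to enter the training loop.
    expert_logits = disc(expert_batch)   # (B, 1)
    policy_logits = disc(policy_batch)   # (B, 1)
    loss = bce(expert_logits, torch.ones_like(expert_logits)) \
         + bce(policy_logits, torch.zeros_like(policy_logits))
    disc_optim.zero_grad()
    loss.backward()
    disc_optim.step()
    return loss.item()
```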

PaParaZz1 commented 1 year ago

You can join our Slack channel and continue to discuss this problem there. (related link)