rraileanu / policy-dynamics-value-functions


Zero reward in Reinforcement Learning Phase #1

Closed. c4cld closed this issue 3 years ago.

c4cld commented 3 years ago

When I run

```
python ppo/ppo_main.py --env-name spaceship-v0 --default-ind 0 --seed 0
```

the terminal output is as follows: (screenshot of terminal output showing zero reward)

I think something is wrong; could you give me some hints for solving the problem?

rraileanu commented 3 years ago

To reproduce the paper's results, you have to run the RL phase for 3e6 steps, and it looks like you've only run it for about 1.3e5 steps.

It can also happen that some seeds lead to zero reward even after training for millions of steps, but on average over 5-10 seeds, you should see rewards > 0.

Let me know if you are still having trouble reproducing the results.

c4cld commented 3 years ago

@rraileanu Thank you very much. I've seen rewards > 0 after changing the seed.

But another error occurs when I run:

```
python train_dynamics_embedding.py \
    --env-name spaceship-v0 \
    --dynamics-embedding-dim 8 --dynamics-batch-size 8 \
    --inf-num-steps 1 --num-dec-traj 10 \
    --save-dir-dynamics-embedding ./models/dynamics-embeddings
```

The error message is:

```
FileNotFoundError: [Errno 2] No such file or directory: './models/ppo-policies/ppo.spaceship-v0.env0.seed2.pt'
```

I checked args.num_envs and args.num_seeds; they are 50 and 5, respectively. For args.num_seeds, I can meet the requirement by just running ppo/ppo_main.py with 5 different seeds. But args.num_envs confuses me: what should I do to satisfy it? What does env mean in your code? Should I change something in ppo/ppo_main.py to get 50 envs?

c4cld commented 3 years ago

Another question: does the dynamics embedding mean an embedding of the environment's dynamics? If so, I believe the embedding should not be time-varying. Is my understanding right?

rraileanu commented 3 years ago

Yes, the dynamics embedding should capture the environment's dynamics and be time invariant.

It looks like you are saving the models in a different directory, or with a different name, than the code expects. You can probably just change the filename from which it loads the policies to match yours.
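For reference, here is a minimal sketch of the checkpoint layout the loader appears to expect, inferred from the naming pattern in the FileNotFoundError above (the variable names and counts here are illustrative, not the repo's actual code):

```python
# Enumerate the PPO checkpoints that train_dynamics_embedding.py will try to
# load, following the naming pattern in the FileNotFoundError above.
import os

env_name = "spaceship-v0"
num_envs = 5   # hypothetical values; use your own args.num_envs / args.num_seeds
num_seeds = 5

for env_ind in range(num_envs):
    for seed in range(num_seeds):
        path = f"./models/ppo-policies/ppo.{env_name}.env{env_ind}.seed{seed}.pt"
        if not os.path.exists(path):
            print(f"missing checkpoint: {path}")
```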

You need to change both the --seed and the --default-ind arguments in order to get both multiple policies trained on the same environment (i.e. same default-ind) as well as get policies trained on multiple environments (where the dynamics are determined by default-ind). As noted in the README, you should use --seed from 0 to 4 (or any other 5 integers) and --default-ind from 0 to 19.

I hope this clarifies things but let me know if you have more questions.

c4cld commented 3 years ago

@rraileanu Thank you for your guidance. I've followed your instructions and the code is running now.

But I have another question about this paper.

You mention that a feed-forward network with parameters Ψ is first used to obtain a lower-triangular matrix L(s0, zd; Ψ). However, I do not know how to implement this; in other words, I do not know how to use a neural network to produce a matrix. In computer vision tasks, neural networks output a label (maybe a one-hot vector or a softer distribution). In decision-making tasks, neural networks output a distribution over actions (a vector whose elements sum to 1).

In sum, I understand that neural networks output a vector, but I don't know how to use one to obtain a matrix. Could you give me some hints?

rraileanu commented 3 years ago

You can find the implementation of that network here: https://github.com/rraileanu/policy-dynamics-value-functions/blob/master/pdvf_networks.py#L31.

The output of the network is a vector with dimension M^2 for a matrix of dimensions MxM. The vector is then reshaped into a matrix.
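For illustration, here is a minimal PyTorch sketch of this idea (not the repo's exact network; the class name, layer sizes, and dimensions are made up): the final linear layer outputs M^2 values, which are reshaped to M x M and masked to be lower-triangular.

```python
import torch
import torch.nn as nn

class LowerTriangularHead(nn.Module):
    """Maps an input vector to an M x M lower-triangular matrix."""

    def __init__(self, input_dim: int, m: int):
        super().__init__()
        self.m = m
        # One linear layer producing M*M outputs; the real network would
        # typically have hidden layers before this point.
        self.fc = nn.Linear(input_dim, m * m)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = self.fc(x)                    # shape (batch, M*M)
        mat = flat.view(-1, self.m, self.m)  # reshape to (batch, M, M)
        return torch.tril(mat)               # zero out entries above the diagonal

# Example: map a 16-dim input (e.g., a state concatenated with a dynamics
# embedding) to a batch of 8x8 lower-triangular matrices.
head = LowerTriangularHead(input_dim=16, m=8)
L = head(torch.randn(4, 16))
print(L.shape)  # torch.Size([4, 8, 8])
```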

c4cld commented 3 years ago

@rraileanu Thank you for your help. I've understood how the matrix is generated. Furthermore, I've run the following code:

A Reinforcement Learning Phase

```
for seed in 0 1 2 3 4
do
    for defaultind in 0 1 2
    do
        python ppo/ppo_main.py --env-name spaceship-v0 --seed $seed --default-ind $defaultind
    done
done
```

Because I have to prepare a report as soon as possible, I shrank the range of default-ind to [0, 2].

And I got the following files: (screenshot of the generated files)

B Self-Supervised Learning Phase

B.1 Dynamics Embedding

```
python train_dynamics_embedding.py \
    --env-name spaceship-v0 \
    --dynamics-embedding-dim 8 --dynamics-batch-size 8 \
    --inf-num-steps 1 --num-dec-traj 10 \
    --save-dir-dynamics-embedding ./models/dynamics-embeddings
```

I modified args.num_envs from 20 to 3 in order to run the code, and I got the following files: (screenshot of the generated files)

B.2 Policy Embedding

```
python train_policy_embedding.py \
    --env-name spaceship-v0 --num-dec-traj 1 \
    --save-dir-policy-embedding ./models/policy-embeddings
```

I modified args.num_envs from 20 to 3 in order to run the code, and I got the following files: (screenshot of the generated files)

C Supervised Learning Phase

```
python train_pdvf.py \
    --env-name spaceship-v0 \
    --dynamics-batch-size 8 --policy-batch-size 2048 \
    --dynamics-embedding-dim 8 --policy-embedding-dim 8 \
    --num-dec-traj 10 --inf-num-steps 1 --log-interval 10 \
    --save-dir-dynamics-embedding ./models/dynamics-embeddings \
    --save-dir-policy-embedding ./models/policy-embeddings \
    --save-dir-pdvf ./models/pdvf-models
```

I modified args.num_envs from 20 to 3 in order to run the code, but I did not get the PDVF model; the directory ./models/pdvf-models was not even created. The terminal output is as follows: (screenshot of terminal output)

The folder hierarchy is as follows: (screenshot of the folder hierarchy)

Because of this, I cannot run the code for the Evaluation Phase. Could you give me any advice for solving the problem?

rraileanu commented 3 years ago

Hi. In order to debug this, you should check that the code enters the if statement where the model saving is done: https://github.com/rraileanu/policy-dynamics-value-functions/blob/master/train_pdvf.py#L268. Make sure the model is actually being saved, and print the path where it is saved.
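For instance, a minimal debugging sketch (`save_path` and `model` are placeholder names standing in for whatever train_pdvf.py actually uses):

```python
# Illustrative additions inside the save branch of train_pdvf.py;
# `save_path` and `model` are placeholder names, not the script's own.
import os
import torch

print(f"Entered save branch; writing PDVF model to {os.path.abspath(save_path)}")
os.makedirs(os.path.dirname(save_path), exist_ok=True)  # create the folder if missing
torch.save(model.state_dict(), save_path)
```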

Also note that the folder for the model is ../gvf_logs/pdvf-policies/, as shown here https://github.com/rraileanu/policy-dynamics-value-functions/blob/master/train_pdvf.py#L268, so I think you are not looking in the right folder.

You can also set the flag --save-dir-pdvf ./models/pdvf-policies/ when training the PDVF model, so that it saves the PDVF models under the models directory.

c4cld commented 3 years ago

@rraileanu Thank you very much. I'll try it as soon as possible.