Closed: SunHaoOne closed this issue 3 years ago.
Yes, I have had this issue before, but unfortunately I am not sure how to fix it. It only happened for me on some machines when other jobs were running. It is possible that reducing replay_capacity to 100000 at https://github.com/kzl/decision-transformer/blob/c9e6ac0b75895cef3e7c06cd309fd398ec9ceef5/atari/create_dataset.py#L45 could help resolve the issue, but I have not tried this yet. Let me know if that works!
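For context, the linked line is where create_dataset.py constructs a Dopamine-style FixedReplayBuffer for one of the downloaded checkpoints. A rough sketch with the reduced capacity is below; the import path, surrounding variables, and the other constructor arguments are paraphrased from the Dopamine replay-buffer interface and may not match the repo exactly.

```python
import numpy as np
from fixed_replay_buffer import FixedReplayBuffer  # assumed: the repo's local wrapper around Dopamine's buffer

# Illustrative values; in create_dataset.py these come from the function arguments.
data_dir_prefix = './dqn_replay/'
game = 'Breakout'
buffer_num = 49

frb = FixedReplayBuffer(
    data_dir=data_dir_prefix + game + '/1/replay_logs',
    replay_suffix=buffer_num,        # which of the 50 checkpoint buffers to read
    observation_shape=(84, 84),
    stack_size=4,
    update_horizon=1,
    gamma=0.99,
    observation_dtype=np.uint8,
    batch_size=32,
    replay_capacity=100000)          # the reduced value suggested above
```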
Thanks for your reply. It is indeed a CPU memory problem. I changed replay_capacity=100000, then changed num_steps=100000, and also changed line 65 to if i >= 100000:. I'm not sure if I made the right changes, but the program runs fine and shows the following:
this buffer has 3207 loaded transitions and there are now 100776 transitions total divided into 46 trajectories
max rtg is 84
max timestep is 2632
epoch 1 iter 1573: train loss 0.67590. lr 5.600293e-04: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1574/1574 [03:40<00:00, 7.14it/s]
target return: 90, eval return: 13
epoch 2 iter 1573: train loss 0.57056. lr 4.503075e-04: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1574/1574 [03:51<00:00, 6.80it/s]
target return: 90, eval return: 14
epoch 3 iter 1573: train loss 0.38198. lr 3.002664e-04: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1574/1574 [03:46<00:00, 6.96it/s]
target return: 90, eval return: 37
epoch 4 iter 1573: train loss 0.36018. lr 1.501538e-04: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1574/1574 [03:47<00:00, 6.93it/s]
target return: 90, eval return: 26
epoch 5 iter 1573: train loss 0.22186. lr 6.000000e-05: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1574/1574 [03:51<00:00, 6.79it/s]
target return: 90, eval return: 22
It seems the data is not enough; the eval return stays well below the target return. Do you have any suggestions? Thanks a lot.
Could you try keeping the original num_steps=500000 but using the reduced replay_capacity=100000? (I think the replay_capacity there can be smaller than num_steps, since it refers to the number of samples loaded from each of the 50 checkpoints.)
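If that reading is right, the per-checkpoint limit and the global target relate roughly like this (a toy sanity check, not code from the repo):

```python
# Toy arithmetic: replay_capacity bounds what a single checkpoint exposes,
# while num_steps is the total collected across all available checkpoints.
num_checkpoints = 50
replay_capacity = 100_000   # per-checkpoint cap (the reduced value)
num_steps = 500_000         # total transitions to gather for training

max_available = num_checkpoints * replay_capacity   # 5,000,000
assert num_steps <= max_available, "not enough data across the checkpoints"
print(f"collecting {num_steps:,} of up to {max_available:,} available transitions")
```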
Thanks for your quick reply. I tried the original num_steps=500000 together with the reduced replay_capacity=100000, but the program is still killed. After checking with $ top, I found that KiB Swap shows 0 free. I will next try a reduced num_steps (somewhere between 100,000 and 500,000). (In the end, only changing num_steps to 100000 makes the program run; a rough memory estimate is sketched below.)
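For a rough sense of why num_steps dominates memory, here is a back-of-the-envelope estimate. It assumes each stored observation is an 84x84 uint8 frame with a stack of 4 (the standard Dopamine Atari format); real usage will be higher once actions, returns, and Python overhead are included.

```python
# Back-of-the-envelope estimate of observation memory (assumptions: 84x84
# uint8 frames, frame stack of 4; ignores actions, returns, and overhead).
bytes_per_obs = 84 * 84 * 4                     # ~28 KB per stacked observation
for num_steps in (100_000, 500_000):
    gb = num_steps * bytes_per_obs / 1e9
    print(f"num_steps={num_steps:,}: ~{gb:.1f} GB of observations")
# num_steps=100,000: ~2.8 GB   num_steps=500,000: ~14.1 GB
```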
Hm, I'm not sure then. (As a last resort, does setting --trajectories_per_buffer 10 work?) I have been thinking about saving the 1% replay dataset, which should fix this problem, but that will take some time. Sorry I don't have a better fix at the moment!
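To illustrate why that flag helps, here is a toy sketch of the idea (not the repo's actual loading loop): capping the number of trajectories read per checkpoint means each buffer contributes far fewer transitions than its full capacity.

```python
import numpy as np

def load_from_buffer(terminals, trajectories_per_buffer):
    """Read transitions sequentially and stop after N completed trajectories."""
    loaded, remaining = 0, trajectories_per_buffer
    for done in terminals:            # one terminal flag per stored transition
        loaded += 1
        if done:
            remaining -= 1
            if remaining == 0:
                break
    return loaded

# Fake checkpoint with 100k transitions where ~1% of steps end an episode.
rng = np.random.default_rng(0)
terminals = rng.random(100_000) < 0.01
print(load_from_buffer(terminals, trajectories_per_buffer=10))  # far fewer than 100,000
```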
Great, it works well! Now there are about 380 trajectories, with max rtg = 98 and max timestep = 2654. After training, I will paste the result here. Thanks again!
loading from buffer 28 which has 15858 already loaded
this buffer has 31057 loaded transitions and there are now 510358 transitions total divided into 380 trajectories
max rtg is 98
max timestep is 2654
epoch 1 iter 7972: train loss 0.81460. lr 5.598514e-04: 100%|█| 7973/7973 [19:25
target return: 90, eval return: 82
epoch 2 iter 7972: train loss 0.66133. lr 4.500607e-04: 100%|████████| 7973/7973 [19:05<00:00, 6.96it/s]
target return: 90, eval return: 55
epoch 3 iter 7972: train loss 0.50567. lr 3.000525e-04: 100%|████████| 7973/7973 [19:47<00:00, 6.71it/s]
target return: 90, eval return: 63
epoch 4 iter 7972: train loss 0.44295. lr 1.500303e-04: 100%|████████| 7973/7973 [20:56<00:00, 6.34it/s]
target return: 90, eval return: 55
epoch 5 iter 7972: train loss 0.36810. lr 6.000000e-05: 100%|████████| 7973/7973 [20:43<00:00, 6.41it/s]
target return: 90, eval return: 42
The eval return is clearly higher than before, so this setting seems to help the agent train.
Yes that looks right! I will change the default trajectories_per_buffer in the scripts in case others have similar issues.
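For reference, that default would live in the run script's argument parser. A minimal sketch follows, assuming the usual argparse setup in atari/run_dt_atari.py; the flag name comes from this thread, while the default and help text shown here are my guess at the intended change.

```python
import argparse

parser = argparse.ArgumentParser()
# Only the relevant flag is shown; run_dt_atari.py defines many other arguments.
parser.add_argument('--trajectories_per_buffer', type=int, default=10,
                    help='number of trajectories to sample from each replay checkpoint')
args = parser.parse_args()
print(args.trajectories_per_buffer)
```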
Hi, I was facing the same issue and tried the following config: replay_capacity=100000 and trajectories_per_buffer=10. The max eval return I got is 65.
this buffer has 31057 loaded transitions and there are now 510358 transitions total divided into 380 trajectories
max rtg is 98
max timestep is 2654
0%| | 0/3987 [00:00<?, ?it/s]/home/pushkalkatara/mrd/conda/envs/decision-transfor-transformer-atari/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
epoch 1 iter 3986: train loss 0.82629. lr 5.598514e-04: 100%|████████████████████████████████████████████████████████████████████████████████| 3987/3987 [31:37<00:00, 2.10it/s]
target return: 90, eval return: 25
epoch 2 iter 3986: train loss 0.69285. lr 4.500607e-04: 100%|████████████████████████████████████████████████████████████████████████████████| 3987/3987 [28:00<00:00, 2.37it/s]
target return: 90, eval return: 65
epoch 3 iter 3986: train loss 0.54726. lr 3.000525e-04: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3987/3987 [29:35<00:00, 2.25it/s]
target return: 90, eval return: 32
epoch 4 iter 3986: train loss 0.49444. lr 1.500303e-04: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3987/3987 [31:52<00:00, 2.09it/s]
target return: 90, eval return: 49
epoch 5 iter 3986: train loss 0.44959. lr 6.000000e-05: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3987/3987 [27:16<00:00, 2.44it/s]
target return: 90, eval return: 65
Hello, thanks for your code. After reading readme-atari.md, I set up the environment and downloaded the dataset, then tried to run the training command. (There is no block_size argument, so I removed --block_size 90 from it.) It then shows killed, and I guess this happens while loading the dataset, so the problem seems to be excessive memory usage. How can I fix it? Thanks a lot.