About the result of Pong

lucasliunju commented 6 years ago

Dear ronsailer, I'm very sorry to trouble you. First, thanks for your contribution. I am running the code on Pong and cannot get a better result, so I want to ask whether you have run this experiment yourself. I'm looking forward to your reply.

ronsailer commented 6 years ago

Hi Lucas, I'm glad to see that someone is using this code! :)

You're very welcome. What do you mean when you say you can't get a better result? Better than what? Unfortunately, I haven't run the Pong experiment. This code came standard with Breakout and I did not try to run it on Pong. The only code that I have run Pong on is this: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr

I think that in order to get this code working on Pong you have to readjust the network hyperparameters like the input size but it sounds to me like you've already got it up and running.

Just a heads-up: With the code as-is (as of yesterday), it will not run because I'm currently in the process of translating it from Theano to PyTorch and it's broken. It worked yesterday though before I started migrating it to PyTorch.

The original code can be found here: https://github.com/jeanharb/a2oc_delib but in my experience it's outdated and can't be executed immediately after cloning it. You'll need to tinker a bit with the lasagne library and remove some stuff from there. I've written down the changes I had to make to get that code running:

Install the dependencies
Make sure that .theanorc has floatX=float32 configured
Lasagne is incompatible with the latest version of Theano. You need to manually edit it and change "from theano.tensor.signal import downsample" to "from theano.tensor.signal.pool import pool_2d" and then update the corresponding function calls, so instead of downsample.max_pool_2d you will now simply call pool_2d.

(Comment on the above: I'm no longer sure what I meant by it, or whether the file to edit was in Lasagne or Theano. I recall having to edit some import that was outdated in the Lasagne code.)
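
Editor's sketch: the import change described in the steps above could be applied with a small script along these lines. It works on source text passed in as a string, the helper name `patch_downsample_import` is made up for illustration, and it assumes the file contains exactly the import line and the `downsample.max_pool_2d(` call quoted above. Argument names of the two functions may differ between Theano versions, so the call sites should still be checked by hand.

```python
# Strings taken from the step described in the thread.
OLD_IMPORT = "from theano.tensor.signal import downsample"
NEW_IMPORT = "from theano.tensor.signal.pool import pool_2d"
OLD_CALL = "downsample.max_pool_2d("
NEW_CALL = "pool_2d("


def patch_downsample_import(source: str) -> str:
    """Return `source` with the outdated import and its calls rewritten.

    Hypothetical helper: it only does plain text replacement and does not
    adjust any arguments passed to the pooling function.
    """
    patched = source.replace(OLD_IMPORT, NEW_IMPORT)
    patched = patched.replace(OLD_CALL, NEW_CALL)
    return patched


if __name__ == "__main__":
    # Made-up two-line sample standing in for the real Lasagne source file.
    sample = (
        "from theano.tensor.signal import downsample\n"
        "pooled = downsample.max_pool_2d(x, (2, 2))\n"
    )
    print(patch_downsample_import(sample))
```

In practice one would read the affected Lasagne file, pass its contents through the helper, and write the result back after keeping a backup copy.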

Also, I'm currently "hard-coding" it to work on Gridworld (https://github.com/nadavbh12/gym-gridworld), which I've attached to this repo, but I will eventually just add it to the requirements or properly include it in the code, not the way it's currently being used. By "hard-coding" I mean things like the input size of the network, which, instead of using magic numbers, should be inferred from the environment that's being used (Pong, Breakout, Gridworld, etc.).
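
Editor's sketch: inferring the network dimensions from the environment, rather than hard-coding them, might look roughly like this. It assumes a Gym-style environment that exposes `observation_space.shape` and a discrete `action_space.n`; the function name `infer_network_dims` and the stand-in environment are invented for illustration and are not taken from the repo.

```python
from types import SimpleNamespace


def infer_network_dims(env):
    """Read the observation shape and action count from a Gym-style env.

    Assumes `env.observation_space.shape` is a tuple-like shape and that the
    action space is discrete, i.e. it exposes an integer attribute `n`.
    """
    obs_shape = tuple(env.observation_space.shape)
    num_actions = int(env.action_space.n)
    return obs_shape, num_actions


if __name__ == "__main__":
    # Stand-in object so the sketch runs without Gym installed;
    # the shape and action count below are made up.
    fake_env = SimpleNamespace(
        observation_space=SimpleNamespace(shape=(4, 84, 84)),
        action_space=SimpleNamespace(n=6),
    )
    print(infer_network_dims(fake_env))
```

The returned shape could then be used to size the first layer of the network, so the same code path serves Pong, Breakout, and Gridworld.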

Best of luck, Ron

lucasliunju commented 6 years ago

Dear ronsailer, I'm very sorry to trouble you again. I'm running your code again and I get the error "AttributeError: 'GridworldEnv' object has no attribute 'viewer'". So I want to ask whether the code is complete. I'm looking forward to your reply.

ronsailer commented 6 years ago

Hi Lucas,

This repo is discontinued. I've stopped working on this halfway through and instead implemented A2OC based off of https://github.com/ikostrikov/pytorch-a2c-ppo-acktr

Let me upload my code, I'll link it to you here.

ronsailer commented 6 years ago

@lucasliunju please see: https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr

But this one isn't ready either. The architecture works, but the only thing missing is a few lines for the termination loss so that the policy over options will learn as well (so right now it's as if options are being chosen at random). It converges and learns to play games nonetheless, just more slowly, because you need to train all options as they are being chosen at random and each one only gets 1/n of the actions to learn from, where n is the number of options.

lucasliunju commented 6 years ago

Hi Sailer, Thank you for your warm reply. I'll try to run the code you provided. I have tried to reproduce this algorithm in the past few days, but I have not got good results. As far as I know, a2oc_delib is state of the art in option discovery. Maybe I should try to run the author's original code.

ronsailer commented 6 years ago

Jean Harb's code (a2oc_delib) works after you change a few things unrelated to the algorithm itself. If I recall correctly, lasagne (a Python module) was changed and the import statements are now wrong and outdated but it doesn't take long to fix them and get it up and running. If you're having trouble feel free to ask me.

lucasliunju commented 6 years ago

Hi Sailer, Thanks! I can run the code on the CPU. When I use the GPU, I get many strange errors. I think the versions of CUDA and cuDNN may be the problem, so I want to ask which versions of CUDA and cuDNN you use. Thanks.

ronsailer commented 6 years ago

Hi Lucas,

There was a mistake on my part and it really was broken. I've been editing the code a lot on my laptop which doesn't run CUDA and I didn't expect anyone else to run it anytime soon. I've pushed a fix. Please try again now and let me know if it works. It does on my CUDA machine.

Also, please let us move the conversation over there. You can open a new issue at https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr if you want.

ronsailer commented 6 years ago

@lucasliunju Lucas, check out the code at https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr now. I believe I've fixed the termination loss and the algorithm should be complete now. It works for Gridworld. I'm now training it on Pong.

lucasliunju commented 6 years ago

Okay. Please wait for me a moment. I'm trying it.

lucasliunju commented 6 years ago

I'm very sorry. There is something wrong with my server. I'm trying to fix it. I think it's necessary to inform you.

lucasliunju commented 6 years ago

I'm very happy to tell you that I can successfully run your implementation of A2OC with CUDA: https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr. But I find that I cannot open a new issue there.

ronsailer commented 6 years ago

Glad to hear it works! I've trained a model on Breakout and it can play it.

In the paper it says that if you don't use a deliberation cost the termination probability quickly rises to 100% so that the options terminate after every step. I did not see this happen with my code. I hope I did not make a mistake.

Also, the deliberation cost should be negative, right? The algorithm in the paper adds it to the reward if there was a switch in options, and it should be a penalty. The table at the bottom of the Experiments section mentions that they tried deliberation costs between 0 and 0.03 in increments of 0.005, but it does not mention the sign.

I'm using a negative deliberation cost, -0.1 for example.
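
Editor's sketch: one way to sidestep the sign question is to pass the deliberation cost as a non-negative magnitude and subtract it whenever the option changes, which is equivalent to adding a negative cost. The function below is a hypothetical illustration of that convention only; it is not the paper's exact formulation and not the code in the repo.

```python
def reward_with_deliberation_cost(reward, switched_option, delib_cost):
    """Return the reward after penalizing an option switch.

    `delib_cost` is a non-negative magnitude (for example 0.005). It is
    subtracted from the reward only when `switched_option` is true.
    """
    if delib_cost < 0:
        raise ValueError("pass the deliberation cost as a non-negative magnitude")
    if switched_option:
        return reward - delib_cost
    return reward


if __name__ == "__main__":
    print(reward_with_deliberation_cost(1.0, switched_option=True, delib_cost=0.03))
    print(reward_with_deliberation_cost(1.0, switched_option=False, delib_cost=0.03))
```

With this convention, a command-line value such as 0.005 and a hand-written value such as -0.1 added to the reward both describe a penalty; only the place where the sign is applied differs.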

lucasliunju commented 6 years ago

Yes. From the results of option-critic and A2OC with a deliberation cost of 0, we can see that the options terminate very quickly, and I think that cannot show the ability of options.

lucasliunju commented 6 years ago

As for the setting of the deliberation cost, I think it may be related to the hyperparameter settings. I'm looking forward to your new results.

lucasliunju commented 6 years ago

Dear Sailer, I'm very sorry to trouble you again. I'm trying to compare the results with Jean Harb's code. I want to ask whether you have run Jean Harb's code successfully on a GPU. I can only run the code on a CPU. Thanks.

ronsailer commented 6 years ago

Hi Lucas, sorry for the late reply. No I did not attempt to run his code on a gpu.

lucasliunju commented 6 years ago

Hi, Ron

I'm very sorry to trouble you again. I'm trying to run the code on BreakoutNoFrameskip-v4 (https://github.com/ronsailer/pytorch-a2c-a2oc-ppo-acktr) and I find the result is not good. So I want to ask whether some parameters should be changed.

Thanks.

Lucas

ronsailer commented 6 years ago

Hi Lucas, please make sure you're running the latest version. I uploaded a fix for the termination gradient about 2 days ago. There was a mistake, and indeed the termination head did not converge. I find the results to be much better now.

Try AmidarNoFrameskip-v4 with 4 options and deliberation cost of 0.005: python main.py --env-name AmidarNoFrameskip-v4 --num-options 4 --delib 0.005

10m frames (the default) is enough to see that it has learned not to switch options except at certain times. You can really see the deliberation cost hover around 0.00 and only occasionally go up, and when it does, the agent switches. If you want to get better results, I suggest increasing the number of frames. For me, after 10m frames, the reward hovers around 100.

I suggest you add this line to act_enjoy() in model.py, after the line "rand_num = torch.rand(1)": print("option: {} termination: {:.3f} rand: {:.3f}".format(self.current_options.item(), self.terminations.item(), rand_num.item()))
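
Editor's sketch: the debug line above can be tried out on plain numbers without the model. The helper below reuses the same format string; the rule that an option terminates when the random draw falls below the termination probability is an assumption suggested by the values being printed together, not something confirmed from model.py, and the name `describe_termination` is invented.

```python
def describe_termination(option, termination_prob, rand_num):
    """Format the debug line and report whether the option would terminate.

    Assumption: termination happens when `rand_num < termination_prob`.
    """
    line = "option: {} termination: {:.3f} rand: {:.3f}".format(
        option, termination_prob, rand_num
    )
    terminates = rand_num < termination_prob
    return terminates, line


if __name__ == "__main__":
    terminates, line = describe_termination(2, 0.25, 0.1)
    print(line, "-> terminates" if terminates else "-> continues")
```

Watching these lines during play shows how often the termination value rises above the random draw for each option.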

I'm now working on adding a tracker to track things like termination probability over time and option choice over time:

Ignore the x-axis markers, this is after 10m iterations with the same configuration as above but with 8 options instead of 4. The results seem to be consistent with the paper. I'll have to try and run the other deliberation configurations as well.

ronsailer commented 6 years ago

I've started a job with BreakoutNoFrameskip, with delib=0.03 (highest in paper). These are the results so far (about 500k frames):

I told you previously to ignore the x-axis markers, they represent the number of samples I took.

lucasliunju commented 6 years ago

Hi Ron,

Thank you so much for your help! It's very helpful for me.

That's my current result.

[image: visdom_image.jpg]

I think maybe the algorithm has converged.

On Tuesday, July 24, 2018 at 12:43 PM, Ron Sailer notifications@github.com wrote:

[image: image] https://user-images.githubusercontent.com/12458566/43117387-31f0e884-8f15-11e8-87cd-e245c815c3dc.png

I've started a job with BreakoutNoFrameskip and this is the result right now for delib = 0.03 (high) and it looks like it quickly converges to 0. I told you previously to ignore the x-axis markers, they represent the number of samples I took. This is after about 250k frames

ronsailer commented 6 years ago

Lucas, I can't see the image.

lucasliunju commented 6 years ago

Sorry. That's my current result. [image: visdom_image.jpg]

Lucas

lucasliunju commented 6 years ago

Hi Ron,

Maybe I should wait a few hours.

ronsailer commented 6 years ago

Still can't see it. Email it to me at ronsailer@gmail.com

I've pasted my images straight from the clipboard, not uploaded as a file, maybe that helps.
