spragunr / deep_q_rl

Theano-based implementation of Deep Q-learning
BSD 3-Clause "New" or "Revised" License

Mean Q calculations different from paper (or incorrect?) #4

Closed. alito closed this issue 9 years ago.

alito commented 9 years ago

Hi,

The code that calculates the mean Q value written to results.csv (i.e. the code starting around line 410 of rl_glue_ale_agent.py) has a couple of issues, I think. The main one is that it outputs the mean Q value across all actions instead of the mean of the max-action Q values, as the DeepMind paper does. From the look of the code, I believe the intention was to output the mean of the max-action Q values.

The other, related issue is that the mean is calculated across only 100 phis instead of across 100 batches of 32 phis, as the code seems to intend. The code asks for

self.holdout_data = self.data_set.random_batch(holdout_size * self.batch_size)[0]

and that returns something of shape (3200, 4, 80, 80), but then it iterates with:

for i in range(holdout_size):
    holdout_sum += np.mean(self.network.q_vals(self.holdout_data[i, ...]))

presumably assuming that random_batch indexed each batch separately and that q_vals takes a batch at a time, but random_batch just indexes each phi separately and q_vals only takes one example at a time. E.g. the shape of self.holdout_data[0, ...] is (4, 80, 80), and network.q_vals then returns something of shape (18,) (i.e. one value per action).

I believe the following code should work

        for i in range(holdout_size * self.batch_size):
            holdout_sum += np.max(
                self.network.q_vals(self.holdout_data[i, ...]))

        self._update_results_file(epoch, self.episode_counter,
                                  holdout_sum / (holdout_size * self.batch_size))
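
To make the difference concrete, here is a toy NumPy sketch (the Q-value vectors below are random stand-ins for what network.q_vals returns, not output from the actual network) comparing the metric the current loop computes with the one the paper plots:

import numpy as np

rng = np.random.RandomState(0)

# Random stand-ins for network.q_vals(phi): one Q-value per action, shape (18,).
fake_q_vals = [rng.normal(size=18) for _ in range(3200)]

# What the current loop effectively averages: every action's Q-value.
mean_over_all_actions = np.mean([np.mean(q) for q in fake_q_vals])

# What the paper reports: the max-action Q-value, averaged over held-out phis.
mean_of_max_action = np.mean([np.max(q) for q in fake_q_vals])

print(mean_over_all_actions, mean_of_max_action)

With random values like these the first number hovers around zero while the second is clearly positive, so the two metrics can tell very different stories.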

(Thank you very much for releasing this, by the way. I am still struggling through the Theano parts in cnn_q_learner.py. It does my head in. Do you think it'd be worth switching to the cuDNN Theano wrapper instead of cuda_convnet, now that it's been released?)

AjayTalati commented 9 years ago

Unfortunately, I couldn't get any improvement from the suggested change after 15 epochs.

The numbers in results.csv are all roughly the same as in the first epoch, and if I play the .pkl file from the 15th epoch it looks like it's got Alzheimer's.

(Just out of curiosity, I wonder what your views are on adding Monte Carlo Tree Search to select the training data? It seems to significantly improve performance, and there are a few well-documented implementations. It's interesting too:

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning)

alito commented 9 years ago

The changes don't touch the learning code. They just change what is recorded to results.csv.

(15 epochs is not enough to see much, at least for Breakout. At 60 epochs it should be pretty clear, though.)

AjayTalati commented 9 years ago

Hi, sorry! Yes, I understand that script a bit better now.

I'm trying the changes now with Pong; that should be quick. I just wonder if you have a working ROM of Othello, or a game to do quick tests with?

One last thing: is there a simple way of restarting the training of a saved network? It takes 36 hours to run 100 epochs of Breakout, and it got up to a per-episode average of around 60.

I tried to restart the training by replacing the environment and agent process start-up lines p3 and p4 in ale_run.py with

p3 = subprocess.Popen(['./rl_glue_ale_experiment.py', '--epoch_length', '50000'], env=my_env)

p4 = subprocess.Popen(['./rl_glue_ale_agent.py', "--nn_file", "/home/ajay/PythonProjects/deep_q_rl/_01-02-14-20_0p0001_0p9/network_file_100.pkl"], env=my_env)

This loads the network and restarts the training fine, but it still hasn't managed to get above the 60 level. Is this to be expected? Is it because the history of the dataset class is empty when the experiment is started again?
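
One idea, in case the empty replay history does turn out to matter: the experience data could presumably be saved and reloaded alongside the network, in the same spirit as the network .pkl files. A rough sketch, assuming the data_set object pickles cleanly (which I haven't verified):

import cPickle as pickle  # Python 2, matching the rest of the codebase


def save_replay_history(data_set, path):
    # Dump the agent's experience-replay data set next to the network .pkl file.
    with open(path, 'wb') as f:
        pickle.dump(data_set, f, -1)


def load_replay_history(path):
    # Reload the saved data set so resumed training starts with a non-empty history.
    with open(path, 'rb') as f:
        return pickle.load(f)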

Output of results.csv:

epoch   num_episodes   total_reward   reward_per_episode
1       10             439            43.9
2       10             445            44.5
3       10             459            45.9
4       9              421            46.7777777778
5       9              406            45.1111111111
6       10             420            42
7       10             400            40
8       9              462            51.3333333333
9       9              423            47
10      9              440            48.8888888889
11      10             438            43.8
12      10             396            39.6
13      9              380            42.2222222222
14      10             397            39.7
15      9              431            47.8888888889
16      8              459            57.375
17      10             418            41.8
18      11             346            31.4545454545
19      11             342            31.0909090909
20      11             401            36.4545454545
21      8              460            57.5
22      12             294            24.5
23      9              477            53

spragunr commented 9 years ago

@alito Thanks for pointing this out. I've addressed it in master by changing

self.holdout_data = self.data_set.random_batch(holdout_size * self.batch_size)[0]

to

self.holdout_data = self.data_set.random_batch(holdout_size)[0]

and increasing holdout size to 3200. I think this is a bit clearer because the batch size doesn't really have anything to do with this calculation.
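
For reference, here is a rough standalone sketch of what the resulting calculation amounts to, pieced together from the snippets quoted in this thread (the function name and signature are mine, not from the repo):

import numpy as np


def average_max_q(network, data_set, holdout_size=3200):
    # Draw holdout_size individual phis; shape (holdout_size, 4, 80, 80).
    holdout_data = data_set.random_batch(holdout_size)[0]
    holdout_sum = 0.0
    for i in range(holdout_size):
        # q_vals takes a single phi and returns one Q-value per action;
        # the max over actions is the quantity the paper averages.
        holdout_sum += np.max(network.q_vals(holdout_data[i, ...]))
    return holdout_sum / holdout_size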

As for cuDNN: that's a good idea, but it is unlikely to make it to the top of my todo list soon. For one thing, I'm still on CUDA 5.5. I would be willing to incorporate a pull request if you are interested in taking this on.

spragunr commented 9 years ago

@AjayTalati I don't remember where I found my ROM files, but if you google around you should be able to find any game you are interested in without too much difficulty.

It looks like your approach to resuming learning is correct. It may be that performance doesn't improve because the network has reached a local maximum.