@zmonoid dqn_demo should be removed. Thanks very much for pointing out these issues! We need to spend some time refactoring the code.
@zmonoid The CartpoleSwingupEnv environment is used to test the correctness of our actor-critic implementation, so we copied it from the rllab toolbox (https://github.com/rllab/rllab). We will replace it with gym in the future. For the Atari game part, shall we also move to gym? The current Atari game class is actually a little complicated. Perhaps we can split the replay memory part out of the game itself (a rough sketch of that idea follows).
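Only to make the idea concrete, a decoupled replay memory could look roughly like the sketch below; the class and method names are hypothetical, not current Arena code:

import numpy as np

class ReplayMemory(object):
    """Hypothetical standalone replay memory, decoupled from the game class."""
    def __init__(self, capacity, history_length, frame_shape=(84, 84)):
        self.capacity = capacity
        self.history_length = history_length
        self.frames = np.zeros((capacity,) + frame_shape, dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.terminals = np.zeros(capacity, dtype=np.bool_)
        self.top = 0
        self.size = 0

    def append(self, frame, action, reward, terminal):
        # Overwrite the oldest slot once the buffer is full.
        self.frames[self.top] = frame
        self.actions[self.top] = action
        self.rewards[self.top] = reward
        self.terminals[self.top] = terminal
        self.top = (self.top + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Episode-boundary and wrap-around handling omitted for brevity.
        idx = np.random.randint(self.history_length, self.size, size=batch_size)
        states = np.stack([self.frames[i - self.history_length:i] for i in idx])
        next_states = np.stack([self.frames[i - self.history_length + 1:i + 1] for i in idx])
        return states, self.actions[idx], self.rewards[idx], next_states, self.terminals[idx]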
@sxjscience @flyers Thanks very much for your response.
I'd like to contribute fixes for the issues I proposed above (and more), but to avoid duplicating your work, please let me know what you are already working on, or assign me a task.
I'd like to revise the asynchronous one-step Q-learning to replicate the results in the paper.
@zmonoid Great! I'll next work on a new API for base and replay memory as well as implementing the natural policy gradient.
@sxjscience
I encountered some difficulty when implementing asynchronous Q-learning.
It seems that when I use multiple threads to update the same network, I get an error.
The error message is:
[2016-07-28 16:11:20,933] Making new env: Breakout-v0
[2016-07-28 16:11:20,954] Making new env: Breakout-v0
Thread 0 - Final epsilon: 0.5
Thread 1 - Final epsilon: 0.5
[16:11:27] /home/bzhou/mxnet/dmlc-core/include/dmlc/logging.h:235: [16:11:27] src/ndarray/ndarray.cc:227: Check failed: from.shape() == to->shape() operands shape mismatch
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/bzhou/anaconda/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/bzhou/anaconda/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "dqn_async_demo.py", line 180, in actor_learner_thread
dqn_reward=targets)
File "/home/bzhou/svn/Arena/arena/base.py", line 204, in forward
self.exe.arg_dict[k][:] = v
File "/home/bzhou/anaconda/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/ndarray.py", line 220, in __setitem__
value.copyto(self)
File "/home/bzhou/anaconda/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/ndarray.py", line 458, in copyto
return NDArray._copyto(self, out=other)
File "/home/bzhou/anaconda/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/ndarray.py", line 1133, in unary_ndarray_function
c_array(ctypes.c_char_p, [str(i).encode('ascii') for i in kwargs.values()])))
File "/home/bzhou/anaconda/lib/python2.7/site-packages/mxnet-0.7.0-py2.7.egg/mxnet/base.py", line 77, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [16:11:27] src/ndarray/ndarray.cc:227: Check failed: from.shape() == to->shape() operands shape mismatch
('| Thread 01', '| Step', 295, '| Reward: 02', ' Qmax: 0.3533', ' Epsilon: 0.99963', ' Epsilon progress: 0.000737')
('| Thread 01', '| Step', 460, '| Reward: 00', ' Qmax: 0.3533', ' Epsilon: 0.99943', ' Epsilon progress: 0.001150')
('| Thread 01', '| Step', 629, '| Reward: 00', ' Qmax: 0.3533', ' Epsilon: 0.99921', ' Epsilon progress: 0.001572')
You may refer to my code here: https://github.com/zmonoid/Arena/blob/master/dqn_async_demo.py
This only happens when multiple threads update the network; setting num_thread to zero or setting is_train to false does not trigger the error.
@zmonoid It's strange. It should have passed the assertion here https://github.com/peterzcc/Arena/blob/master/arena/base.py#L201-L203 .
@sxjscience Indeed it is strange. I thought it was caused by conflicting multi-threaded operations, so I used a threading.Lock to block other threads around the following code:
lock.acquire()
outputs = qnet.forward(is_train=True,
                       data=states,
                       dqn_action=actions,
                       dqn_reward=targets)
qnet.backward()
qnet.update(updater=updater)
lock.release()
Well, this time, it does not pass the assertion any more:
bzhou@bzhou-Desktop ~/svn/Arena master ● python dqn_async_demo.py
[2016-07-29 15:15:08,093] Making new env: Breakout-v0
[2016-07-29 15:15:08,113] Making new env: Breakout-v0
Thread 0 - Final epsilon: 0.1
Thread 1 - Final epsilon: 0.5
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/bzhou/anaconda/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/bzhou/anaconda/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "dqn_async_demo.py", line 184, in actor_learner_thread
dqn_reward=targets)
File "/home/bzhou/svn/Arena/arena/base.py", line 203, in forward
%(k, str(self.exe.arg_dict[k].shape), str(v.shape))
AssertionError: Shape not match: key data, need (1L, 4L, 84L, 84L), received (5L, 4L, 84L, 84L)
But the error information is confusing, since I define qnet like this:
action_repeat = 4
n_threads = 2
history_length = 4
I_AsyncUpdate = 5
I_target = 40000
data_shapes = {'data': (I_AsyncUpdate, history_length) + (84, 84),
               'dqn_action': (I_AsyncUpdate,), 'dqn_reward': (I_AsyncUpdate,)}
optimizer = mx.optimizer.create(name='adagrad', learning_rate=0.01, eps=0.01,
                                clip_gradient=None,
                                rescale_grad=1.0, wd=0)
updater = mx.optimizer.get_updater(optimizer)
# Set up game environments (one per thread)
num_actions = envs[0].action_space.n
dqn_sym = dqn_sym_nature(num_actions)
qnet = Base(data_shapes=data_shapes, sym_gen=dqn_sym, name='QNet',
            initializer=DQNInitializer(factor_type="in"),
            ctx=ctx)
By definition, QNet requires data of shape (5, 4, 84, 84).
Is it because, during the forward pass used to get the action index from qnet, I pass in data of shape (1, 4, 84, 84), which then changes the input data shape of QNet?
@sxjscience @flyers Problem solved. The forward pass used for action selection needs to be locked as well; the error was caused by different batch sizes being fed into QNet concurrently. Now I am training with 8 threads; let's see whether it converges. >.<
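For anyone hitting the same thing, the working pattern is roughly the sketch below: the single-sample action-selection forward and the training forward/backward/update go through the same lock, so two threads can never switch the executor's input shape under each other. Variable names follow dqn_async_demo.py, and the exact return format of Base.forward is an assumption here:

# Action selection: batch size 1, also taken under the lock.
with lock:
    q_values = qnet.forward(is_train=False,
                            data=current_state.reshape((1, history_length, 84, 84)))[0].asnumpy()
action = np.random.randint(num_actions) if np.random.rand() < epsilon else int(q_values.argmax())

# ... collect I_AsyncUpdate transitions, then ...

# Training update: batch size I_AsyncUpdate, under the same lock.
with lock:
    outputs = qnet.forward(is_train=True,
                           data=states,
                           dqn_action=actions,
                           dqn_reward=targets)
    qnet.backward()
    qnet.update(updater=updater)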
@zmonoid Great! The logic of arena.Base is to store the different data shape combinations in a dictionary and fetch the corresponding executor. Before we compute the forward and backward passes, we need to call switch_bucket to fetch (or create) an executor: https://github.com/peterzcc/Arena/blob/master/arena/base.py#L195-L196
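In simplified form the pattern is something like the sketch below; the names are hypothetical and parameter sharing between executors is omitted, so this is not the actual arena.Base code:

import mxnet as mx

class ShapeBucketedNet(object):
    """Sketch: keep one bound executor per input-shape combination and reuse it."""
    def __init__(self, sym, ctx):
        self.sym = sym
        self.ctx = ctx
        self.exe_pool = {}                      # maps sorted (name, shape) pairs -> executor

    def switch_bucket(self, **data_shapes):
        key = tuple(sorted(data_shapes.items()))
        if key not in self.exe_pool:
            # First time this shape combination is seen: bind a new executor.
            # (The real class also makes the new executor share its parameter
            # arrays with the existing ones; that part is left out here.)
            self.exe_pool[key] = self.sym.simple_bind(ctx=self.ctx, grad_req='write',
                                                      **data_shapes)
        return self.exe_pool[key]

Usage would then be switch_bucket(data=(1, 4, 84, 84)) before the action-selection forward and switch_bucket(data=(5, 4, 84, 84), dqn_action=(5,), dqn_reward=(5,)) before the training pass.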
@flyers @peterzcc @zmonoid I'm trying to revise the base class to enable more control of the executors. The new API will force the users to create/fetch executors by themselves when necessary.
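Just to illustrate what that could look like from the caller's side (purely hypothetical method names, not a committed API):

# Hypothetical sketch of the proposed explicit-executor workflow (method names made up).
train_exe = qnet.get_executor(data_shapes={'data': (5, 4, 84, 84),
                                           'dqn_action': (5,), 'dqn_reward': (5,)})
act_exe = qnet.get_executor(data_shapes={'data': (1, 4, 84, 84)})
# forward/backward would then run on an executor the caller chose explicitly,
# rather than Base picking a bucket implicitly.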
First, I tried dqn_dist_demo.py on my PC and it runs smoothly. However, when I try it on my server, it reports the following error:
It seems to be a SWIG version problem; according to the error message, swig needs to be upgraded to swig3.0+:
However, upgrading did not fix my problem.
I would suggest moving the game environments to openai/gym, which is much easier to install and use, and provides a uniform interface for various games, even the racing game TORCS: https://github.com/ugo-nama-kun/gym_torcs
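The gym interface is also minimal; a basic interaction loop looks roughly like this (random actions, just to show the API):

import gym

env = gym.make('Breakout-v0')
observation = env.reset()
for _ in range(1000):
    action = env.action_space.sample()          # random action, placeholder for the agent
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()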
Second, notice this segment of code in dqn_async.py:
It seems the program only runs asynchronously for these two lines:
which does not fit the idea of asynchronous training.
We may refer to this code to rewrite it: https://github.com/tflearn/tflearn/blob/master/examples/reinforcement_learning/atari_1step_qlearning.py
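The per-thread structure should look roughly like the sketch below, where the whole act/compute-target/update loop runs inside each thread rather than only the final forward/backward. This is a structural sketch only: preprocess (frame preprocessing and history stacking), the exact return value of Base.forward, and the target-network synchronization (omitted) are assumptions, not our current code:

import numpy as np

def actor_learner_thread(thread_id, env, qnet, target_qnet, updater, lock,
                         gamma=0.99, epsilon=0.1, i_async_update=5, t_max=1000000):
    state = preprocess(env.reset(), None)       # placeholder for frame preprocessing / history stacking
    batch_s, batch_a, batch_t = [], [], []
    for t in range(1, t_max + 1):
        # 1. epsilon-greedy action from the shared online network
        with lock:
            q = qnet.forward(is_train=False, data=state[np.newaxis])[0].asnumpy()
        action = np.random.randint(q.shape[1]) if np.random.rand() < epsilon else int(q.argmax())

        # 2. act in this thread's own environment
        frame, reward, done, _ = env.step(action)
        next_state = preprocess(frame, state)

        # 3. one-step target from the (periodically synced) target network
        with lock:
            next_q = target_qnet.forward(is_train=False, data=next_state[np.newaxis])[0].asnumpy()
        batch_s.append(state)
        batch_a.append(action)
        batch_t.append(reward if done else reward + gamma * float(next_q.max()))

        # 4. every i_async_update steps, push an update to the shared parameters
        if len(batch_s) == i_async_update:
            with lock:
                qnet.forward(is_train=True, data=np.asarray(batch_s),
                             dqn_action=np.asarray(batch_a), dqn_reward=np.asarray(batch_t))
                qnet.backward()
                qnet.update(updater=updater)
            batch_s, batch_a, batch_t = [], [], []

        state = preprocess(env.reset(), None) if done else next_state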
Third, currently we need to redirect the log to a file manually, e.g.:
python dqn_dist_demo.py >> log.txt
Consider saving the log file into dir_path automatically (see the sketch below).
Fourth, dqn_demo.py seems to be a duplicate of dqn_dist_demo.py with kv_type=None. Consider removing it, or is it intended for other purposes?
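On the third point, a minimal sketch of what automatic log saving could look like, assuming dir_path is the run's output directory (this helper is hypothetical, not existing Arena code):

import logging
import os

def setup_logging(dir_path):
    # Mirror everything sent through the logging module to dir_path/train.log,
    # in addition to the console, so no manual shell redirection is needed.
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    handler = logging.FileHandler(os.path.join(dir_path, 'train.log'))
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)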
I will try to do it first. Best, zmonoid