kengz / SLM-Lab

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".
https://slm-lab.gitbook.io/slm-lab/
MIT License

Training block env sampling #421

Closed · lidongke closed this issue 4 years ago

lidongke commented 4 years ago

Hi~

I'm focused on sampling efficiency. If I train a DQN agent, the network updates every `training_frequency` timesteps. But while the network is updating, it blocks env sampling. How can I solve this problem? @kengz

kengz commented 4 years ago

Hi @lidongke, that's the nature of the control loop: data sampling and training are intertwined that way. There are a few ways to increase the sampling speed:

That said, we do not plan on supporting these features, but feel free to reach out if you have further questions about them!

lidongke commented 4 years ago

Asynchronous sampling is a good idea! I'm considering spawning a subprocess that runs the "train()" function and uses the "global_net" as the master agent: it would only consume sampled batches to optimize the "global_net", and periodically send the network parameters back to the local agent through shared memory. What do you think of this plan? @kengz
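Roughly, I mean something like this (just a minimal sketch, not SLM-Lab code; the names `make_net`, `worker`, `learner`, `sync_every` and the placeholder loss are made up for illustration):

```python
# Minimal Hogwild-style sketch of the plan above (not SLM-Lab code): a learner
# process optimizes a shared global_net from sampled batches, while workers act
# with local copies that periodically pull the shared parameters, so training
# never blocks env sampling.
import torch
import torch.nn as nn
import torch.multiprocessing as mp


def make_net():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))


def worker(global_net, sync_every=100):
    local_net = make_net()
    local_net.load_state_dict(global_net.state_dict())  # initial sync
    for step in range(1, 1001):
        # ... act with local_net, push transitions to a shared queue/replay ...
        if step % sync_every == 0:
            # periodically pull the latest shared parameters
            local_net.load_state_dict(global_net.state_dict())


def learner(global_net):
    optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)
    for _ in range(1000):
        # ... sample a batch from the shared queue/replay and compute the loss;
        # a placeholder loss stands in here so the sketch runs ...
        loss = sum(p.pow(2).sum() for p in global_net.parameters())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # in-place update of the shared parameters


if __name__ == '__main__':
    global_net = make_net()
    global_net.share_memory()  # parameters live in shared memory across processes
    procs = [mp.Process(target=learner, args=(global_net,))]
    procs += [mp.Process(target=worker, args=(global_net,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```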

kengz commented 4 years ago

Oh that's right! You can just use asynchronous parallelization, a.k.a. Hogwild. Just set the spec variable meta.distributed to shared, set max_session to the number of workers, and run. It will then do what you described above.
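For example, the relevant part of the spec would look roughly like this (shown as the Python dict the spec file loads into; only `distributed` and `max_session` matter here, the surrounding keys are illustrative):

```python
# Sketch of the relevant `meta` block of a spec (illustrative values)
spec = {
    # ... agent, env, body specs ...
    "meta": {
        "distributed": "shared",  # Hogwild-style asynchronous parallelization
        "max_session": 4,         # number of parallel worker sessions
        "max_trial": 1,
    },
}
```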

lidongke commented 4 years ago

If I use a venv for batched actions and obs, I tested the "shared" and "synced" modes in your source code and they still do serial sampling and training. Am I right?

1. In your comment:

   - 'synced': global network parameter is periodically synced to local network after each gradient push. In this mode, algorithm will keep a separate reference to `global_{net}` for each of its network

   I take this to mean that each worker uses its local network for action selection, while the master network updates asynchronously (never blocking worker sampling) and is periodically synced back to the local network. But in control.py:

```python
def run_rl(self):
        '''Run the main RL loop until clock.max_frame'''
        logger.info(f'Running RL loop for trial {self.spec["meta"]["trial"]} session {self.index}')
        clock = self.env.clock
        state = self.env.reset()
        done = False
        while True:
            if util.epi_done(done):  # before starting another episode
                self.try_ckpt(self.agent, self.env)
                if clock.get() < clock.max_frame:  # reset and continue
                    clock.tick('epi')
                    state = self.env.reset()
                    done = False
            self.try_ckpt(self.agent, self.env)
            if clock.get() >= clock.max_frame:  # finish
                break
            clock.tick('t')
            with torch.no_grad():
                action = self.agent.act(state)
            next_state, reward, done, info = self.env.step(action)
            self.agent.update(state, action, reward, next_state, done)
            state = next_state
```

   With a venv, the "act()" function returns batched actions for the venv, which means action selection and optimization happen in one process, so they are serial. I can't reconcile the "synced" mode with your comment. Should I change your code to match my understanding?

2. I want to raise sampling efficiency. The spec variable max_session can run parallel sessions, but my problem is that each session is slow: when training starts, the CPU resources aren't fully used, and I'd like them to be fully utilized. @kengz
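For reference, my mental model of the throughput knobs, written as the Python dict a spec would load into (per-session batching from the vector env, plus parallel sessions from meta; exact key names like `num_envs` are from memory and may not match the current version):

```python
# Illustrative only: key names and values may differ from the real spec format.
spec_sketch = {
    "env": [{
        "name": "CartPole-v0",   # placeholder env
        "num_envs": 8,           # venv: 8 envs stepped as a batch within one session
        "max_frame": 100000,
    }],
    "meta": {
        "distributed": "synced",  # or "shared" for Hogwild
        "max_session": 4,         # 4 sessions run as parallel processes
    },
}
```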