Factorize the original data loader so that observations, rewards, and actions have cleaner loading processes. This effort includes: (a) the loading of stepwise placeholders and of batch data is now separated; (b) action placeholders, originally defined outside of create_and_push_data_placeholders(), are now handled inside it, so this function provides a single API for all data creation; (c) batch placeholders are always separated by policy; (d) each action placeholder for the sampler has the shape [env, agent, 1], which is consistent with the combined action placeholder of shape [env, agent, num_action_types].
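To illustrate point (d), here is a minimal NumPy sketch (the array names and sizes are hypothetical, not the library's actual placeholders) of why keeping a trailing axis of 1 makes the per-sampler placeholder rank-consistent with the combined action placeholder:

```python
import numpy as np

# Hypothetical sizes for illustration only.
num_envs, num_agents, num_action_types = 4, 3, 2

# Combined placeholder: one column per action type.
actions = np.zeros((num_envs, num_agents, num_action_types), dtype=np.int32)

# Sampler placeholder for a single action type: the trailing axis of 1
# keeps the rank identical to the combined placeholder above.
sampled_action = np.ones((num_envs, num_agents, 1), dtype=np.int32)

# Because both placeholders are rank-3, a single-type sample can be
# written into one slice of the combined placeholder without reshaping.
actions[:, :, 0:1] = sampled_action
```

With a shape of [env, agent] instead, every write into the combined placeholder would need an explicit reshape or expand step; the extra axis of 1 avoids that.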
Separate the model from the observation-batch definition (previously, the observation batch was defined inside the model), so the model can be more independent.
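The decoupling above can be sketched as follows; the class and function names here are hypothetical stand-ins, not the project's actual API:

```python
import numpy as np

class PolicyModel:
    """Hypothetical model that receives its observation batch from the
    outside, rather than allocating the batch internally."""

    def __init__(self, obs_dim: int):
        self.obs_dim = obs_dim

    def forward(self, obs_batch: np.ndarray) -> np.ndarray:
        # Stand-in computation; a real model would run a network here.
        assert obs_batch.shape[-1] == self.obs_dim
        return obs_batch.mean(axis=-1)

# The observation batch is created by the data loader, not by the model,
# so the same model can be reused against any externally defined batch.
obs_batch = np.zeros((4, 3, 8), dtype=np.float32)  # [env, agent, obs_dim]
model = PolicyModel(obs_dim=8)
out = model.forward(obs_batch)
```

The design benefit is that the model no longer owns placeholder allocation, so the data loader can manage all placeholders uniformly and the model can be tested against any batch of the right shape.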