Factorize the original data loader so that observations, rewards, and actions have cleaner loading processes. This effort includes: (a) the loading of stepwise placeholders and of batch data is now separated; (b) action placeholders, originally defined outside of create_and_push_data_placeholders(), are now handled inside it, so this function provides a single API for all data creation; (c) batch placeholders are always separated by policy; (d) each action placeholder for the sampler has the shape [env, agent, 1], which is consistent with the combined action placeholder of shape [env, agent, num_action_types].
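To illustrate point (d), here is a minimal NumPy sketch (the array names and sizes are hypothetical, not the library's actual placeholders) of why keeping a trailing axis of 1 makes the per-sampler placeholder rank-consistent with the combined action placeholder:

```python
import numpy as np

# Hypothetical sizes for illustration only.
num_envs, num_agents, num_action_types = 4, 3, 2

# Combined placeholder: one column per action type.
actions = np.zeros((num_envs, num_agents, num_action_types), dtype=np.int32)

# Sampler placeholder for a single action type: the trailing axis of 1
# keeps the rank identical to the combined placeholder above.
sampled_action = np.ones((num_envs, num_agents, 1), dtype=np.int32)

# Because both placeholders are rank-3, a single-type sample can be
# written into one slice of the combined placeholder without reshaping.
actions[:, :, 0:1] = sampled_action
```

With a shape of [env, agent] instead, every write into the combined placeholder would need an explicit reshape or expand step; the extra axis of 1 avoids that.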
Separate the model from the observation-batch definition (previously, the observation batch was defined inside the model), so the model can be more independent.
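The decoupling above can be sketched as follows; the class and function names here are hypothetical stand-ins, not the project's actual API:

```python
import numpy as np

class PolicyModel:
    """Hypothetical model that receives its observation batch from the
    outside, rather than allocating the batch internally."""

    def __init__(self, obs_dim: int):
        self.obs_dim = obs_dim

    def forward(self, obs_batch: np.ndarray) -> np.ndarray:
        # Stand-in computation; a real model would run a network here.
        assert obs_batch.shape[-1] == self.obs_dim
        return obs_batch.mean(axis=-1)

# The observation batch is created by the data loader, not by the model,
# so the same model can be reused against any externally defined batch.
obs_batch = np.zeros((4, 3, 8), dtype=np.float32)  # [env, agent, obs_dim]
model = PolicyModel(obs_dim=8)
out = model.forward(obs_batch)
```

The design benefit is that the model no longer owns placeholder allocation, so the data loader can manage all placeholders uniformly and the model can be tested against any batch of the right shape.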