Closed: lbsswu closed this issue 1 month ago
For simplicity, the classifier training code uses the same replay buffer and data format as the RL training code. So the pickle file should contain a list of transitions, where each transition is a dictionary with keys: `observations`, `next_observations`, `actions`, `rewards`, `dones`, and `masks`. Note that the observations and actions should match your environment's observation and action space, and only `observations` is used to train the classifier. You can refer to the `record_demo.py` script here for the correct format.
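A minimal sketch of writing such a pickle file follows. The observation shape, action dimension, and filename here are illustrative assumptions; match them to your own environment's observation and action spaces:

```python
import pickle
import numpy as np

# Hypothetical shapes -- replace with your environment's actual spaces.
OBS_SHAPE = (64, 64, 3)
ACT_DIM = 4

transitions = []
for _ in range(100):
    transitions.append({
        "observations": np.zeros(OBS_SHAPE, dtype=np.uint8),
        "next_observations": np.zeros(OBS_SHAPE, dtype=np.uint8),
        "actions": np.zeros(ACT_DIM, dtype=np.float32),
        "rewards": 0.0,
        "dones": False,
        "masks": 1.0,  # 1.0 while the episode continues, 0.0 at termination
    })

# A plain list of transition dicts, pickled to disk.
with open("demo_data.pkl", "wb") as f:
    pickle.dump(transitions, f)
```

The classifier training script would then load this list and read only the `observations` field from each transition.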
Thanks for your reply!
I'm confused about the positive and negative training data.
For the positive demonstration data here (https://github.com/rail-berkeley/serl/blob/21ff8a018d77ac8ee8505cfda11c567702ee70b0/examples/async_cable_route_drq/train_reward_classifier.py#L36), do you mean it includes only the last state of each successful trajectory? And the negative demonstration data is the remaining states?
For instance, say the task is picking up a cube, and there are 10 successful trajectories (episodes), each of length 200, so we have 2000 states (observations). Furthermore, assume the task is done only at the last state (observation) of each episode, so we have 10 positive examples and 1990 negative examples. Is that right?
If you run the data collection script, it will let you set how many positives you want to collect. Usually you want to cover a diverse set of states, both successful and failed ones; e.g., maybe 200 successes and 1000 failures.
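The labeling scheme discussed above (terminal states of successful episodes as positives, everything else as negatives) can be sketched as follows. The helper name `split_for_classifier` and the one-terminal-state rule are illustrative assumptions, not the SERL codebase's actual implementation:

```python
import pickle

# Hypothetical helper: splits episodes into positive and negative examples
# for a binary reward classifier. The last n_terminal states of each
# successful episode are labeled positive; all other states are negative.
def split_for_classifier(trajectories, successes, n_terminal=1):
    positives, negatives = [], []
    for traj, success in zip(trajectories, successes):
        if success and len(traj) >= n_terminal:
            positives.extend(traj[-n_terminal:])
            negatives.extend(traj[:-n_terminal])
        else:
            negatives.extend(traj)
    return positives, negatives

# Toy example matching the question above: 10 successful episodes of
# length 200 -> 10 positives and 1990 negatives.
trajs = [[{"observations": (ep, t)} for t in range(200)] for ep in range(10)]
pos, neg = split_for_classifier(trajs, [True] * 10)

with open("classifier_success.pkl", "wb") as f:
    pickle.dump(pos, f)
with open("classifier_failure.pkl", "wb") as f:
    pickle.dump(neg, f)
```

In practice, per the advice above, you would collect a broader set of positives (e.g., several terminal states per success, or 200 positives total) rather than relying on a single terminal frame per episode.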
okay, thanks!
What is the format of the dataset for training the binary reward classifier? Can you provide some example pickle files?